Estimating periodic signals#

Section author: Gavin Huttley, Julien Epps, Hua Ying

We consider two different scenarios:

  • estimating the periods in a signal

  • estimating the power for a given period

  • measuring statistical significance for the latter case

Estimating the periods in a signal#

For numerical (continuous) data#

We first make some sample data. A periodic signal and some noise.

import numpy

t = numpy.arange(0, 10, 0.1)
n = numpy.random.randn(len(t))
nse = numpy.convolve(n, numpy.exp(-t / 0.05)) * 0.1
nse = nse[: len(t)]
sig = numpy.sin(2 * numpy.pi * t) + nse

Discrete Fourier transform#

We now use the discrete Fourier transform to estimate periodicity in this signal. Given we set the period to equal 10, we expect the maximum power for that index.

from cogent3.maths.period import dft

pwr, period = dft(sig)
print(period)
print(pwr)
[  2.           2.04081633   2.08333333   2.12765957   2.17391304
   2.22222222   2.27272727   2.3255814    2.38095238   2.43902439
   2.5          2.56410256   2.63157895   2.7027027    2.77777778
   2.85714286   2.94117647   3.03030303   3.125        3.22580645
   3.33333333   3.44827586   3.57142857   3.7037037    3.84615385
   4.           4.16666667   4.34782609   4.54545455   4.76190476
   5.           5.26315789   5.55555556   5.88235294   6.25
   6.66666667   7.14285714   7.69230769   8.33333333   9.09090909
  10.          11.11111111  12.5         14.28571429  16.66666667
  20.          25.          33.33333333  50.         100.        ]
[ 1.06015801+0.00000000e+00j  0.74686707-1.93971914e-02j
  0.36784793-2.66370366e-02j  0.04384413+2.86970840e-02j
  1.54473269-2.43777386e-02j  0.28522968-2.33602932e-01j
 -0.09908167-8.09083592e-01j -0.17082184-3.71376271e-02j
 -0.89062721+6.28548281e-01j  0.7963791 -3.10035336e-01j
 -0.01840271+7.80492960e-01j  0.88651578-2.41151331e-01j
 -0.19530165-2.51775974e-01j -1.15022781+9.86446405e-01j
  1.08988438-4.80200034e-01j -0.43529315-2.96947078e-01j
 -0.02022101+5.21803410e-01j -0.85047145-1.15731607e+00j
 -1.43688846+1.25736906e-01j -0.24122185+2.99636684e-02j
  1.07237075+9.52126090e-02j -0.82213276+4.00931516e-01j
  0.13867235-2.81517725e-01j -0.49837198-2.77574035e-01j
 -0.46785187-6.22595745e-01j  1.11562242+9.49254299e-01j
  1.1923063 -1.71044818e-01j  1.13909144+4.13690934e-01j
  0.25347752+6.74139986e-02j -0.31153602+9.31846596e-02j
 -0.40472419-6.65748222e-01j -0.20432603-3.45226487e-01j
  0.7243685 -4.04775797e-01j  1.1815508 +4.80212265e-01j
  1.28830449-1.72824059e-01j -0.06062707+4.52301510e-01j
  0.02841505-4.00362778e-01j  0.54158937+1.05374577e+00j
  0.79192147+3.74568790e-01j  1.37803411+4.15026155e-01j
 -0.9707734 -5.07593113e+01j -0.00458776+1.01619914e+00j
  0.72674876+6.73631881e-01j -0.03109358+8.27252853e-01j
 -0.07118569-9.66787987e-01j  0.3787156 +8.78325775e-01j
  0.74358583+9.70454093e-01j  0.09857348+7.64330816e-01j
 -0.58344276-4.46465366e-01j -0.46042799+1.39206718e-01j]

The power (pwr) is returned as an array of complex numbers, so we convert into real numbers using abs. We then zip the power and corresponding periods and sort to identify the period with maximum signal.

pwr = abs(pwr)
max_pwr, max_period = sorted(zip(pwr, period))[-1]
print(max_pwr, max_period)
50.768593471914606 10.0

Auto-correlation#

We now use auto-correlation.

from cogent3.maths.period import auto_corr

pwr, period = auto_corr(sig)
print(period)
print(pwr)
[ 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
 98 99]
[ 1.63366075e+01 -1.47309007e+01 -3.99310414e+01 -4.94779387e+01
 -4.02111589e+01 -1.59402206e+01  1.38935958e+01  3.78082033e+01
  4.67917301e+01  3.80558458e+01  1.50298023e+01 -1.31498874e+01
 -3.58750562e+01 -4.44274029e+01 -3.59442269e+01 -1.42494835e+01
  1.23571124e+01  3.37998464e+01  4.19246214e+01  3.40423905e+01
  1.37305496e+01 -1.13403096e+01 -3.11639561e+01 -3.90121298e+01
 -3.19165151e+01 -1.29071920e+01  1.03347404e+01  2.88866727e+01
  3.61923373e+01  2.95836846e+01  1.19272046e+01 -9.60842692e+00
 -2.70680982e+01 -3.36897636e+01 -2.74936245e+01 -1.12432553e+01
  8.85493706e+00  2.50657815e+01  3.12033008e+01  2.55562690e+01
  1.03888477e+01 -8.05790735e+00 -2.29155248e+01 -2.87540424e+01
 -2.35507983e+01 -9.80308843e+00  7.14561492e+00  2.05380855e+01
  2.57732301e+01  2.10932490e+01  8.86322889e+00 -6.14559534e+00
 -1.80666743e+01 -2.27850512e+01 -1.87271137e+01 -7.81441259e+00
  5.36167153e+00  1.61436543e+01  2.02975923e+01  1.66247570e+01
  7.05138114e+00 -4.63658173e+00 -1.41895898e+01 -1.78396501e+01
 -1.46634150e+01 -6.32607996e+00  3.90027551e+00  1.21450065e+01
  1.54316580e+01  1.27121176e+01  5.63814161e+00 -3.03866397e+00
 -9.82623015e+00 -1.24702344e+01 -1.03525898e+01 -4.73098658e+00
  2.24573472e+00  7.60152097e+00  9.58897479e+00  7.92571369e+00
  3.56836331e+00 -1.61934286e+00 -5.69234791e+00 -7.21888050e+00
 -5.98178529e+00 -2.74044320e+00  9.40442056e-01  3.60242842e+00
  4.64669169e+00  3.90300288e+00  2.05204196e+00 -3.23056837e-02
 -1.50585679e+00 -2.11958968e+00 -1.84011202e+00 -1.16394993e+00
 -4.59262996e-01 -9.29507813e-02]

We then zip the power and corresponding periods and sort to identify the period with maximum signal.

max_pwr, max_period = sorted(zip(pwr, period))[-1]
print(max_pwr, max_period)
46.791730073275666 10

For symbolic data#

We create a sequence as just a string

s = (
    "ATCGTTGGGACCGGTTCAAGTTTTGGAACTCGCAAGGGGTGAATGGTCTTCGTCTAACGCTGG"
    "GGAACCCTGAATCGTTGTAACGCTGGGGTCTTTAACCGTTCTAATTTAACGCTGGGGGGTTCT"
    "AATTTTTAACCGCGGAATTGCGTC"
)

We then specify the motifs whose occurrences will be converted into 1, with all other motifs converted into 0. As we might want to do this in batches for many sequences we use a factory function.

from cogent3.maths.stats.period import SeqToSymbols

seq_to_symbols = SeqToSymbols(["AA", "TT", "AT"])
symbols = seq_to_symbols(s)
len(symbols) == len(s)
symbols
array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=uint8)

We then estimate the integer discrete Fourier transform for the full data. To do this, we need to pass in the symbols from full conversion of the sequence. The returned values are the powers and periods.

from cogent3.maths.period import ipdft

powers, periods = ipdft(symbols)
powers
array([3.22082108e-14, 4.00000000e+00, 9.48683298e+00, 6.74585634e+00,
       3.46410162e+00, 3.20674669e+00, 4.13171488e+00, 1.21157382e+00,
       5.24998151e+00, 8.17918725e+00, 8.83176087e+00, 6.68973168e+00,
       2.30390506e+00, 5.77601816e+00, 7.39581554e+00, 9.80089521e+00,
       1.28140014e+01, 5.59887450e+00, 4.59868328e+00, 8.51990117e+00,
       6.87861072e+00, 2.21002402e+00, 2.99221455e+00, 7.02839146e+00,
       9.56255506e+00, 1.06605556e+01, 1.06350498e+01, 9.87505455e+00,
       8.75521377e+00, 7.58223742e+00, 6.57383339e+00, 5.86043328e+00,
       5.48856222e+00, 5.41967659e+00, 5.54831380e+00, 5.74584888e+00,
       5.89930756e+00, 5.92786864e+00, 5.78419759e+00, 5.45019553e+00,
       4.93168494e+00, 4.25398249e+00, 3.46051662e+00, 2.62061706e+00,
       1.86912424e+00, 1.51346882e+00, 1.83960376e+00, 2.57129507e+00,
       3.40313085e+00, 4.21317980e+00, 4.95053395e+00, 5.58947522e+00,
       6.11597773e+00, 6.52293480e+00, 6.80798765e+00, 6.97227258e+00,
       7.01955788e+00, 6.95558284e+00, 6.78752487e+00, 6.52356140e+00,
       6.17250947e+00, 5.74353209e+00, 5.24590403e+00, 4.68883280e+00,
       4.08133521e+00, 3.43218310e+00, 2.74997407e+00, 2.04355896e+00,
       1.32410691e+00, 6.22045261e-01, 3.65309295e-01, 1.00750601e+00,
       1.73887675e+00, 2.47484570e+00, 3.20342519e+00, 3.91921485e+00,
       4.61856189e+00, 5.29863885e+00, 5.95716246e+00, 6.59227384e+00,
       7.20247125e+00, 7.78656372e+00, 8.34363454e+00, 8.87301027e+00,
       9.37423335e+00, 9.84703756e+00, 1.02913256e+01, 1.07071489e+01,
       1.10946887e+01, 1.14542396e+01, 1.17861941e+01, 1.20910287e+01,
       1.23692910e+01, 1.26215890e+01, 1.28485807e+01, 1.30509645e+01,
       1.32294720e+01, 1.33848600e+01, 1.35179043e+01, 1.36293943e+01,
       1.37201278e+01, 1.37909065e+01, 1.38425325e+01, 1.38758049e+01,
       1.38915172e+01, 1.38904543e+01, 1.38733912e+01, 1.38410910e+01,
       1.37943034e+01, 1.37337637e+01, 1.36601922e+01, 1.35742932e+01,
       1.34767546e+01, 1.33682480e+01, 1.32494280e+01, 1.31209329e+01,
       1.29833845e+01, 1.28373882e+01, 1.26835338e+01, 1.25223958e+01,
       1.23545340e+01, 1.21804940e+01, 1.20008083e+01, 1.18159967e+01,
       1.16265675e+01, 1.14330180e+01, 1.12358361e+01, 1.10355008e+01,
       1.08324839e+01, 1.06272504e+01, 1.04202606e+01, 1.02119707e+01,
       1.00028347e+01, 9.79330530e+00, 9.58383576e+00, 9.37488120e+00,
       9.16690010e+00, 8.96035593e+00, 8.75571856e+00, 8.55346580e+00,
       8.35408477e+00, 8.15807318e+00, 7.96594040e+00, 7.77820828e+00,
       7.59541162e+00, 7.41809819e+00, 7.24682812e+00, 7.08217267e+00])
periods
array([  2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
        15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
        28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,
        41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,
        54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,
        67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
        80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105,
       106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
       119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
       132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
       145, 146, 147, 148, 149])

We can also compute the auto-correlation statistic, and the hybrid (which combines IPDFT and auto-correlation).

from cogent3.maths.period import auto_corr, hybrid

powers, periods = auto_corr(symbols)
powers
array([11.,  9., 11.,  9.,  6.,  9.,  7.,  7.,  5.,  6., 11.,  9.,  9.,
        8., 10.,  9.,  4.,  7., 10., 12., 11.,  7.,  9.,  7.,  8.,  5.,
        8.,  9.,  4.,  7.,  7., 11.,  9., 10.,  8.,  7.,  6.,  5.,  5.,
        6.,  3.,  2.,  3.,  6.,  5.,  4.,  7.,  5.,  6.,  8.,  7.,  6.,
        4.,  8.,  5.,  5.,  4.,  5.,  7.,  6.,  3.,  4.,  5.,  5.,  4.,
        4.,  6.,  4.,  3.,  3.,  5.,  6.,  5.,  4.,  4.,  5.,  3.,  5.,
        5.,  5.,  3.,  2.,  5.,  6.,  6.,  6.,  6.,  6.,  4.,  3.,  3.,
        5.,  4.,  2.,  4.,  2.,  1.,  1.,  5.,  6.,  4.,  2.,  3.,  5.,
        5.,  5.,  6.,  5.,  5.,  3.,  3.,  3.,  2.,  2.,  3.,  2.,  1.,
        2.,  2.,  3.,  4.,  2.,  2.,  2.,  3.,  3.,  2.,  3.,  1.,  1.,
        0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.])
periods
array([  2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
        15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
        28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,
        41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,
        54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,
        67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
        80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105,
       106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
       119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
       132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
       145, 146, 147, 148, 149])
powers, periods = hybrid(symbols)
powers
array([3.54290319e-13, 3.60000000e+01, 1.04355163e+02, 6.07127071e+01,
       2.07846097e+01, 2.88607202e+01, 2.89220041e+01, 8.48101672e+00,
       2.62499076e+01, 4.90751235e+01, 9.71493695e+01, 6.02075851e+01,
       2.07351455e+01, 4.62081453e+01, 7.39581554e+01, 8.82080569e+01,
       5.12560056e+01, 3.91921215e+01, 4.59868328e+01, 1.02238814e+02,
       7.56647179e+01, 1.54701681e+01, 2.69299309e+01, 4.91987402e+01,
       7.65004405e+01, 5.33027778e+01, 8.50803983e+01, 8.88754909e+01,
       3.50208551e+01, 5.30756620e+01, 4.60168337e+01, 6.44647661e+01,
       4.93970600e+01, 5.41967659e+01, 4.43865104e+01, 4.02209421e+01,
       3.53958454e+01, 2.96393432e+01, 2.89209880e+01, 3.27011732e+01,
       1.47950548e+01, 8.50796497e+00, 1.03815499e+01, 1.57237024e+01,
       9.34562120e+00, 6.05387527e+00, 1.28772263e+01, 1.28564753e+01,
       2.04187851e+01, 3.37054384e+01, 3.46537377e+01, 3.35368513e+01,
       2.44639109e+01, 5.21834784e+01, 3.40399383e+01, 3.48613629e+01,
       2.80782315e+01, 3.47779142e+01, 4.75126741e+01, 3.91413684e+01,
       1.85175284e+01, 2.29741283e+01, 2.62295201e+01, 2.34441640e+01,
       1.63253408e+01, 1.37287324e+01, 1.64998444e+01, 8.17423584e+00,
       3.97232072e+00, 1.86613578e+00, 1.82654647e+00, 6.04503603e+00,
       8.69438374e+00, 9.89938281e+00, 1.28137008e+01, 1.95960742e+01,
       1.38556857e+01, 2.64931942e+01, 2.97858123e+01, 3.29613692e+01,
       2.16074137e+01, 1.55731274e+01, 4.17181727e+01, 5.32380616e+01,
       5.62454001e+01, 5.90822254e+01, 6.17479537e+01, 6.42428932e+01,
       4.43787546e+01, 3.43627188e+01, 3.53585824e+01, 6.04551433e+01,
       4.94771639e+01, 2.52431781e+01, 5.13943227e+01, 2.61019291e+01,
       1.32294720e+01, 1.33848600e+01, 6.75895216e+01, 8.17763660e+01,
       5.48805111e+01, 2.75818129e+01, 4.15275974e+01, 6.93790246e+01,
       6.94575858e+01, 6.94522714e+01, 8.32403473e+01, 6.92054550e+01,
       6.89715168e+01, 4.12012911e+01, 4.09805767e+01, 4.07228795e+01,
       2.69535092e+01, 2.67364959e+01, 3.97482840e+01, 2.62418659e+01,
       1.29833845e+01, 2.56747763e+01, 2.53670676e+01, 3.75671874e+01,
       4.94181358e+01, 2.43609880e+01, 2.40016166e+01, 2.36319935e+01,
       3.48797024e+01, 3.42990539e+01, 2.24716721e+01, 3.31065025e+01,
       1.08324839e+01, 1.06272504e+01, 0.00000000e+00, 1.02119707e+01,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 9.37488120e+00,
       9.16690010e+00, 8.96035593e+00, 0.00000000e+00, 8.55346580e+00,
       8.35408477e+00, 8.15807318e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00])
periods
array([  2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
        15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
        28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,
        41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,
        54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,
        67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,
        80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105,
       106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
       119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131,
       132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144,
       145, 146, 147, 148, 149])

Estimating power for specified period#

For numerical (continuous) data#

We just use sig created above. The Goertzel algorithm gives the same result as the dft.

from cogent3.maths.period import goertzel

pwr = goertzel(sig, 10)
print(pwr)
50.76859347191462

For symbolic data#

We use the symbols from the above example. For the ipdft, auto_corr and hybrid functions we just need to identify the array index containing the period of interest and slice the corresponding value from the returned powers. The reported periods start at llim, which defaults to 2, but indexes start at 0, the index for a period-5 is simply 5-llim.

powers, periods = auto_corr(symbols)
llim = 2
period5 = 5 - llim
periods[period5]
5
powers[period5]
9.0

For Fourier techniques, we can compute the power for a specific period more efficiently using Goertzel algorithm.

from cogent3.maths.period import goertzel

period = 4
power = goertzel(symbols, period)
ipdft_powers, periods = ipdft(symbols)
ipdft_power = abs(ipdft_powers[period - llim])
round(power, 6) == round(ipdft_power, 6)
power
9.486832980505154

It’s also possible to specify a period to the stand-alone functions. As per the goertzel function, just the power is returned.

power = hybrid(symbols, period=period)
power
104.35516278555667

Measuring statistical significance of periodic signals#

For numerical (continuous data)#

We use the signal provided above. Because significance testing is being done using a resampling approach, we define a calculator which precomputes some values to improve compute performance. For a continuous signal, we’ll use the Goertzel algorithm.

from cogent3.maths.period import Goertzel

goertzel_calc = Goertzel(len(sig), period=10)

Having defined this, we then just pass this calculator to the blockwise_bootstrap function. The other critical settings are the block_size which specifies the size of segments of contiguous sequence positions to use for sampling and num_reps which is the number of permuted replicate sequences to generate.

from cogent3.maths.stats.period import blockwise_bootstrap

obs_stat, p = blockwise_bootstrap(
    sig, calc=goertzel_calc, block_size=10, num_reps=1000
)
print(obs_stat)
print(p)
50.76859347191462
0.0

For symbolic data#

Permutation testing#

The very notion of permutation testing for periods, applied to a genome, requires the compute performance be as quick as possible. This means providing as much information up front as possible. We have made the implementation flexible by not assuming how the user will convert sequences to symbols. It’s also the case that numerous windows of exactly the same size are being assessed. Accordingly, we use a class to construct a fixed signal length evaluator. We do this for the hybrid metric first.

from cogent3.maths.period import Hybrid

hybrid_calculator = Hybrid(len(s), period=4)

Note

We defined the period length of interest in defining this calculator because we’re interested in dinucleotide motifs.

We then construct a seq-to-symbol convertor.

from cogent3.maths.stats.period import SeqToSymbols

seq_to_symbols = SeqToSymbols(["AA", "TT", "AT"], length=len(s))

The rest is as per the analysis using Goertzel above.

from cogent3.maths.stats.period import blockwise_bootstrap

stat, p = blockwise_bootstrap(
    s,
    calc=hybrid_calculator,
    block_size=10,
    num_reps=1000,
    seq_to_symbols=seq_to_symbols,
)
print(stat)
p < 0.1
104.35516278555667
True