Skip to content

Towards Markov Chains

DNA as a Markov chain

Several packages (e.g. MarkovChainsHammer.jl, DiscreteMarkovChains.jl, etc.) in the Julia ecosystem have been implemented to work with Markov chains with a state space of integers, those could be efficient in many ways, but they are clumsy to work with a specialized biological types as in the BioJulia ecosystem. Therefore, in the GeneFinder package we dedicated some implementations to work with BioSequence types so that we can expand the functionality in an efficient way (see complete API).

One important step towards many gene finding algorithms is to represent a DNA sequence as a Markov chain. In this representation a DNA sequence of a reduced alphabet A={A,C,G,T} is draw as a four-vertex graph, where each letter of A is a state (vertex) and the edges of the graph represent transitions from one nucleotide to another in a sequence (e.g. AT represent a single nucleotide to nucleotide transition). This is also considered more specifically as a Discrete Markov chain (Axelson-Fisk 2015). The complete set of transitions and states of a DNA sequence of alphabet A.

More formally a Markov chain is a random process where each state is a random variable Xt where tT is a discrete time in a finite sequence T and the probability to jump from one state into another is only dependent of the current state. Therefore a definition of this Markov property is given by:

P(Xt=j|Xt1=i)

where i,jA . This property led us to generalize a way to calculate the probability of a sequence T from a process (X1...XT) where each random variable is a nucleotide from A so that:

P(X1=i1,...,XT=iT)=P(X1=i1)t=2TP(Xt=it|Xt1=it1)

Note that previous equations has two terms, a initial probability P(X1=i1) and the the product of all transitions beginning at t=2. So, to calculate the initial probability distribution of each of the nucleotides of a string T with the alphabet 𝒜 we can first calculate the transition probability matrix M^ out of the frequency count of the transitions. In an alphabet 𝒜 we got 42 transitions of one order, that is the AA,AC,AG,... which coincides with the frequency of the dinucleotides in the sequence. So we can later in fact build a 4x4 matrix representing all the transitions. For instance in a DNA sequence T of 24 nucleotides:

CCTCCCGGACCCTGGGCTCGGGAC

We can calculate each frequency nucleotide to any other nucleotide m^ij=cijci where cij is the actual count of the dinucleotide, and therefore ci is the counts of the nucleotide i to any other nucleotide and build the transition probability matrix:

[ACGTA0.001.000.000.00C0.000.560.220.30G0.250.120.620.00T0.000.670.330.00]

It is noteworthy that initial probabilities can also be obtained from the counts of each nucleotide transitions cij over the total sum of the dinucleotide counts ck:

π^i=cikck

That way for the previous example example we can can calculate the initial probabilities π^=(0.08,0.43,0.34,0.13). Both set of probabilities composed a transition model that can be used to predict the probability of any DNA sequence using equation (2).

References

Axelson-Fisk, Marina. 2015. Comparative Gene Finding. Vol. 20. Computational Biology. London: Springer London. http://link.springer.com/10.1007/978-1-4471-6693-1.