Showing posts with label python.

Jan 16, 2014

PIL -> Pillow

Pillow is a PIL fork created to add new features, setuptools support and a more frequent release cycle. With Pillow you can have PIL as a package dependency in setuptools and virtualenv, which means less clutter and more robustness during development. :-)

Pillow allows you to continue to use import PIL, so there is no need to change your current PIL-related code: zero migration overhead.
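
As a minimal sketch of what this looks like in practice (the package, version and file names here are hypothetical), the project depends on Pillow while the code keeps its PIL imports:

# setup.py: declare Pillow, not PIL, as the dependency
from setuptools import setup

setup(
    name="myapp",                 # hypothetical package name
    version="0.1",
    install_requires=["Pillow"],
)

# anywhere in the application code, the usual PIL imports keep working
from PIL import Image
image = Image.open("photo.png")  # hypothetical file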

Arch Linux has already dropped PIL in favor of Pillow.

TL;DR PIL -> Pillow

Jan 9, 2014

Python cookbook: get the file dir path

import os
# absolute path of the directory containing the current file
os.path.dirname(os.path.abspath(__file__))
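
A common use is building the path of a file that lives next to the script, independent of the current working directory (the config.ini name is hypothetical):

import os

# path of "config.ini" in the same directory as this script
here = os.path.dirname(os.path.abspath(__file__))
config_path = os.path.join(here, "config.ini")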

Jun 11, 2013

Nix

For the last few months my team (OpenBossa) at INDT (Instituto Nokia de Tecnologia) has been working on WebKitNix (Nix for short). WebKitNix is a new WebKit2 port, fully based on POSIX and OpenGL. It uses the CMake build system (like the GTK and EFL ports), with GLib, libsoup (networking) and Cairo (2D graphics) as dependencies. It also uses Coordinated Graphics and the Tiled Backing Store from WebKit2. Many of its building blocks are shared with other ports already in trunk.

The Nix port offers a C API, based on the WebKit2 API, for rendering a WebView within an OpenGL context. You can use Nix to create applications such as a web browser or a web runtime. WebKit2 runs the content of each web page in a separate process; this process isolation keeps the UI responsive, with smooth animations and scrolling, because the UI does not get blocked by JavaScript execution.

We want to ease the work of consumer electronics developers who want a web runtime on their platform, without the need to create yet another WebKit port. That is why Nix has fewer dependencies than the EFL, GTK or Qt ports, whose dependencies would also have to be ported to the target platform.

The Nix API also enables the application to interact with the JavaScript context, so it is possible to add new APIs to handle application-specific features.


How did it start?

The OpenBossa team maintained the QtWebKit port for years, helping Nokia/Trolltech. In the last years, the experience gathered with the Snowshoe browser showed us that we were handling dependencies (such as QtNetwork) much bigger than we really needed. We tried to replace some dependencies of QtWebKit, and later of EFL, to see how minimal WebKit could be. We took the following steps:

  1. Initial idea: platform/posix or platform/glib (share code)
  2. The problem evolved: we wanted to have different behaviors for QQuickWebView -> Qt Raw WebView
  3. Network: QtWebKit + Soup experiment
  4. Efl Raw WebView experiment
  5. Efl Without Efl :-)
  6. Nix

How to use it?

Once you compile the Nix source code, you can run the MiniBrowser to test it:

$ $WEBKITOUTPUTDIR/Release/bin/MiniBrowser http://maps.nokia.com

MiniBrowser code

The Nix-Demos repository offers some example code, including a GLUT-based browser and minimal Python bindings for Nix: https://github.com/WebKitNix/nix-demos.

Nix-Demos also includes a Nix view using only DispmanX and OpenGL ES 2, working on the Raspberry Pi. To compile this demo, you will need our RaspberryPi SDK.


There is even a browser with its UI written in HTML: Drowser.

Feel free to contact us on #webkitnix at freenode.

Roadmap

Our plan is to upstream Nix to WebKit trunk by June 2013. Then, we will keep up the maintenance and focus on the web platform, including new HTML5 features such as WebRTC.

Jul 7, 2011

Protein Profile with HMM

Profiling is the act of summarizing a data set in a smaller mathematical model. One practical use of profiling techniques is the classification of sequences: with a profile of a data set, you can calculate the distance of an instance to the model and classify the instance through this value.

Profiling proteins is a more specific case of profiling sequences. As we know from the previous post, Hidden Markov Models (HMMs) are a very robust mathematical model for representing sequences probabilistically. This post will detail how to profile sets of proteins with an HMM.

The HMM for this problem must be one that, when we calculate the probability of a protein being generated by the model, gives higher probabilities to proteins similar to the ones used to create the profile than to all others.

To build the model, I started with a multiple sequence alignment to see the structure of the sequences; that means our HMM will be a probabilistic representation of a multiple alignment. In the alignment we see that some columns are complete, some are almost complete, and others hold little data. The most filled columns can be used as the match states of our model, while deletions and insertions can be modelled as other states. Here is an example of a few aligned sequences and the core columns of the alignment (marked with an *):

A C - - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
* * *       * * *

This sample alignment contains examples of insertions and deletions between the core columns, each of which gets a state in the HMM to represent it. Note that insertions may occur an arbitrary number of times between match states, and a deletion always replaces some match. One possible HMM template for building the model is presented in the following picture:

[Image: HMM protein profile model]
Each match state (Mj) is related to a match in the j-th core column; the same applies to each deletion state (Dj). The insertion states are slightly different because they represent the insertions between core columns: that's why there is one extra insertion state, which makes it possible to represent insertions before the first and after the last core column.

Our model will be like the one in the picture, with the same length as the core alignment of a multiple alignment for the given set of sequences. We then use maximum likelihood to estimate the transition and emission probabilities of each state. The easiest way to count the total emissions/transitions is to thread each sequence to be profiled through the model. For each symbol in the aligned sequence, check whether it sits in a core column: if it does, increment the transition into the corresponding match state (or into the deletion state when the symbol is a gap); otherwise, go to the insertion state and increment that transition. Finally, to turn counts into probabilities, divide the count of each transition by the total count of transitions leaving the same state. It's important to notice that we have a stopping state, a special state with no outgoing transitions.
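
Here is a minimal sketch of this counting step, assuming alignment is a list of equal-length strings with '-' for gaps and core_columns holds the sorted indices of the core columns (all names are illustrative):

from collections import defaultdict

def count_transitions(alignment, core_columns):
    # rank maps a core column index to its match/delete state number (1-based)
    rank = dict((col, i + 1) for i, col in enumerate(core_columns))
    counts = defaultdict(int)  # (state, next state) -> transition count
    for sequence in alignment:
        state = ('BEGIN',)
        inserts_passed = 0  # index of the insertion state we are currently in
        for j, symbol in enumerate(sequence):
            if j in rank:  # a core column: either a match or a deletion
                next_state = ('D', rank[j]) if symbol == '-' else ('M', rank[j])
                inserts_passed = rank[j]
            elif symbol != '-':  # residue outside the core columns: insertion
                next_state = ('I', inserts_passed)
            else:  # gap outside the core columns: nothing to count
                continue
            counts[(state, next_state)] += 1
            state = next_state
        counts[(state, ('END',))] += 1
    return counts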

Note that it is important to initialize the model with pseudocounts on every possible transition. Adding pseudocounts makes our model less rigid and avoids overfitting to the training set; otherwise we could end up assigning zero probability to some sequences.

To create the emission probabilities, at each match state you also count which symbols were emitted there during the threading process. To calculate each probability, divide the count by the total number of symbols matched at that state.

A similar process could be used for the insertions. However, insertion states are characterized by a low occurrence rate, which may lead directly to an overfitting problem because the number of observed emissions can be too small. To avoid this, we use the background distribution as the emission probabilities of each insertion state. The background distribution is the probability of occurrence of a given amino acid over the entire protein set: count each amino acid type over all the training sequences and divide by the total count.
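
The background distribution itself is simple to compute; a minimal sketch, assuming sequences is the list of training proteins as plain strings:

from collections import Counter

def background_distribution(sequences):
    # frequency of each amino acid over the whole training set
    counts = Counter(aa for seq in sequences for aa in seq if aa != '-')
    total = float(sum(counts.values()))
    return dict((aa, n / total) for aa, n in counts.items())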

For the deletion states, it's important to notice that they are silent states: they don't emit any symbol at all. To mark a state as silent in ghmm, just set all its emission probabilities to 0. Note that the end state is also a silent state, since it has no emission associated with it.

ALERT: the loglikelihood is the only function in the ghmm library which handles silent states correctly.

At this point the model is ready for use, but we still have the problem of how to classify a new protein sequence. A threshold could be used to divide the two classes; however, the most common alternative is to compare against a null model. The null model aims to represent any protein with a probability similar to any other, so with the two models we can check whether a sequence is more similar to a general definition of a protein or to the specific family. It should model sequences with average size equal to the aligned sequences being handled, and should emit any kind of symbol at each position. A common way to create the null model is a single insertion state which goes to a stopping state with probability 1 divided by the average length of the sequences in the training set. For the emission probabilities we use the background distribution, since it reflects the general amino acid distribution. In the end, the model should look like this:

[Image: Null model]
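
In code, the null model boils down to two transition numbers plus the background distribution. A minimal sketch, under the same illustrative names as before:

def null_model_parameters(sequences, background):
    # a single insertion state that loops on itself and stops with
    # probability 1 / (average training sequence length)
    average_length = sum(len(s) for s in sequences) / float(len(sequences))
    p_stop = 1.0 / average_length
    p_loop = 1.0 - p_stop
    return p_loop, p_stop, background  # emissions follow the background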

To test the proposed model, I used a set of 100 globin proteins from the NCBI protein repository as a training set to build a profile model, using ghmm to build and test it.

To test whether the model meets our expectations, 100 globins different from the ones in the training set were used, together with 1800 other random proteins of similar length. The loglikelihood function from the ghmm library was used to calculate the similarity index. The classification of globins versus non-globins was a comparison between the likelihood of each protein under the globin profile HMM and under the null model. This test gave us 100% accuracy! To display the result, each sequence was plotted in a graph where the horizontal axis is the length of the sequence and the vertical axis is the log of the ratio between the globin and null model likelihoods (i.e., the globin loglikelihood minus the null model loglikelihood). The globins are plotted in red and the others in blue; proteins above the zero line are classified as globins, and below it as non-globins. The graph shows a clear distinction between the classes, and that our model is very precise for this problem.
[Image: Length x log(globin profile model likelihood / null model likelihood)]

Jun 11, 2011

Optimizing Functions with Python Caching Decorators

In the last months I've been solving some problems (such as some HMM algorithms) whose best solutions involve some kind of dynamic programming. Some of them are quite simple to implement, but their recursive formulation is far more intuitive. The problem is that recursive functions aren't handled efficiently, even in functional languages, unless you use some mechanism like tail calls, which isn't as intuitive as we would like. The simplest example that comes to mind is the Fibonacci function, usually defined as:

fib(0) = 1
fib(1) = 1
fib(n) = fib(n-1) + fib(n-2)



As we know, almost all language compilers and interpreters use the call stack to evaluate the recursive cases of a function being executed. Consider the following C Fibonacci version:

int fib(int n)
{
    if (n == 0 || n == 1)
        return 1;
    else
        return fib(n-1) + fib(n-2);
}


It is really simple to understand when contrasted with the definition. But if we trace the program (even for a small input value) we get something like the following evaluation order:

fib(6) = fib(5) + fib(4)
fib(5) = fib(4) + fib(3)
fib(4) = fib(3) + fib(2)
fib(3) = fib(2) + fib(1)
fib(2) = fib(1) + fib(0)
fib(3) = fib(2) + fib(1)
fib(2) = fib(1) + fib(0)
fib(4) = fib(3) + fib(2)
fib(3) = fib(2) + fib(1)
fib(2) = fib(1) + fib(0)
fib(3) = fib(2) + fib(1)
fib(2) = fib(1) + fib(0)

[Image: Call stack for fib(6)]
As we can see, fib(4) down to fib(1) are recalculated many times, which is something we can avoid. In fact, this solution has exponential computational complexity: each input n branches into two calls until reaching 0 or 1, roughly n levels deep, leading to O(2^n) complexity. A simple way to avoid this is converting the function into an iterative form:

int fib(int n)
{
    int current = 1;
    int previous = 1;
    int i;
    for (i = 1; i < n; i++) {
        int temp = current; // XXX: nasty
        current += previous;
        previous = temp;
    }

    return current;
}


The same result is achieved using tail calls in functional languages.

As you can see, this obscures the intuitive recursive definition. And we still have a problem: whenever we calculate fib(n), we recompute all its previous results, even if they were already calculated in an earlier call. If this function is used many times in our program, a lot of processing goes into re-computing known values. We can avoid this with dynamic programming, which keeps a record of previously calculated results. The drawback of this technique is memory usage, which for large inputs can become a bottleneck; however, CPU time is usually the scarcer resource. A C implementation (not the most elegant) is:

// XXX: kids, don't do this at home
int fib_results[10000];
int last_fib;

int fib(int n)
{
    if (n <= last_fib)
        return fib_results[n];

    int current = fib_results[last_fib];
    int previous = fib_results[last_fib - 1];
    while (last_fib < n) {
        int temp = current;
        current += previous;
        previous = temp;
        fib_results[++last_fib] = current;
    }

    return current;
}

int main()
{
    fib_results[0] = 1;
    fib_results[1] = 1;
    last_fib = 1;

    // ... other stuff ...

    return 0;
}


As we can see, dynamic programming isn't too hard to implement. On the other hand, reading such code is a tough task unless you are already familiar with the algorithm.
If we extract the fundamental concept of dynamic programming, "store pre-computed results", we find a regularity in every recursive function that can be transformed into a dynamic programming one. One of the reasons I love Python is how easy it is to use meta-programming concepts, and that's what I will use to transform recursive functions into their dynamic form in a ridiculously easy way, using function decorators.
Function decorators (similar to annotations in Java) are a form of meta-programming for functions. They extend functions with functionality such as debugging, tracing, adding metadata, synchronization, or memoization (not memorization) of values. Memoization is a great way to optimize recursive functions by caching their results (if you have enough memory available). One possible implementation of a memoization decorator in Python is the following:

def cached(function):
    cache = {}  # maps argument tuples to previously computed results
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = function(*args)
            cache[args] = result
            return result

    return wrapper

Note that I'm not using kwargs because, unlike the args tuple, they aren't hashable, and handling them would add complexity to the example. See also that we have a function that returns another function, which uses the given one to calculate results and store them in a cache. To cache our fib function we may use the following code:

@cached
def fib(n):
    if n == 0 or n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)

# or in a not so clean version:
def normal_fib(n):
    if n == 0 or n == 1:
        return 1
    else:
        # these recursive calls resolve through the module-level name fib,
        # which is re-bound below, so they also hit the cache
        return fib(n-1) + fib(n-2)

fib = cached(normal_fib)
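
As a quick check, the decorated fib answers instantly for inputs that would be intractable with the naive recursive version (the value below assumes the fib(0) = fib(1) = 1 definition used throughout this post):

>>> print fib(100)
573147844013817084101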


This technique is really useful to improve your code's performance in a very easy way. On the other hand, it isn't the best solution for every case: many times, coding a proper dynamic programming method will be necessary (if performance is crucial). It is also important to notice that I didn't use any cache management policy, which matters for keeping memory usage under control. More appropriate cache data structures (such as numpy arrays for integer arguments) are also welcome. Python 3.2 added the lru_cache decorator to the functools module to provide exactly this cache mechanism. If you are already using that version, it's smarter to use it instead of implementing your own. Here is how it is used:

# Python >= 3.2
import functools

@functools.lru_cache(maxsize=500) # a fixed-size cache avoids memory usage explosion
def fib(n):
    ...


This technique is very useful not only to save CPU resources but also network usage (such as caching SQL query results), other IO operations (such as disk reads), and even user interaction input, all in a straightforward way.

Jun 5, 2011

CpG Islands (3) - Model Evaluation

In the previous post we built a Hidden Markov Model (HMM) for the CpG islands problem using the training set. Now we should evaluate whether our model is adequate for predicting things about CpG islands. For the evaluation we may use a tagged sequence and see if the HMM we built can predict the islands, and with what accuracy.

Using the Viterbi path of a tagged sequence (out of the training set) enables us to check whether the estimates given by the model are coherent with the real data. Using the HMM and the training and test sets of the last post, we can compare the results with a confusion matrix. For example, take the following snippet of the Viterbi path paired with the tagged sequence:

[ ... 0,0,0,0,0,1,1,1,1,1, ... ]
[ ... a t t g t c C G A C ... ]


We may calculate the following confusion matrix:

  +----------- P R E D I C T E D --------+
A |            |   Island   | Non Island |
C +------------+------------+------------+
T |   Island   |      4     |      0     |
U +------------+------------+------------+
A | Non Island |      1     |      5     |
L +------------+------------+------------+
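
Computing this matrix from the two paired lists is straightforward. Here is a minimal sketch, assuming upper case nucleotides mark islands as in the previous post (function and variable names are illustrative):

def confusion_matrix(viterbi_path, tagged_sequence):
    # rows are the actual classes, columns the predicted ones (1 = island)
    matrix = [[0, 0], [0, 0]]
    for predicted, nucleotide in zip(viterbi_path, tagged_sequence):
        actual = 1 if nucleotide.isupper() else 0
        matrix[actual][predicted] += 1
    return matrix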


Note that the bigger the values on the main diagonal are, the more accurate our model is. Building this matrix for the experiment (the HMM estimated with maximum likelihood) I got the following:

  +----------- P R E D I C T E D --------+
A |            |   Island   | Non Island |
C +------------+------------+------------+
T |   Island   |   328815   |    61411   |
U +------------+------------+------------+
A | Non Island |   1381515  |   1889819  |
L +------------+------------+------------+



We can see that a good amount of the islands was correctly predicted (84.26% of them), but 42.24% of the non-island regions were also classified as islands, which is a bad indicator. An experiment with the HMM after running Baum-Welch was also evaluated. The results were better than those obtained with maximum likelihood when predicting islands (96.56% accuracy), but the miss rate was also higher (70.70% of non-islands classified as islands). Here is the confusion matrix for this HMM:

  +----------- P R E D I C T E D --------+
A |            |   Island   | Non Island |
C +------------+------------+------------+
T |   Island   |   376839   |    13387   |
U +------------+------------+------------+
A | Non Island |   2313138  |   958196   |
L +------------+------------+------------+

[Image: Matching of real sequence, Viterbi and posterior results]

The results obtained were not very accurate. This may have happened because we had little data to train and evaluate our model. We could also build an HMM with other indicators that identify CpG islands, to model them better. Remember we simplified the CpG problem to the frequencies of occurrence of Cs and Gs, but a better model could also use the CG pairs. With a more complex HMM we could have more than 2 states and then associate a set of states with the CpG and non-CpG sites. This would allow better use of the posterior result to classify the sequence. But I'll keep this as future work.

Jun 2, 2011

CpG Islands (2) - Building a Hidden Markov Model

Given the definition of CpG islands in the previous post and the short introduction to Hidden Markov Models (HMMs), we can now model an HMM for finding CpG islands. We can create a model very similar to the one for the "Fair Bet Casino" problem.

At any nucleotide of a given DNA sequence there are two possibilities: that nucleotide belongs to a CpG island (let's denote that state S1) or it does not (S0). Looking at the next nucleotide, it can stay in the same state or change state, with complementary probabilities. We can view it as this Markov chain:

[Image: CpG states Markov chain]

As emissions of the states S0 and S1 we have the probabilities of emitting each of the A, C, G and T symbols. We can't know these probabilities up front; however, if we have a dataset of sequences annotated with where the CpG islands occur, we can estimate them. I used these DNA sequences tagged by biologists (parsed from the .fasta files with the biopython library) as the training and test sets to build the model and evaluate the technique. The dataset consists of nucleotides in upper and lower case: an upper case nucleotide denotes a CpG site, while lower case means it is not one. I used a statistical technique called maximum likelihood to build the HMM.

To calculate the HMM emission probabilities by maximum likelihood, I count each character's frequency in the dataset and divide it by the total number of nucleotides in the same state:

[Image: Emissions probability diagram]
P(A|S0) = a / (a+c+g+t)
P(C|S0) = c / (a+c+g+t)
P(G|S0) = g / (a+c+g+t)
P(T|S0) = t / (a+c+g+t)
P(A|S1) = A / (A+C+G+T)
P(C|S1) = C / (A+C+G+T)
P(G|S1) = G / (A+C+G+T)
P(T|S1) = T / (A+C+G+T)

To calculate the transition probabilities of each state, count the number of transitions from a CpG to a non-CpG site (out_of) and the reverse transitions (into), then divide each of them by the frequency of the state they're coming from. Note that the permanence probability of each state is the complement of the probability of leaving it, since there is no other state to go to:

P(S0|S1) = out_of / (A+C+G+T)
P(S1|S0) = into / (a+c+g+t)
P(S0|S0) = 1 - P(S1|S0)
P(S1|S1) = 1 - P(S0|S1)

Our model is almost done; it only lacks the initial state distribution, which can be taken as the frequency of each state over all nucleotides:

P(S0) = (a+c+g+t) / (a+c+g+t+A+C+G+T)
P(S1) = (A+C+G+T) / (a+c+g+t+A+C+G+T)
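
Putting all these counts together, here is a minimal sketch of the estimation over a single tagged sequence, assuming upper case marks CpG sites as described above (names are illustrative):

from collections import Counter

def estimate_counts(tagged_sequence):
    emissions = {0: Counter(), 1: Counter()}  # state -> nucleotide counts
    transitions = Counter()                   # (state, next state) -> count
    previous = None
    for nucleotide in tagged_sequence:
        state = 1 if nucleotide.isupper() else 0
        emissions[state][nucleotide.lower()] += 1
        if previous is not None:
            transitions[(previous, state)] += 1
        previous = state
    # dividing each emission count by its state's total, and each transition
    # count by the total leaving its origin state, gives the probabilities above
    return emissions, transitions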

I used ghmm to create my HMM from this dataset, and it generated the following model:

DiscreteEmissionHMM(N=2, M=4)
  state 0 (initial=0.97)
    Emissions: 0.30, 0.21, 0.20, 0.29
    Transitions: ->0 (0.00), ->1 (1.00)
  state 1 (initial=0.03)
    Emissions: 0.22, 0.29, 0.29, 0.20
    Transitions: ->0 (1.00), ->1 (0.00)


It's important to notice that the printed HMM shows transition probabilities of exactly 0.00 and 1.00. This is an artifact of the float-formatted print: the real values are merely very close to 0 and 1, so there is still a chance of hopping between states. It's also interesting to contrast the emission probabilities of C and G between the states: in S1, the probability of emitting a C or a G is significantly higher than in S0.

Another way of creating the HMM is to initialize a model with the same structure we used to model the problem and then apply an unsupervised learning algorithm, such as the Baum-Welch algorithm. Baum-Welch can start from randomized initial weights in the HMM and, fed with untagged sequences, it tries to infer a model. However, it is usually far better to initialize your HMM with biased data about the problem (for example, the model we created) and let Baum-Welch calibrate the probabilities. With ghmm, do the following:

sigma = ghmm.IntegerRange(0, 4) # our emission alphabet: 0..3 for A, C, G, T
emission_seq = ghmm.EmissionSequence(sigma, [0, 2, 3, ... ]) # create an emission sequence
hmm.baumWelch(emission_seq) # use a template hmm and calibrate its probabilities


In the next post I will show the evaluation process of this HMM on the CpG islands problem.

May 30, 2011

Hidden Markov Models

Nowadays, many applications use Hidden Markov Models (HMMs) to solve crucial problems in fields such as bioinformatics, speech recognition, musical analysis, digital signal processing, data mining, financial applications and time series analysis. HMMs are probabilistic models which are very useful for modelling sequence behaviours or discrete time series events. Formally, an HMM models a Markov process with hidden states, like an extension of Markov chains. For computer scientists, it is a state machine with probabilistic transitions where each state can emit a value with a given probability.

To better understand HMMs, I will illustrate how they work with "The Fair Bet Casino" problem. Imagine you are in a casino where you can bet on coin tosses, tossed by a dealer. A coin toss has two outcomes: head (H) or tail (T). Now suppose the dealer has two coins: a fair one (F), which outputs H and T with probability 1/2 each, and a biased one (B), which outputs H with probability 3/4 and T with 1/4. In probability language we say:
  • P(H_{i+1} | F_i) = 1/2
  • P(T_{i+1} | F_i) = 1/2
  • P(H_{i+1} | B_i) = 3/4
  • P(T_{i+1} | B_i) = 1/4
Now imagine that the dealer changes the coin in a way you can't see, but you know he does it with probability 1/10. Thinking of the coin tosses as a sequence of events, we can say:
  • P(F_{i+1} | F_i) = 9/10
  • P(B_{i+1} | F_i) = 1/10
  • P(B_{i+1} | B_i) = 9/10
  • P(F_{i+1} | B_i) = 1/10
We can model it using a graph to illustrate the process:

[Image: HMM for "The Fair Bet Casino" problem]

That's an HMM! It isn't rocket science; it's just important to add a few remarks. We call the set of all possible emissions of the Markov process the alphabet Σ ({H, T} in our problem). For many computational methods involving HMMs you will also need an initial state distribution π; for our problem we may assume equal probability for each coin.

Now the question is what we can do with this model in our hands. There are lots of things, such as discovering, given a sequence of results, when the dealer used the biased coin, or generating a random sequence with a behaviour coherent with the model.

There is a nice library called ghmm (available for C and Python) which handles HMMs and already provides the most famous and important HMM algorithms. Unfortunately the Python wrapper is not very pythonic. Let's model our problem in Python to have some fun:

import ghmm

# setting 0 for Heads and 1 for Tails as our alphabet
sigma = ghmm.IntegerRange(0, 2)

# transition matrix: rows and columns are origin and destination states
transitions_probabilities = [
    [0.9, 0.1], # 0: fair state
    [0.1, 0.9], # 1: biased state
]

# emission matrix: rows are states and columns are symbols
emissions_probabilities = [
    [0.5, 0.5],   # 0: fair state emission probabilities
    [0.75, 0.25], # 1: biased state emission probabilities
]

# initial state distribution
pi = [0.5, 0.5] # equal probabilities for states 0 and 1

hmm = ghmm.HMMFromMatrices(
    sigma,
    # you can model HMMs with other emission probability distributions
    ghmm.DiscreteDistribution(sigma),
    transitions_probabilities,
    emissions_probabilities,
    pi
)


>>> print hmm
DiscreteEmissionHMM(N=2, M=2)

  state 0 (initial=0.50)
    Emissions: 0.50, 0.50
    Transitions: ->0 (0.90), ->1 (0.10)

  state 1 (initial=0.50)
    Emissions: 0.75, 0.25
    Transitions: ->0 (0.10), ->1 (0.90)


Now that we have our HMM object in hand, we can play with it. Suppose you have a given sequence of coin tosses and you would like to distinguish which coin was being used at each toss:

tosses = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1]


The Viterbi algorithm can be used to trace the most probable state at each coin toss according to the HMM distribution:

# not as pythonic as it could be :-/
sequence = ghmm.EmissionSequence(sigma, tosses)

viterbi_path, _ = hmm.viterbi(sequence)

>>> print viterbi_path
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Nice! But sometimes it is interesting to have the probability of each state at each point instead of only the most probable one. For that, you can use the posterior or forward algorithms to get more detailed information.

states_probabilities = hmm.posterior(sequence)

>>> print states_probabilities
[[0.8407944139086141, 0.1592055860913865], [0.860787703168127, 0.13921229683187356], ... ]

The posterior method returns a list with the probability of each state at each position. For example, at the first index we have [0.8407944139086141, 0.1592055860913865]: that means there is a ~0.84 chance that the dealer is using the fair coin and ~0.16 for the biased one. We can also plot the behaviour of the probability of the dealer using the fair coin over time (I used matplotlib for the graphs).
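
We can also go the other way and generate a random toss sequence coherent with the model. If I recall the ghmm API correctly, sampleSingle does it (treat this as a sketch):

# sample a sequence of 20 tosses from the model's distribution
random_tosses = hmm.sampleSingle(20)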

[Image: Probability of being a fair coin over time]
This is only a superficial example of what HMMs can do. It's worth giving them a look if you want to do some sequence or time series analysis in any domain. I hope this post presented and clarified what HMMs are and how they can be used to analyse data.

May 29, 2011

CpG Islands (1) - Problem Motivation & Definitions

This semester I'm attending the course Processing of Sequences and Methods and Algorithms in Computational Biology (basically DNA and proteins). One of its focuses is the use of Hidden Markov Models to solve many of its problems. One well-studied problem is how to find coding sequences (snippets that can be translated into proteins) in a given sequence of DNA, which contains both coding and non-coding regions. One of the most used techniques relies on CpG islands.

In a random DNA sequence we have a lower frequency of two nucleotide types ((G)uanine and (C)ytosine) than of the other two ((A)denine and (T)hymine), due to a phenomenon called DNA methylation. Because of this methylation, a deletion of a C or a G nucleotide is more likely to occur than of an A or a T.

[Image: DNA double helix structure]


During a species' evolution, genes which are essential for survival cannot mutate much; otherwise the organism can't live and reproduce to transmit its genes to its descendants. Because of this, even across many generations, those genes suffer little alteration. It's important to notice that non-coding regions can mutate without affecting the organism, because they don't touch a vital part of the genetic code.

With these two facts in hand, we can see that DNA mutations occur more frequently in non-coding regions. As methylation of C is a very common cause of mutation, non-coding regions end up with fewer Cs and Gs, since there the methylation doesn't interfere (much) with the organism. The opposite occurs in coding regions, which mutate less frequently.
[Image: CpG island around an important human DNA regulation site]

With this information in our hands, we can use it to discover regions which are likely to be coding regions. Sites with greater concentrations of C-phosphate-G (called CpG islands) are a strong indicator that the region hasn't suffered many mutations through the generations, so it is very likely to be an area of interest for biologists. In the next posts I'll describe how we can identify these CpG islands using HMMs and the ghmm python library.