Comparative Analysis of Gene Regulatory Regions Using Mutual Information
Objectives: The present thesis attempts to answer the following questions:
Databases: This study uses various open-source databases that are freely downloadable from the internet.
The following steps are used to calculate the information content of DNA sequences.
Calculation of Information Content from DNA Sequences
1. Sequence Alignment
Input: 10 sequences (Seq-1 to Seq-10) of length 6 nt each. These sequences are already aligned on the transcription start site (TSS), which is represented by +1 in the databases. The following calculations show how to calculate the information content from aligned sequences.
Seq-1  ...GTACTA...
Seq-2  ...ATCTCC...
Seq-3  ...GTACTC...
Seq-4  ...AACGAT...
Seq-5  ...GGATTC...
Seq-6  ...GTTGAT...
Seq-7  ...CTACTC...
Seq-8  ...TCGCGC...
Seq-9  ...CACATG...
Seq-10 ...GTACCA...
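As a starting point for the calculations, the aligned block can be held as a list of equal-length strings and tallied column by column. The following is a minimal Python sketch under that assumption; the names SEQS and column_counts are illustrative and do not come from the thesis.

```python
from collections import Counter

# The ten aligned example sequences (6 nt each).
SEQS = [
    "GTACTA", "ATCTCC", "GTACTC", "AACGAT", "GGATTC",
    "GTTGAT", "CTACTC", "TCGCGC", "CACATG", "GTACCA",
]

def column_counts(seqs):
    """Return one Counter of nucleotide counts per alignment column."""
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    # zip(*seqs) transposes the alignment: one tuple of nucleotides per column.
    return [Counter(col) for col in zip(*seqs)]

counts = column_counts(SEQS)
print(counts[0])  # column 1 holds 5 G, 2 A, 2 C and 1 T
```

Dividing each count by the number of sequences gives the position-wise nucleotide frequencies.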
2. Substitution matrices
We used two types of substitution matrices, distinguished by the unit of substitution: single nucleotides (the conventional case) and two nucleotides at a time (dinucleotides). Single-nucleotide substitutions give a 4×4 matrix, and dinucleotide substitutions give a 16×16 matrix. For both types, we constructed the matrices position-wise as well as block-wise.

Mono-nucleotide (neighbor-independent) substitution matrices
For each column of the block, we first count the matches and mismatches of each type between the first sequence and every other sequence in the block. This is repeated for all columns of the block, and the summed results are stored in a 4×4 matrix. The same procedure is then followed for every remaining sequence in the alignment, adding the counts to those already in the 4×4 matrix. A sliding window of one nucleotide along the sequence is used so that all possible pairs in the block are counted. The total number of nucleotide pairs in a given block, which normalizes the observed frequencies, is ½ws(s-1), and the total number of nucleotides in the block, which normalizes the expected frequencies, is ws, where s is the number of sequences (that is, the number of nucleotides at a given position) and w is the block width.

The resulting 4×4 matrix is used to calculate the odds ratio between the observed pair frequencies q(ij) and those expected by chance, p(i)p(j). This odds ratio, q(ij)/(p(i)p(j)), is also called a likelihood ratio. The log-odds score is then calculated from the odds ratio (usually with the logarithm to base 2) as s(ij) = log2(q(ij)/(p(i)p(j))). For independent events, odds ratios are multiplied or, equivalently, log-odds scores are added to obtain the joint score (Karlin and Altschul, 1990; Altschul, 1991).

Position (column) wise nucleotide frequency
Log-odds ratio: s(ij) = log2(q(ij)/(p(i)p(j)))
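The block-wise counting and log-odds steps for the mono-nucleotide case can be sketched in Python as follows. This is a hedged illustration, not the thesis code. It assumes that each unordered sequence pair is counted in both orders, so that the 4×4 matrix is symmetric and its 16 entries sum to 1, and that pairs never observed are left unscored. All function names are hypothetical.

```python
import math
from itertools import permutations

NUCS = "ACGT"
SEQS = [
    "GTACTA", "ATCTCC", "GTACTC", "AACGAT", "GGATTC",
    "GTTGAT", "CTACTC", "TCGCGC", "CACATG", "GTACCA",
]

def pair_frequencies(seqs):
    """Observed pair frequencies q(ij) over all columns of the block."""
    s, w = len(seqs), len(seqs[0])
    counts = {(a, b): 0 for a in NUCS for b in NUCS}
    for col in zip(*seqs):
        # Assumption: every sequence pair is counted in both orders,
        # giving w*s*(s-1) ordered pairs, i.e. twice 1/2*w*s*(s-1).
        for a, b in permutations(col, 2):
            counts[(a, b)] += 1
    total = w * s * (s - 1)
    return {pair: n / total for pair, n in counts.items()}

def background_frequencies(seqs):
    """Expected single-nucleotide frequencies p(i) over the w*s nucleotides."""
    s, w = len(seqs), len(seqs[0])
    joined = "".join(seqs)
    return {a: joined.count(a) / (w * s) for a in NUCS}

def log_odds(q, p):
    """s(ij) = log2(q(ij) / (p(i) p(j))); None where the pair was never seen."""
    scores = {}
    for (a, b), q_ab in q.items():
        expected = p[a] * p[b]
        scores[(a, b)] = math.log2(q_ab / expected) if q_ab > 0 else None
    return scores

q = pair_frequencies(SEQS)
p = background_frequencies(SEQS)
s = log_odds(q, p)
```

Under this counting convention the row sums of q equal p, so the resulting log-odds scores compare the observed pairs against a background drawn from the same block.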
Di-nucleotide (neighbor-dependent) substitution matrices
Incorporating pair preferences into the substitution matrix gives neighbor-dependent substitution matrices. These are very similar to neighbor-independent substitution matrices, except that the subscripts are pairs of nucleotides. A sliding window of one nucleotide along the sequence is again used to count all possible pairs in the given block. The total number of dinucleotide pairs in a given block, which normalizes the observed frequencies, is ½s(w-1)(s-1), and the total number of dinucleotides, which normalizes the expected frequencies, is s(w-1), where s is the number of sequences and w is the block width. The resulting 16×16 matrix is used to calculate the odds ratio between the observed frequencies q(ij,kl) and those expected by chance, p(ij)p(kl). This odds ratio, q(ij,kl)/(p(ij)p(kl)) (also called the likelihood ratio), is then used to calculate the log-odds score s(ij,kl) = log2(q(ij,kl)/(p(ij)p(kl))).

3. Average mutual information content (H)
These non-coding regions can be compared either by the scores in the substitution matrices themselves or by the information content of the matrices. In information-theoretic terms, the average mutual information content (H) is the relative entropy of the target and background pair frequencies, and can be thought of as the average amount of information (in bits) available per nucleotide pair. In neighbor-independent substitution matrices, the log-odds score of each nucleotide pair, s(ij) (in units of log2, called bits), is multiplied by the probability of occurrence of that pair, q(ij), to give a weighted score. These weighted scores are summed over all nucleotide pairs to produce a value that represents the ability of the average nucleotide pair in the matrix to discriminate the actual alignment from chance alignments.
The average mutual information content is given by H = ∑ij q(ij)s(ij). The values in the following table are H(ij) = q(ij)s(ij) for each pair. The sum of all the values in the table gives H in bits, which is called the average mutual information content.
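The sum H = ∑ij q(ij)s(ij) can be illustrated with a short Python sketch. The two input distributions below are toy examples chosen to show the limiting cases and are not data from the thesis. The function name information_terms is hypothetical, and pairs with q(ij) = 0 are assumed to contribute nothing to the sum.

```python
import math

NUCS = "ACGT"

def information_terms(q, p):
    """Return the elements H(ij) = q(ij) * log2(q(ij) / (p(i) p(j)))."""
    terms = {}
    for (a, b), q_ab in q.items():
        # Convention: a pair that never occurs contributes 0 to H.
        terms[(a, b)] = q_ab * math.log2(q_ab / (p[a] * p[b])) if q_ab > 0 else 0.0
    return terms

# Toy background: all four nucleotides equiprobable.
p_uniform = {a: 0.25 for a in NUCS}

# Toy case 1: aligned nucleotides pair independently, q(ij) = p(i) p(j).
q_independent = {(a, b): 0.25 * 0.25 for a in NUCS for b in NUCS}

# Toy case 2: every aligned pair is a match, q(ii) = 1/4 and q(ij) = 0 otherwise.
q_identical = {(a, b): (0.25 if a == b else 0.0) for a in NUCS for b in NUCS}

h_independent = sum(information_terms(q_independent, p_uniform).values())
h_identical = sum(information_terms(q_identical, p_uniform).values())

print(h_independent)  # 0.0 bits
print(h_identical)    # 2.0 bits
```

With an equiprobable background, independent pairing carries no information, while a fully conserved alignment reaches the 2-bit ceiling for single nucleotides.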
The sum of all H(ij) gives H = 0.024464 bits. The higher the relative entropy of the target and background distributions, the more easily they are distinguished (Altschul, 1991). The same procedure is applied to calculate the average mutual information content for neighbor-dependent substitution matrices, where it is given by H = ∑ij,kl q(ij,kl)s(ij,kl). The maximum value of H in DNA is 2 bits for neighbor-independent substitutions (4 bits for neighbor-dependent substitutions). The maximum value of H occurs when all 4 nucleotides or 16 nucleotide pairs are equiprobable (completely random).

Standard error calculation
To assess the reliability of our computations, we performed a simple error analysis of the results. We treat the matrix elements H(ij) = s(ij)q(ij) of the information content matrix as the data and compute the standard error of the 16 elements (or 256 in the case of pair preferences) using standard techniques. The standard errors are plotted in the graph along with the histograms.
Std. err = 0.003809
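The neighbor-dependent calculation and the error analysis can be sketched together in Python. This is an illustration under stated assumptions rather than the thesis procedure. Dinucleotides are taken as overlapping windows, each sequence pair is counted in both orders, and "standard error" is taken to mean the sample standard deviation of the matrix elements divided by the square root of their number. It is not claimed to reproduce the values quoted above, and all names are hypothetical.

```python
import math
import statistics
from itertools import permutations, product

NUCS = "ACGT"
DINUCS = ["".join(d) for d in product(NUCS, repeat=2)]  # 16 dinucleotides
SEQS = [
    "GTACTA", "ATCTCC", "GTACTC", "AACGAT", "GGATTC",
    "GTTGAT", "CTACTC", "TCGCGC", "CACATG", "GTACCA",
]

def dinucleotide_columns(seqs):
    """Slide a one-nucleotide window to get w-1 columns of overlapping dinucleotides."""
    w = len(seqs[0])
    return [[s[k:k + 2] for s in seqs] for k in range(w - 1)]

def information_terms(columns, symbols):
    """Elements H = q * log2(q / (p p)) for every ordered symbol pair."""
    n_seq = len(columns[0])
    pair_counts = {(a, b): 0 for a in symbols for b in symbols}
    single_counts = {a: 0 for a in symbols}
    for col in columns:
        for a in col:
            single_counts[a] += 1
        # Assumption: each sequence pair is counted in both orders.
        for a, b in permutations(col, 2):
            pair_counts[(a, b)] += 1
    n_pairs = len(columns) * n_seq * (n_seq - 1)
    n_single = len(columns) * n_seq
    terms = {}
    for (a, b), n in pair_counts.items():
        q = n / n_pairs
        expected = (single_counts[a] / n_single) * (single_counts[b] / n_single)
        # Unobserved pairs contribute 0 to H.
        terms[(a, b)] = q * math.log2(q / expected) if q > 0 else 0.0
    return terms

def standard_error(values):
    """Assumed definition: sample standard deviation / sqrt(number of elements)."""
    values = list(values)
    return statistics.stdev(values) / math.sqrt(len(values))

terms = information_terms(dinucleotide_columns(SEQS), DINUCS)
H_di = sum(terms.values())
se_di = standard_error(terms.values())
print(f"H = {H_di:.6f} bits over {len(terms)} elements, std. err = {se_di:.6f}")
```

The same information_terms function applies to the mono-nucleotide case if it is given the columns of single nucleotides and the four-letter alphabet, which yields the 16 elements used in the error analysis.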