Comparative Analysis of Gene Regulatory Regions Using Mutual Information

Objectives: This thesis attempts to answer the following questions:

  1. How is information distributed in the core promoter?
  2. What is the nature of the TSS region in different species/organisms?
  3. How are TFs functionally related, based on the information content of their TFBS?

Databases used in this study: various open-source databases that are freely downloadable from the internet.

  1. PlantPromDB- A plant promoter database (Shahmuradov et al., 2003)
  2. PromEC- E. coli promoter database (Hershberg et al., 2001)
  3. EPD- Eukaryotic promoter database (Perier et al., 1998)
  4. Entrez genome - Mitochondrial control elements (www.ncbi.nlm.nih.gov).
  5. JASPAR- Transcription Factor Binding Sites (Sandelin et al., 2004)
  6. TRRD- Transcription Regulatory Regions Database (Kolchanov et al., 2002)

The following steps are used to calculate the information content of DNA sequences.

  1. Sequence alignment
  2. Construction of substitution matrices
  3. Neighbor-independent and -dependent substitutions
  4. Average mutual information content (H)

Publications
  1. D Ashok Reddy, B V L S Prasad and Chanchal K Mitra. (2006) Comparative analysis of core promoter region: Information content from mono and dinucleotide substitution matrices. Computational Biology and Chemistry, 30, 58-62.
  2. D Ashok Reddy, B V L S Prasad and Chanchal K Mitra. (2006) Functional classification of transcription factor binding sites: Information content as a metric. Journal of Integrative Bioinformatics, 3(1), 0020.
  3. D Ashok Reddy and Chanchal K Mitra. (2006) Comparative analysis of transcription start site using mutual information. Genomics, Proteomics & Bioinformatics. 4(3), 183-195.

Calculation of Information Content from DNA sequences

1. Sequence alignment

Input: 10 sequences (Seq-1 to Seq-10) of length 6 nt each. These sequences are already aligned on the TSS, which is represented by +1 in the databases. The following calculations show how to calculate the information content from the aligned sequences.

Seq-1...GTACTA...
Seq-2...ATCTCC...
Seq-3...GTACTC...
Seq-4...AACGAT...
Seq-5...GGATTC...
Seq-6...GTTGAT...
Seq-7...CTACTC...
Seq-8...TCGCGC...
Seq-9...CACATG...
Seq-10..GTACCA...

2. Substitution matrices

We used two types of substitution matrices, distinguished by the unit of substitution. The first is based on single-nucleotide (conventional) substitutions and the second on substitutions of two nucleotides at a time (dinucleotide). Single-nucleotide substitutions give a 4×4 matrix and dinucleotide substitutions give a 16×16 matrix. For both types, matrices were constructed position-wise as well as block-wise.

Mono-nucleotide (nucleotide-independent) substitution matrices

For each column of the block, we first count the number of matches and mismatches of each type between the first sequence and every other sequence in the block. This procedure is repeated for all columns of the block, with the summed results stored in a 4×4 matrix. The same procedure is then followed for each remaining sequence in the alignment (compared with the sequences that follow it), and the counts are added to those already in the 4×4 matrix. While counting matches and mismatches, a sliding window of one nucleotide along the sequence is used so that all possible pairs in the block are counted. The total number of nucleotide pairs in a given block (used for the observed frequencies) is ½ws(s-1), and the total number of nucleotides in the block (used for the expected frequencies) is ws, where s is the number of sequences (that is, the number of nucleotides at each position) and w is the block width. The resulting 4×4 matrix is used to calculate the odds ratio between the observed pair frequencies q(ij) and those expected by chance, p(i)p(j). This odds ratio, q(ij)/(p(i)p(j)), is also called a likelihood ratio. The "log-odds" score is then calculated from the odds ratio (usually with the logarithm to base 2) and is given by s(ij) = log2(q(ij)/(p(i)p(j))). For independent occurrences, odds ratios are multiplied or, equivalently, log-odds are added (Karlin and Altschul, 1990; Altschul, 1991).

Position (column) wise nucleotide pairs

1st position

Nucleotide   Pairs with each sequence that follows
G            GA GG GA GG GG GC GT GC GG
A            AG AA AG AG AC AT AC AG
G            GA GG GG GC GT GC GG
A            AG AG AC AT AC AG
G            GG GC GT GC GG
G            GC GT GC GG
C            CT CC CG
T            TC TG
C            CG
G            (none)

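Adding the pair counts from all columns of the block gives the following matrix. Rows give the nucleotide in the earlier sequence of each pair and columns the nucleotide in the later sequence. As this is a mono-nucleotide substitution, the matrix is 4×4 (for di-nucleotide substitutions it is a 16×16 matrix).
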
      A     C     G     T
A    14    20    15    16
C    20    25    14    16
G     8    17    11     9
T    19    25    14    27

Total nucleotides in the block

      A     C     G     T
     14    18    11    17

The total number of nucleotide pairs in a given block = ½ × w × s × (s - 1)
 = ½ × 6 × 10 × (10 - 1)
 = 270
The total number of nucleotides in the block = w × s
 = 6 × 10
 = 60
 s = number of sequences in the block
 w = block width
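
The counting procedure and the totals above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the thesis: the names `SEQS`, `count_pairs` and `count_nucleotides` are hypothetical, and the sketch assumes that each pair is counted once, in the order (earlier sequence, later sequence), which is the convention that reproduces the 4×4 matrix of the worked example.

```python
from itertools import combinations

# The ten aligned 6-nt sequences from the worked example.
SEQS = ["GTACTA", "ATCTCC", "GTACTC", "AACGAT", "GGATTC",
        "GTTGAT", "CTACTC", "TCGCGC", "CACATG", "GTACCA"]
ALPHABET = "ACGT"


def count_pairs(seqs):
    """Count ordered nucleotide pairs (earlier sequence, later sequence)."""
    counts = {a: {b: 0 for b in ALPHABET} for a in ALPHABET}
    for column in zip(*seqs):                 # one alignment column at a time
        for x, y in combinations(column, 2):  # each pair of sequences once
            counts[x][y] += 1
    return counts


def count_nucleotides(seqs):
    """Count how often each nucleotide occurs in the whole block."""
    totals = {a: 0 for a in ALPHABET}
    for seq in seqs:
        for nt in seq:
            totals[nt] += 1
    return totals


pair_counts = count_pairs(SEQS)
nt_totals = count_nucleotides(SEQS)

s, w = len(SEQS), len(SEQS[0])          # number of sequences, block width
total_pairs = w * s * (s - 1) // 2      # 1/2 * w * s * (s - 1)
total_nucleotides = w * s

for row in ALPHABET:
    print(row, [pair_counts[row][col] for col in ALPHABET])
print(nt_totals, total_pairs, total_nucleotides)
```

Because pairs are counted in sequence order, the matrix need not be symmetric (for example, AG and GA differ in the example).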
Observed frequency of GT: q(GT) = 9/270
Expected frequency of GT: p(G) × p(T) = (11/60) × (17/60)
Odds ratio = q(GT)/(p(G) × p(T)) = 0.6417

      A        C        G        T
A   0.9522   1.0582   1.2987   0.8963
C   1.0582   1.0288   0.9427   0.6971
G   0.6926   1.1447   1.2121   0.6417
T   1.0644   1.0893   0.9982   1.2456

Log-odds ratio s(ij) = log2(q(ij)/(p(i) × p(j)))

      A          C          G          T
A   -0.07038    0.0816     0.3770    -0.1578
C    0.08161    0.0409    -0.0850    -0.5204
G   -0.5298     0.1950     0.2775    -0.6400
T    0.0900     0.1234    -0.00257    0.3169
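
As a cross-check of the odds-ratio and log-odds steps, the short sketch below recomputes s(ij) from the pair counts and nucleotide totals of the example. The function name `log_odds_matrix` and the dictionary layout are illustrative assumptions, and the sketch assumes every pair count is positive (a zero count would need separate handling, since log2(0) is undefined).

```python
import math

ALPHABET = "ACGT"

# Pair counts (row = earlier sequence, column = later sequence) and
# nucleotide totals, copied from the worked example.
PAIR_COUNTS = {
    "A": {"A": 14, "C": 20, "G": 15, "T": 16},
    "C": {"A": 20, "C": 25, "G": 14, "T": 16},
    "G": {"A": 8, "C": 17, "G": 11, "T": 9},
    "T": {"A": 19, "C": 25, "G": 14, "T": 27},
}
NT_TOTALS = {"A": 14, "C": 18, "G": 11, "T": 17}


def log_odds_matrix(pair_counts, nt_totals):
    """Return s(ij) = log2(q(ij) / (p(i) * p(j))) for every nucleotide pair."""
    n_pairs = sum(sum(row.values()) for row in pair_counts.values())  # 270
    n_nt = sum(nt_totals.values())                                    # 60
    scores = {}
    for i in ALPHABET:
        scores[i] = {}
        for j in ALPHABET:
            q_ij = pair_counts[i][j] / n_pairs                 # observed
            expected = (nt_totals[i] / n_nt) * (nt_totals[j] / n_nt)
            scores[i][j] = math.log2(q_ij / expected)          # assumes q_ij > 0
    return scores


scores = log_odds_matrix(PAIR_COUNTS, NT_TOTALS)
for i in ALPHABET:
    print(i, [round(scores[i][j], 4) for j in ALPHABET])
```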

Di-nucleotide (nucleotide-dependent) substitution matrices

Incorporating pair preferences into the substitution matrix gives neighbor-dependent substitution matrices. These are very similar to neighbor-independent substitution matrices, except that the subscripts are pairs of nucleotides. While counting matches and mismatches, a sliding window of one nucleotide along the sequence is used so that all possible pairs in the given block are counted. The total number of dinucleotide pairs in a given block (used for the observed frequencies) is ½s(w-1)(s-1), and the total number of dinucleotides (used for the expected frequencies) is s(w-1), where s is the number of sequences and w is the block width. The resulting 16×16 matrix is used to calculate the odds ratio between the observed frequencies q(ij,kl) and those expected by chance, p(ij)p(kl). This odds ratio, q(ij,kl)/(p(ij)p(kl)) (also called the likelihood ratio), is then used to calculate the "log-odds" score, given by s(ij,kl) = log2(q(ij,kl)/(p(ij)p(kl))).
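
The dinucleotide counting can be sketched in the same way as the mono-nucleotide case. The example below is a hedged illustration with hypothetical names (`SEQS`, `DINUCS`, `count_dinucleotide_pairs`); it assumes overlapping dinucleotides taken with a one-nucleotide sliding window and pairs counted in the order (earlier sequence, later sequence). For the ten 6-nt sequences of the worked example it should give ½ × 10 × 5 × 9 = 225 dinucleotide pairs drawn from 10 × 5 = 50 dinucleotides.

```python
from itertools import combinations, product

# The ten aligned 6-nt sequences from the worked example.
SEQS = ["GTACTA", "ATCTCC", "GTACTC", "AACGAT", "GGATTC",
        "GTTGAT", "CTACTC", "TCGCGC", "CACATG", "GTACCA"]
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT


def count_dinucleotide_pairs(seqs):
    """Count ordered dinucleotide pairs into a 16x16 table."""
    counts = {a: {b: 0 for b in DINUCS} for a in DINUCS}
    width = len(seqs[0])
    for pos in range(width - 1):                       # window slides by 1 nt
        column = [seq[pos:pos + 2] for seq in seqs]    # dinucleotide per sequence
        for x, y in combinations(column, 2):           # each sequence pair once
            counts[x][y] += 1
    return counts


dinuc_counts = count_dinucleotide_pairs(SEQS)

s, w = len(SEQS), len(SEQS[0])
total_dinuc_pairs = s * (s - 1) * (w - 1) // 2   # 1/2 * s * (w - 1) * (s - 1)
total_dinucs = s * (w - 1)
print(total_dinuc_pairs, total_dinucs)
```

With only 225 pairs spread over 256 cells, most entries of this toy matrix are zero, so the log-odds step needs a rule for empty cells (for example, skipping them) before it can be applied to such a small block.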

3. Average mutual information content (H)

These non-coding regions can be compared either by the scores in the substitution matrices themselves or by the information content of these substitution matrices. In information-theoretic terms, the average mutual information content (H) is the relative entropy of the target and background pair frequencies and can be thought of as a measure of the average amount of information (in bits) available per nucleotide pair. For neighbor-independent substitution matrices, the log-odds score of each nucleotide pair s(ij) (in units of log2, called bits) is multiplied by the probability of occurrence of that pair, q(ij), to give a weighted score. These weighted scores are summed over all nucleotide pairs to produce a value that represents the ability of the average nucleotide pair in the matrix to discriminate the actual alignment from chance alignments. The average mutual information content is given by

H = ∑ij q(ij)s(ij)

The values in the following table are H(ij) = q(ij)s(ij) for each pair. The sum of all the values in the table gives H in bits, the average mutual information content.

      A          C          G          T
A   -0.00364    0.0060     0.0209    -0.00935
C    0.00604    0.00379   -0.0044    -0.0308
G   -0.01569    0.01228    0.0113    -0.0213
T    0.00633    0.0114    -0.0001     0.03169

Sum of all H(ij) will give H = 0.024464 bits
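
The sum can be checked directly from the pair counts and nucleotide totals of the example. The sketch below is illustrative only (the function name `average_mutual_information` is hypothetical); it treats a pair with zero observed frequency as contributing nothing, which is an assumption that does not arise in this example. For the counts above it should come out close to the 0.024464 bits quoted.

```python
import math

ALPHABET = "ACGT"

# Pair counts and nucleotide totals copied from the worked example.
PAIR_COUNTS = {
    "A": {"A": 14, "C": 20, "G": 15, "T": 16},
    "C": {"A": 20, "C": 25, "G": 14, "T": 16},
    "G": {"A": 8, "C": 17, "G": 11, "T": 9},
    "T": {"A": 19, "C": 25, "G": 14, "T": 27},
}
NT_TOTALS = {"A": 14, "C": 18, "G": 11, "T": 17}


def average_mutual_information(pair_counts, nt_totals):
    """Return H = sum over ij of q(ij) * s(ij), in bits."""
    n_pairs = sum(sum(row.values()) for row in pair_counts.values())
    n_nt = sum(nt_totals.values())
    h = 0.0
    for i in ALPHABET:
        for j in ALPHABET:
            q_ij = pair_counts[i][j] / n_pairs
            if q_ij == 0:
                continue  # assumed convention: empty cells add nothing
            p_i, p_j = nt_totals[i] / n_nt, nt_totals[j] / n_nt
            h += q_ij * math.log2(q_ij / (p_i * p_j))
    return h


H = average_mutual_information(PAIR_COUNTS, NT_TOTALS)
print(f"H = {H:.6f} bits")
```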

The higher the value of the relative entropy of target and background distributions, the more easily they are distinguished (Altschul, 1991). The same procedure is applied for calculating the average mutual information content in the case of neighbor-dependent substitution matrices. The average information content in neighbor-dependent substitution matrices is given by

H = ∑ij,kl q(ij,kl)s(ij,kl)

The maximum value of H in DNA is 2 bits for neighbor-independent substitutions (4 bits for neighbor-dependent substitutions). The maximum value of H occurs when all 4 nucleotides, or all 16 nucleotide pairs, are equiprobable (completely random).

Standard error calculation

To assess the reliability of our computations, we have performed a simple error analysis of the results. We consider the matrix elements H(ij) of the information content matrix s(ij)q(ij) as the elements of our data and compute the standard error of the 16 (or 256 in case of the pair preferences) elements using standard techniques. The standard errors are plotted in the graph along with the histograms.

Std.err = 0.003809
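
The text does not say which estimator lies behind "standard techniques". As a hedged sketch, dividing the population standard deviation of the 16 H(ij) values by √16 appears to reproduce the quoted figure, so that convention is assumed here; the sample standard deviation (n - 1 in the denominator) would give a slightly larger value. The names `h_elements` and `standard_error` are illustrative.

```python
import math
import statistics

ALPHABET = "ACGT"

# Pair counts and nucleotide totals copied from the worked example.
PAIR_COUNTS = {
    "A": {"A": 14, "C": 20, "G": 15, "T": 16},
    "C": {"A": 20, "C": 25, "G": 14, "T": 16},
    "G": {"A": 8, "C": 17, "G": 11, "T": 9},
    "T": {"A": 19, "C": 25, "G": 14, "T": 27},
}
NT_TOTALS = {"A": 14, "C": 18, "G": 11, "T": 17}


def h_elements(pair_counts, nt_totals):
    """Return the 16 matrix elements H(ij) = q(ij) * s(ij) as a flat list."""
    n_pairs = sum(sum(row.values()) for row in pair_counts.values())
    n_nt = sum(nt_totals.values())
    elements = []
    for i in ALPHABET:
        for j in ALPHABET:
            q_ij = pair_counts[i][j] / n_pairs
            expected = (nt_totals[i] / n_nt) * (nt_totals[j] / n_nt)
            elements.append(q_ij * math.log2(q_ij / expected))
    return elements


def standard_error(values):
    """Population standard deviation divided by sqrt(n) (assumed convention)."""
    return statistics.pstdev(values) / math.sqrt(len(values))


elements = h_elements(PAIR_COUNTS, NT_TOTALS)
std_err = standard_error(elements)
print(f"Std. err = {std_err:.6f}")
```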