Technical Solutions

Tuesday, August 25, 2020

Different types of scoring matrix

We know how to calculate a scoring matrix. There are different types of scoring matrix, each based on how frequently certain amino acids occur within biological sequences, and there are multiple strategies for computing these frequencies. Here we are going to discuss the PAM scoring matrix.

Alignments are scored well when we use an empirical scoring method, that is, a scoring matrix based on experimentally observed frequencies of amino acids. There are multiple types of matrix, but the main ones are the PAM and BLOSUM matrices.

So first we will discuss the PAM scoring matrix.

PAM - Point Accepted Mutation matrix.

A point accepted mutation is the substitution of one amino acid by another such that the protein stays conserved.

Note: There are cases where the substitution of one amino acid by another changes the protein, but PAM considers only those mutations (substitutions) where the overall function of the protein stays the same.

PAM Unit:

One PAM unit is the time in which 1% of the amino acids in a sequence undergo accepted mutations; this corresponds to PAM1.

Because sequences are long and contain many nucleotides or amino acids, a change of 1% per PAM unit does not mean that 100 PAM units produce 100% variation in the sequence. The same site can change more than once, so if some sites mutate multiple times, other sites of the protein remain unchanged.

To understand this, consider a graph in which sequence difference and PAM distance are compared.

Figure 1:

Here a PAM distance of 100 doesn't mean a 100 percent change in the sequence, because a site that has already mutated can mutate again and so accumulate multiple mutations.

Experimental data show that at a PAM distance of 100 only 55 to 60 percent of the sites in the protein are actually mutated. For 85% variation the PAM distance rises above 300.
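The saturation effect can be illustrated with a small Python sketch. This is a toy model and not the experimental data behind the figure: it assumes that in every PAM step each site mutates with probability 1% and that all 19 other amino acids are equally likely replacements. The function name and parameters are made up for this illustration.

```python
def observed_difference(pam_distance, step_rate=0.01, alphabet_size=20):
    """Expected fraction of sites that differ from the original sequence.

    Toy assumption: every replacement residue is equally likely,
    which is NOT true for real proteins.
    """
    same = 1.0  # probability that a site currently shows its original residue
    back_rate = step_rate / (alphabet_size - 1)  # chance of mutating back
    for _ in range(pam_distance):
        # a site stays original, or a changed site mutates back to the original
        same = same * (1.0 - step_rate) + (1.0 - same) * back_rate
    return 1.0 - same


for distance in (1, 50, 100, 250, 300):
    print(distance, round(observed_difference(distance), 3))
```

Even in this simplified model, a PAM distance of 100 leaves well under 100% of the sites different, and the curve flattens as the distance grows. The percentages quoted in the text come from experimental data, so they do not match this toy curve exactly.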

figure 2

In figure 2 the value of k is 20 (i.e. the 20 amino acids). You can calculate PAM1 by using this calculation.

If you want to find PAM2, just square PAM1, that is, multiply the PAM1 matrix by itself.

Similarly, if you want to find PAMn, raise PAM1 to the n-th power, i.e. multiply n copies of PAM1 together.

Raising PAM1 to the 250th power gives the PAM250 matrix, which yields a substitution matrix like this:
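The repeated multiplication can be sketched in plain Python. The 3x3 matrix below is a made-up stand-in for the real 20x20 PAM1 matrix (each row sums to 1 and each residue has a 1% chance of changing); the function names are hypothetical too.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Return PAM1 raised to the n-th power (n >= 0)."""
    size = len(pam1)
    # start from the identity matrix, which corresponds to "PAM0"
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Hypothetical mutation probabilities for a 3-letter alphabet (not real data).
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

pam2 = pam_n(toy_pam1, 2)
pam250 = pam_n(toy_pam1, 250)
for row in pam250:
    print([round(value, 3) for value in row])
```

Each row of the result still sums to 1, and the diagonal entries shrink as n grows, because more and more sites have been replaced.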

In conclusion, PAM matrices are scoring matrices that are used for comparing sequences. We can compute PAM1, PAM2 and so on, up to PAM250 or more, and PAM120 is considered an optimal scoring matrix.

Deriving Scoring Matrix

We know the importance of scoring matrices. Instead of giving the same score to all matches or all mismatches, a scoring matrix allows us to have a flexible scoring scheme. A flexible scoring scheme means that the substitution of a specific amino acid by another is scored differently for each pair. All of the scores, for matches and mismatches alike, are computed from the frequency of occurrence of these amino acids in similar protein sequences.

How do we arrive at the scoring matrix?

We compare biological sequences by aligning them, which we call pairwise sequence alignment, and during this alignment we rank the matches between sequences using a scoring matrix.

By looking at the chemical properties of amino acids we can see which amino acid substitutions will get a higher score

and which amino acid substitutions will get a lower score.

In the above figure the 20 amino acids are listed, and the colorful diagram shows their chemical properties as well.

For example, the green ones are the polar amino acids and the brown ones, as indicated, are the charged amino acids. There are many hydrophobic amino acids, as well as hydroxylic amino acids, tiny amino acids (basically very small amino acids), small amino acids, acidic amino acids and basic amino acids. These are the different properties that cause amino acids to be substituted or conserved within protein sequences.

Figure2

Now let's consider the report generated by Robinson and Robinson several years ago.

In this report the column states that each amino acid has a specific average frequency of occurrence. For example, alanine has an average frequency of 0.078 on a scale of 0 to 1, and similarly for arginine and the rest.

Now let's discuss how they computed these frequencies: they took similar protein sequences, counted the occurrences of each amino acid, and normalised the counts to values between zero and one.
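The counting and normalising step can be sketched in Python as follows. The three short sequences are invented for illustration and are not taken from the Robinson and Robinson data; the function name is also an assumption.

```python
from collections import Counter


def residue_frequencies(sequences):
    """Count every residue in the sequences and normalise to fractions of 1."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)  # adds one count per character
    total = sum(counts.values())
    return {residue: count / total for residue, count in counts.items()}


# Hypothetical mini data set (12 residues in total).
toy_sequences = ["MARK", "MAKK", "LARK"]
frequencies = residue_frequencies(toy_sequences)
for residue in sorted(frequencies):
    print(residue, round(frequencies[residue], 3))
```

Because every count is divided by the total number of residues, the frequencies always add up to 1.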

Example:

To simplify the problem, consider only three of the 20 amino acids, and a small protein made up of 4 amino acids.

Next, count how many A's are converted into A's, how many A's into B's, how many A's into C's, how many B's into B's, how many B's into C's, and how many C's into C's.

The observed frequencies fij in fig. 5 show how often each conversion occurs. There are 60 conversions in total, so for C to C the value is 2/60.

fig.6

The C in sequence no. 3 is converted to another C in sequence no. 5, and likewise the C in sequence no. 2 is converted to another C in sequence no. 3, so the count is 2 out of 60.

For B to B, there are 6 conversions of a B into another B, so we get 6 over 60.

figure 7

The other entries are calculated in the same way.

Here fij denotes the frequency of conversion between amino acid i and amino acid j.
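The pair counting can be sketched in Python. The alignment below is hypothetical: it is not the one in the figures, only a six-sequence, four-column example arranged so that it reproduces the numbers quoted in the text (60 pairs in total, 2 C-to-C pairs and 6 B-to-B pairs). The sketch assumes that every pair of sequences is compared column by column.

```python
from collections import Counter
from itertools import combinations


def count_pairs(alignment):
    """Count unordered residue pairs in every column of an alignment."""
    pair_counts = Counter()
    for column in zip(*alignment):          # one tuple of residues per column
        for x, y in combinations(column, 2):  # every pair of sequences
            pair_counts[tuple(sorted((x, y)))] += 1
    return pair_counts


# Hypothetical alignment: 6 sequences of 4 residues over the alphabet A, B, C.
toy_alignment = ["BAAA", "BCAA", "BCCA", "BBAA", "AACA", "AAAB"]

pair_counts = count_pairs(toy_alignment)
total_pairs = sum(pair_counts.values())
for pair in sorted(pair_counts):
    print(pair, f"{pair_counts[pair]}/{total_pairs}")
```

With 6 sequences there are 15 sequence pairs per column, and 4 columns give the 60 conversions in total.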

figure 8

In fig. 8 we can see the observed frequencies fi, which are the frequencies of each amino acid over the entire set of proteins. For C, the value 4 over 24 means that 4 of the 24 amino acids in the sequences are C.

figure 9

Now consider how each entry of the scoring matrix is computed.

Here S(i,j) is the entry of the scoring matrix for amino acids i and j. It is computed as the base-2 logarithm of fij over pij, that is, S(i,j) = log2(fij / pij). Here pij is the expected frequency of the pair: if i equals j then pij = fi × fi, and if i is not equal to j then pij = 2 × fi × fj.

In this way pij is computed.

Then 2sij is computed as well.

And finally the scoring matrix is obtained.

Each element of the scoring matrix is an integer, so all the 2sij values are rounded to integers.

Finally the Scoring Matrix for the 3 amino acids is obtained.
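The whole calculation can be sketched in Python. The pair counts and residue counts below are hypothetical inputs (not the values from the figures), chosen only to agree with the numbers quoted in the text: 60 pairs, 2 C-to-C pairs, 6 B-to-B pairs, and 4 C's out of 24 residues. The sketch also assumes that no pair has a count of zero, since the logarithm would be undefined in that case.

```python
import math

# Hypothetical observed pair counts (60 pairs) and residue counts (24 residues).
pair_counts = {
    ("A", "A"): 20, ("A", "B"): 16, ("A", "C"): 14,
    ("B", "B"): 6, ("B", "C"): 2, ("C", "C"): 2,
}
residue_counts = {"A": 14, "B": 6, "C": 4}


def scoring_matrix(pair_counts, residue_counts):
    """Return rounded 2 * log2(fij / pij) for every residue pair."""
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    f = {r: n / total_residues for r, n in residue_counts.items()}
    scores = {}
    for (i, j), count in pair_counts.items():
        f_ij = count / total_pairs                          # observed frequency
        p_ij = f[i] * f[i] if i == j else 2 * f[i] * f[j]   # expected frequency
        s_ij = math.log2(f_ij / p_ij)
        scores[(i, j)] = round(2 * s_ij)                    # integer entry
    return scores


scores = scoring_matrix(pair_counts, residue_counts)
for pair in sorted(scores):
    print(pair, scores[pair])
```

A pair that is seen more often than expected gets a positive score, and a pair that is seen less often than expected (B with C in this toy data) gets a negative one.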

In conclusion, based on the frequencies of the amino acids we can compute the scoring matrix, and then we can use it to compare the sequences we are given.

Monday, August 24, 2020

Scoring Matrix in Bioinformatics

Pairwise sequence alignment, the comparison of two sequences as a pair, is essentially a comparison of their constitution. The constitution can be amino acids or nucleotides. When you compare two sequences, you want to score the result. But in real life, due to evolutionary pressures, the rate at which one amino acid is replaced by another differs from pair to pair. It is therefore necessary to capture this variable propensity for replacement or substitution with suitable scores. The scoring matrix helps us with this goal.

We will see what scoring matrices are, how we build them, and how we can use them in the alignment process.

Introduction of Scoring Matrix

An amino acid can be replaced by another amino acid depending on their chemical, physical and other special properties. If two amino acids have the same chemical behaviour, there is a high chance that one is substituted for the other during evolution. However, if their properties are totally different, there is a very low chance of one being replaced by the other.

Scoring matrices are therefore variable or flexible in scoring such substitutions: the substitution of each amino acid is scored differently, i.e. a scoring matrix holds a separate substitution value for each pair of amino acids.

How to build Scoring Matrix?

Consider the protein sequences that exist in nature, and find protein sequences that are similar and homologous to each other. Once we have isolated a set of similar protein sequences, we see which amino acid in one sequence is substituted by which amino acid in the other sequences. In this way we build a frequency list that records how frequently each amino acid is substituted by another amino acid.

Scores in the Scoring Matrix

Based on such frequencies, an entry of the scoring matrix may be a positive value, meaning a very easy transition or substitution from one amino acid to another; a negative value, meaning a rare substitution; or zero.

Here is an example of a protein called ubiquitin.

Ubiquitin sequences from humans, chimps, mice, etc. are shown aligned with each other. As we can see, some amino acids are completely conserved, some are rarely substituted, and others are changed.

So what we do with such a frequency count is apply a formula:

S(a,b) = c × log(Pab / (fa × fb))

Here S is the scoring matrix,

a is the first amino acid,

b is the second amino acid.

So the substitution of a by b is scored by multiplying some constant c

by the log of this ratio.

Pab is the probability of amino acid 'a' being substituted by amino acid 'b', where a and b can be any two amino acids.

fa and fb are the frequencies of amino acids a and b.

So by computing S for a and b, and varying a and b over all 20 amino acids, you arrive at the scoring matrix.
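A single entry of such a matrix can be sketched in Python. The text does not fix the constant or the base of the logarithm, so this sketch assumes a constant of 2 and a base-2 logarithm; the probabilities and frequencies are invented so that the three possible outcomes (positive, zero, negative) are easy to see.

```python
import math


def log_odds_score(p_ab, f_a, f_b, constant=2.0):
    """Score = constant * log2(P_ab / (f_a * f_b)).

    The constant and the log base are assumptions made for this sketch.
    """
    return constant * math.log2(p_ab / (f_a * f_b))


# Hypothetical frequencies: both residues make up a quarter of all residues,
# so a pairing frequency of 0.0625 is exactly what chance would give.
f_a = f_b = 0.25
print(log_odds_score(0.125, f_a, f_b))    # more frequent than chance
print(log_odds_score(0.0625, f_a, f_b))   # exactly as frequent as chance
print(log_odds_score(0.03125, f_a, f_b))  # less frequent than chance
```

A substitution that happens more often than chance gets a positive score, one that happens as often as chance gets zero, and a rare one gets a negative score.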

Let us consider a scoring matrix:

Consider the negative score of -5 when D is substituted by W, or the zeros when S is substituted by D, Q or G. The diagonal, indicated by the red line, shows very high scores. In particular, if C is substituted by another C the score is 13, which means C is mostly conserved.
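To show how such a matrix is used, here is a small Python sketch that scores an ungapped alignment by looking up each aligned pair. It contains only the five entries quoted in the text; the two short sequences and the function names are made up for illustration, and any pair outside those five entries would raise a KeyError.

```python
# Only the entries mentioned in the text; a real matrix has all 20 x 20 pairs.
# frozenset makes the lookup symmetric: (D, W) and (W, D) share one entry.
scores = {
    frozenset("C"): 13,   # C aligned with C
    frozenset("DW"): -5,  # D aligned with W
    frozenset("SD"): 0,
    frozenset("SQ"): 0,
    frozenset("SG"): 0,
}


def pair_score(a, b):
    """Look up the score for aligning residue a with residue b."""
    return scores[frozenset((a, b))]


def alignment_score(seq1, seq2):
    """Sum the pair scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


# Hypothetical sequences: C/C, S/D, D/W and S/G are aligned in turn.
print(alignment_score("CSDS", "CDWG"))
```

The conserved C contributes most of the total, while the rare D-to-W substitution pulls the score down.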

So in conclusion, positive values like +1 or +5 are assigned to matches and negative values like -10 or -2 to mismatches, and the scoring matrices score each pair of amino acids differently depending on their chemical and physical properties, which are reflected in their frequencies of occurrence.