Tuesday, August 25, 2020

Deriving Scoring Matrix

We know the importance of scoring matrices. Instead of giving the same score to every match or mismatch, a scoring matrix allows us to have a flexible scoring scheme. A flexible scoring scheme means that the substitution of a specific amino acid by another is scored differently for each pair, and the same holds for matches as well as mismatches. All of the scores are computed from the frequency of occurrence of these amino acids in similar protein sequences.

How do we arrive at the scoring matrix?

We compare biological sequences by aligning them, which is called pairwise sequence alignment. During this alignment we score the matches and mismatches between the sequences using a scoring matrix.


By looking at the chemical properties of amino acids, we can see which amino acid substitutions will get a higher score and which will get a lower score.













In the above figure the 20 amino acids are listed, and the colored diagram shows their chemical properties as well.

For example, the green ones are the polar amino acids and the brown ones, as indicated, are the charged amino acids. There are also many hydrophobic amino acids, hydroxylic amino acids, tiny amino acids (basically very small amino acids), small amino acids, acidic amino acids and basic amino acids. These are the different properties of amino acids that cause them to be substituted or conserved within protein sequences.


Figure 2

Now let's consider the report generated by Robinson and Robinson several years ago.

In this report, the columns state that each amino acid has a specific average frequency of occurrence. For example, alanine has an average frequency of 0.078 on a scale of 0 to 1, and similarly for arginine and the others.

Now let's discuss how they computed these frequencies: similar protein sequences were considered, the occurrences of each amino acid were counted, and the counts were normalised to lie between zero and one.

Example: 

To simplify the problem, consider only three of the 20 amino acids, and a small protein consisting of 4 amino acids.









Next, count how many A's are converted into A's, how many A's into B's, how many A's into C's, how many B's into B's, how many B's into C's, and how many C's into C's.










The observed frequencies fij in Figure 5 are the frequencies of occurrence of each conversion. There are 60 conversions in total, so for C to C the frequency is 2/60.








Figure 6

The C in sequence no. 3 is converted to another C in sequence no. 5, and likewise the C in sequence no. 2 is converted to another C in sequence no. 3, so the count is 2 out of 60.

If we consider B to B, 6 B's are converted to other B's, so we get 6 over 60.






Figure 7

The other frequencies are calculated in the same way.

Here fij denotes the observed frequency of the conversion of amino acid i to amino acid j.
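The counting described above can be sketched in a few lines of Python. The six sequences below are hypothetical: they are not the ones from the figure, but were chosen so that they have 4 amino acids each and reproduce the counts quoted in the text (60 conversions in total, 2 for C to C and 6 for B to B). The sketch also assumes that a "conversion" is any pair of residues taken from the same position in two different sequences, and the names `sequences` and `count_pairs` are made up for this illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical alignment (NOT the one from the figure): 6 sequences, 4 residues each.
sequences = ["AABA", "ACBB", "CCAB", "BBBA", "CAAB", "AAAA"]


def count_pairs(seqs):
    """Count unordered residue pairs in every column of an ungapped alignment."""
    counts = Counter()
    for column in zip(*seqs):                 # walk the alignment position by position
        for x, y in combinations(column, 2):  # every pair of sequences at this position
            counts["".join(sorted((x, y)))] += 1  # "BA" and "AB" are the same pair
    return counts


pair_counts = count_pairs(sequences)
total_pairs = sum(pair_counts.values())

# Observed frequencies f_ij: pair count divided by the total number of pairs.
f_pair = {pair: n / total_pairs for pair, n in pair_counts.items()}

for pair in sorted(pair_counts):
    print(f"{pair}: {pair_counts[pair]}/{total_pairs}")
```

Sorting the two letters before counting means A to B and B to A land in the same bucket, which is why only six kinds of conversion are listed for three amino acids.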






Figure 8

In Figure 8 we can see the observed frequencies fi, which are the frequencies of each amino acid over the entire set of proteins. For C, the value 4 over 24 means that 4 of the 24 amino acids in the whole set of sequences are C.
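As a small sketch of this step, the frequencies fi can be obtained by counting every residue and dividing by the total. The sequences are hypothetical (they are not the ones from the figure) and were only chosen so that C appears 4 times among 24 residues, as in the text; `residue_frequencies` is a name made up for this example.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical alignment (NOT the one from the figure): 6 sequences, 4 residues each.
sequences = ["AABA", "ACBB", "CCAB", "BBBA", "CAAB", "AAAA"]


def residue_frequencies(seqs):
    """Return f_i: the share of each residue among all residues in the set."""
    counts = Counter("".join(seqs))   # count every residue in every sequence
    total = sum(counts.values())      # total number of residues
    return {res: Fraction(n, total) for res, n in counts.items()}


f = residue_frequencies(sequences)
for res in sorted(f):
    print(res, f[res])
```

Exact fractions are used here instead of floats so the values can be compared directly with the "4 over 24" style used in the text.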













Figure 9

Now consider how each entry of the scoring matrix is computed.




Here S(i,j) is an entry of the scoring matrix. Any entry, for example the one for i and j, can be computed as the base-2 logarithm of fij over pij, that is S(i,j) = log2(fij / pij). Here pij is the expected frequency of occurrence of the pair: if the amino acids are the same (i = j), then pij = fi × fi; if they are different (i ≠ j), then pij = 2 × fi × fj.
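Written as an equation, the rule described above reads as follows. This is only a restatement of the text, with s_{ij} standing for S(i,j); nothing new is assumed.

```latex
s_{ij} = \log_2 \frac{f_{ij}}{p_{ij}},
\qquad
p_{ij} =
\begin{cases}
  f_i \, f_i   & \text{if } i = j, \\
  2 \, f_i \, f_j & \text{if } i \neq j.
\end{cases}
```

The factor 2 for i ≠ j accounts for the fact that the pair (i, j) and the pair (j, i) are counted as the same conversion.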







pij is computed likewise for every pair.
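A minimal sketch of the pij computation is given here. Only the value 4/24 for C comes from the text; the frequencies for A and B are assumed for illustration, and `expected_pair_frequencies` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

# Residue frequencies f_i. The value for C (4/24) is from the text;
# the values for A and B are assumptions made for this example.
f = {"A": Fraction(12, 24), "B": Fraction(8, 24), "C": Fraction(4, 24)}


def expected_pair_frequencies(freqs):
    """Return p_ij: f_i * f_i when i == j, and 2 * f_i * f_j when i != j."""
    p = {}
    for i, j in combinations_with_replacement(sorted(freqs), 2):
        if i == j:
            p[i + j] = freqs[i] * freqs[j]
        else:
            p[i + j] = 2 * freqs[i] * freqs[j]
    return p


p = expected_pair_frequencies(f)
for pair in sorted(p):
    print(pair, p[pair])
```

A quick sanity check on any such table is that all the pij values add up to 1, because together they cover every possible pair.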

You can see that 2sij, twice the value of sij, is also computed.







And finally the scoring matrix is obtained.





We can see that each element of the scoring matrix is an integer, so all the values of 2sij are rounded to integers.
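Putting the steps together, the whole derivation can be sketched as below. The alignment is hypothetical (not the one from the figures), and the sketch assumes that the final entries are the values 2sij rounded to the nearest integer, which is how the text is read here. Pairs that never occur in the alignment get no entry, because the logarithm of zero is undefined.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical alignment (NOT the one from the figures): 6 sequences, 4 residues each.
sequences = ["AABA", "ACBB", "CCAB", "BBBA", "CAAB", "AAAA"]


def build_scoring_matrix(seqs):
    """Derive integer scores round(2 * log2(f_ij / p_ij)) from an ungapped alignment."""
    # Observed pair counts, column by column.
    pair_counts = Counter()
    for column in zip(*seqs):
        for x, y in combinations(column, 2):
            pair_counts["".join(sorted((x, y)))] += 1
    total_pairs = sum(pair_counts.values())

    # Residue frequencies f_i over the whole set of sequences.
    residue_counts = Counter("".join(seqs))
    total_residues = sum(residue_counts.values())
    f = {res: n / total_residues for res, n in residue_counts.items()}

    matrix = {}
    for pair, n in pair_counts.items():
        i, j = pair
        f_ij = n / total_pairs                               # observed frequency
        p_ij = f[i] * f[j] if i == j else 2 * f[i] * f[j]    # expected frequency
        matrix[pair] = round(2 * math.log2(f_ij / p_ij))     # assumed rounding of 2*s_ij
    return matrix


matrix = build_scoring_matrix(sequences)
for pair in sorted(matrix):
    print(pair, matrix[pair])
```

A positive score means the pair is seen more often than expected from the individual frequencies, and a negative score means it is seen less often.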

Finally the Scoring Matrix for the 3 amino acids is obtained.

In conclusion, based on the frequencies of the amino acids we can compute the scoring matrix, and then use it to compare the sequences we are given.
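To illustrate that last point, here is a minimal sketch of scoring two already aligned sequences with such a matrix. The integer scores and the two sequences are hypothetical values for a three-letter alphabet, gaps are not handled, and `alignment_score` is a made-up helper name.

```python
# Hypothetical integer scores for a three-letter alphabet (illustration only).
matrix = {"AA": -1, "AB": 1, "AC": 1, "BB": 0, "BC": -1, "CC": 1}


def alignment_score(seq1, seq2, scores):
    """Sum the matrix scores over the aligned positions of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length (no gaps supported)")
    # Sort each pair so that "CB" looks up the same entry as "BC".
    return sum(scores["".join(sorted((x, y)))] for x, y in zip(seq1, seq2))


score = alignment_score("ABCA", "ABBC", matrix)
print(score)
```

Storing each pair only once under its sorted letters keeps the matrix symmetric by construction.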











