6.2 Partition Functions
6.2.3 Codon Bias in Proteins
This statistical reasoning can also be used to analyze the codon bias in amino acids.
The problem is as follows. Every protein consists of a number of amino acids,
each of which is, in turn, encoded by a sequence of three bases (a codon) of the genetic code.
For example, Met (methionine) is encoded by the single triplet ATG. Unlike Met, most
amino acids are encoded by more than one triplet, because the genetic code is
degenerate. As it turns out, however, the different codons are not equivalent. There
is a statistically significant bias in the usage frequency of individual codons for a
specific amino acid, which varies from species to species. The underlying reason
for this bias seems to be that the tRNAs specific to a particular triplet of the
genetic code are not equally abundant either. Instead, some of the tRNAs are more
frequent than others. The more frequent a tRNA, the faster the relevant amino acid
can be incorporated into the growing protein by the ribosome. Therefore, the usage
of different codons has implications for the time it takes to translate a protein.
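To make "usage frequency of individual codons for a specific amino acid" concrete, the following minimal sketch counts, for each amino acid, how often each of its synonymous codons appears in an in-frame DNA sequence. The function name `codon_usage`, the deliberately partial codon table, and the toy sequence are illustrative assumptions and are not data from the text.

```python
from collections import Counter, defaultdict

# Partial standard genetic code (DNA codons) for a few amino acids only.
CODON_TABLE = {
    "ATG": "Met",
    "TGG": "Trp",
    "AAA": "Lys", "AAG": "Lys",
    "GAA": "Glu", "GAG": "Glu",
    "TTA": "Leu", "TTG": "Leu", "CTT": "Leu",
    "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
}


def codon_usage(seq):
    """Return {amino_acid: {codon: relative frequency}} for an in-frame sequence."""
    usable = len(seq) - len(seq) % 3           # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = defaultdict(Counter)
    for codon in codons:
        aa = CODON_TABLE.get(codon)
        if aa is not None:                     # skip codons outside the partial table
            counts[aa][codon] += 1
    return {
        aa: {codon: n / sum(ctr.values()) for codon, n in ctr.items()}
        for aa, ctr in counts.items()
    }


# Made-up toy sequence (hypothetical, not real data).
toy = "ATGCTGCTGAAACTGTTAAAGCTGAAA"
usage = codon_usage(toy)
```

In this toy sequence Leu is encoded four times by CTG and once by TTA, so the two codons are clearly not used equally. A real analysis would use the full codon table and many genes from one species.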
It turns out that not all amino acids are encoded by the fastest codons. As a
result, the majority of proteins are expressed at medium speed, while a
few highly optimized ones use mostly very fast (abundant) codons. Similarly,
very few proteins are coded for by predominantly rare codons. There are many
biological aspects to this, but some initial understanding of the system can be reached
by considering simple models based on partition functions.
To start, let us assume that every protein is under some selection pressure to be
expressed rapidly. This selection pressure will vary from protein to protein.
Generally, of course, the faster a protein is expressed, the better. However, since proteins
are expressed simultaneously, there is competition for tRNA between them.
Increasing the speed of one must decrease the speed of the others. Selection pressure
itself is not directly measurable, but within the framework of the partition function
we can model it as a preference, as in the case of the king and queen above.
Let us consider a protein of some length and consider a single amino acid within
the protein. Assume that the protein has N copies of this amino acid. Further assume
that this amino acid has $n_c$ different codons. Let us (arbitrarily) designate the first
codon as the fastest, i.e., the codon with the most abundant tRNA, and assign
to it some preference value $G_1 = a$. To simplify matters, let us now assume that the
other codons are much rarer than the first one and that they have, collectively,
a preference value of $G_2 = b$. The idea underlying the model is as follows: Over
evolutionary time scales, random mutations will lead to a random walk between the
individual codons. However, over time, those codons that are faster will be preferred.
The strength of this preference is expressed by $a$, which can be interpreted
as an evolutionary selection pressure. Given this model, we can then ask about the
probabilities of observing various possible configurations.
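As a rough illustration of what "probabilities of configurations" could mean here, the sketch below enumerates every assignment of "fastest codon" or "other codon" to the N sites of a small protein and normalises the weights. It rests on assumptions that are not taken from the text: each site contributes a Boltzmann-style weight exp(a) or exp(b), a larger value means "more preferred" (the sign convention may differ from the one used earlier in the chapter), and all non-fastest codons are lumped into a single class. The name `configuration_probabilities` is hypothetical.

```python
import itertools
import math


def configuration_probabilities(N, a, b):
    """Probability of every fast/other assignment over N sites.

    Assumption: a site using the fastest codon contributes weight exp(a),
    a site using any other codon contributes exp(b); larger means preferred.
    """
    weights = {}
    for config in itertools.product((True, False), repeat=N):
        k = sum(config)                          # sites using the fastest codon
        weights[config] = math.exp(a * k + b * (N - k))
    Z = sum(weights.values())                    # partition function (normaliser)
    return {config: w / Z for config, w in weights.items()}


# Small example: 3 sites, fastest codon preferred (a > b).
probs = configuration_probabilities(N=3, a=1.0, b=0.0)
all_fast = probs[(True, True, True)]
all_other = probs[(False, False, False)]
```

With a = b every configuration is equally likely, which corresponds to the pure random walk without selection. Brute-force enumeration is only feasible for small N, since the number of configurations grows as 2^N.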
At this level, the model reduces to understanding the probability of various
macro-states. Here the macro-states are defined by the number of amino acids,
$k$, that are encoded by the most frequent codon. We formulate the partition
function by considering the number of configurations that are compatible with exactly $k$
amino acids being encoded by the most frequent codon. Formally, this is the same