This chapter is the longest in the book as it deals with both general principles and practical aspects of sequence and, to a lesser degree, structure analysis. Although these methods are not, in themselves, part of genomics, no reasonable genome analysis and annotation would be possible without understanding how these methods work and having some practical experience with their use. Inappropriate use of sequence analysis procedures may result in numerous errors in genome annotation (we have already touched upon this subject in the previous chapter and further discuss it in Chapter 5). We attempted to strike a balance between generalities and specifics, aiming to give the reader a clear perspective of the computational approaches used in comparative and functional genomics, rather than discuss any one of these approaches in great detail. In particular, we refrained from any extensive discussion of the statistical basis and algorithmic aspects of sequence analysis because these can be found in several recent books on computational biology and bioinformatics (see 4.8), and no less importantly, because we cannot claim advanced expertise in this area. We also tried not to duplicate the “click here”-type tutorials, which are available on many web sites. However, we deemed it important to point out some difficult and confusing issues in sequence analysis and warn the readers against the most common pitfalls. This discussion is largely based on our personal, practical experience in analysis of protein families. We hope that this discussion might clarify some aspects of sequence analysis that “you always wanted to know about but were afraid to ask”. Still, we felt that it would be impossible (and unnecessary) to discuss all methods of sequence and structure analysis in one, even long, chapter and concentrated on those techniques that are central to comparative genomics. 
We hope that after working through this chapter, interested readers will be encouraged to continue their education in methods of sequence analysis using more specialized texts, including those in the ‘Further Reading’ list (see 4.8).
4.2. Principles of Sequence Similarity Searches
As discussed in the previous section, initial characterization of any new DNA or protein sequence starts with a database search aimed at finding out whether homologs of this gene (protein) are already available, and if they are, what is known about them. Clearly, looking for exactly the same sequence is quite straightforward. One can just take the first letter of the query sequence, search for its first occurrence in the database, and then check if the second letter of the query is the same in the subject. If it is indeed the same, the program could check the third letter, then the fourth, and continue this comparison to the end of the query. If the second letter in the subject is different from the second letter in the query, the program should search for another occurrence of the first letter, and so on. This will identify all the sequences in the database that are identical to the query sequence (or include it). Of course, this approach is computationally primitive, and there are sophisticated algorithms for text matching that do it much more efficiently.
Note that, in the example above, we looked only for sequences that exactly match the query. The algorithm would not even find a sequence that is identical to the query with the exception of the first letter. To find such sequences, the same analysis should be conducted with the fragments starting from the second letter of the original query, then from the third one, and so on.
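The brute-force scheme described above is easy to express in code. The following minimal sketch (the toy "database" string is hypothetical, chosen only for illustration) reports every position at which the full query occurs verbatim:

```python
def find_exact_matches(query, database):
    """Naive scan, as described in the text: find each occurrence of the
    query's first letter, then extend the comparison letter by letter."""
    hits = []
    start = 0
    while True:
        # locate the next occurrence of the first letter of the query
        pos = database.find(query[0], start)
        if pos == -1:
            break
        # extend the comparison to the end of the query
        if database[pos:pos + len(query)] == query:
            hits.append(pos)
        start = pos + 1
    return hits

# Toy "database" of concatenated sequence letters (hypothetical)
db = "MKVRASVKKLCRNCKIVKRDGVIRVICSAEPKHKQRQG"
print(find_exact_matches("KVRASV", db))   # -> [1]
```

In practice, dedicated string-matching algorithms and indexing structures accomplish the same task far more efficiently than this letter-by-letter scan.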
Such a search quickly becomes time-consuming, and we are still dealing only with identical sequences. Finding close relatives would introduce additional conceptual and technical problems. Let us assume that sequences that are 99% identical are definitely homologous. What should one select as the threshold to consider sequences not to be homologous: 50% identity, 33%, or perhaps 25%? These are legitimate questions that need to be answered before one goes any further. The example of two lysozymes (see 2.1.2) shows that sequences with as low as 8% identity may belong to orthologous proteins and perform the same function.
As a matter of fact, when comparing nucleic acid sequences, there is very little one could do. All the four nucleotides, A, T, C, and G, are found in the database with approximately the same frequencies and have roughly the same probability of mutating one into another. As a result, DNA-DNA comparisons are largely based on straightforward text matching, which makes them fairly slow and not particularly sensitive, although a variety of heuristics have been developed to overcome this.
Amino acid sequence comparisons have several distinct advantages over nucleotide sequence comparisons, which, at least potentially, lead to a much greater sensitivity. Firstly, because there are 20 amino acids but only four bases, an amino acid match carries with it >4 bits of information as opposed to only two bits for a nucleotide match. Thus, statistical significance can be ascertained for much shorter sequences in protein comparisons than in nucleotide comparisons. Secondly, because of the redundancy of the genetic code, nearly one-third of the bases in coding regions are under a weak (if any) selective pressure and represent noise, which adversely affects the sensitivity of the searches. Thirdly, nucleotide sequence databases are much larger than protein databases because of the vast amounts of non-coding sequences coming out of eukaryotic genome projects, and this further lowers the search sensitivity. Finally, and probably most importantly, unlike in nucleotide sequences, the likelihoods of different amino acid substitutions occurring during evolution are substantially different, and taking this into account greatly improves the performance of database search methods as described below. Given all these advantages, comparisons of any coding sequences are typically carried out at the level of protein sequences; even when the goal is to produce a DNA-DNA alignment (e.g. for analysis of substitutions in silent codon positions), it is usually first done with protein sequences, which are then replaced by the corresponding coding sequences. Direct nucleotide sequence comparison is indispensable only when non-coding regions are analyzed.
4.2.1. Substitution scores and substitution matrices
The fact that each of the 20 standard protein amino acids has its own unique properties means that the likelihood of the substitution of each particular residue for another residue during evolution should be different. Generally, the more similar the physico-chemical properties of two residues, the greater the chance that the substitution will not have an adverse effect on the protein's function and, accordingly, on the organism's fitness. Hence, in sequence comparisons, such a substitution should be penalized less than a replacement of an amino acid residue with one that has dramatically different properties. This is, of course, an oversimplification, because the effect of a substitution depends on the structural and functional environment where it occurs. For example, a cysteine-to-valine substitution in the catalytic site of an enzyme will certainly abolish the activity and, on many occasions, will have a drastic effect on the organism's fitness. In contrast, the same substitution within a β-strand may have little or no effect. Unfortunately, in general, we do not have a priori knowledge of the location of a particular residue in the protein structure, and even with such knowledge, incorporating it in a database search algorithm is an extremely complex task. Thus, a generalized measure of the likelihood of amino acid substitutions is required so that each substitution is given an appropriate score (weight) to be used in sequence comparisons. The score for a substitution between amino acids i and j can always be expressed by the following intuitively plausible formula, which shows how likely a particular substitution is given the frequencies of each of the two residues in the analyzed database:
Sij = k × log[qij/(pipj)],    (4.1)

where k is a coefficient, qij is the observed frequency of the given substitution, and pi, pj are the background frequencies of the respective residues. Obviously, here the product pipj is the expected frequency of the substitution and, if qij = pipj (Sij = 0), the substitution occurs just as often as expected. The scores used in practice are scaled such that the expected score for aligning a random pair of amino acid sequences is negative (see below).
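As a quick numerical illustration of equation 4.1, the sketch below computes log-odds scores from made-up frequency values (the q and p numbers are hypothetical and do not come from any real substitution matrix):

```python
import math

def substitution_score(q_ij, p_i, p_j, k=2.0):
    """Log-odds substitution score (equation 4.1): positive when the
    substitution is observed more often than expected by chance,
    negative when it is observed less often. k is a scaling coefficient."""
    return k * math.log(q_ij / (p_i * p_j))

# Hypothetical frequencies, for illustration only:
# a common, conservative interchange scores above zero...
print(round(substitution_score(q_ij=0.02, p_i=0.09, p_j=0.06), 2))    # > 0
# ...whereas a substitution seen far less often than chance scores below zero
print(round(substitution_score(q_ij=0.0001, p_i=0.013, p_j=0.07), 2)) # < 0
```

Note that when qij exactly equals pipj, the logarithm vanishes and the score is zero, matching the statement in the text.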
There are two fundamentally different ways to come up with a substitution score matrix, i.e. a triangular table containing 210 numerical score values for each pair of amino acids, including identities (diagonal elements of the matrix; Figures 4.4 and 4.5). As in many other situations in computational biology, the first approach works ab initio, whereas the second one is empirical. One ab initio approach calculates the score as the number of nucleotide substitutions that are required to transform a codon for one amino acid in a pair into a codon for the other. In this case, the matrix is obviously unique (as long as alternative genetic codes are not considered) and contains only four values, 0, 1, 2, or 3. Accordingly, this is a very coarse grain matrix that is unlikely to work well. The other ab initio approach assigns scores on the basis of similarities and differences in the physico-chemical properties of amino acids. Under this approach, the number of possible matrices is infinite, and they may have as fine a granularity as desirable, but a degree of arbitrariness is inevitable because our understanding of protein physics is insufficient to make informed decisions on what set of properties “correctly” reflects the relationships between amino acids.
Figure 4.4. The PAM30 substitution matrix. The numbers indicate the substitution scores for each replacement. The greater the number, the lesser the penalty for the given substitution. Note the high penalty for replacing Cys and the aromatic amino acids (Phe, Tyr, and Trp).
Figure 4.5. The BLOSUM62 substitution matrix. The meaning of the numbers is the same as in Figure 4.4. Note the relatively lower reward for conservation of Cys, Phe, Tyr, and Trp and the lower penalties for replacing these amino acids than in the PAM30 matrix.
Empirical approaches, which historically came first, attempt to derive the characteristic frequencies of different amino acid substitutions from actual alignments of homologous protein families. In other words, these approaches strive to determine the actual likelihood of each substitution occurring during evolution. Obviously, the outcome of such efforts critically depends on the quantity and quality of the available alignments, and even now, any alignment database is far from being complete or perfectly correct. Furthermore, simple counting of different types of substitutions will not suffice if alignments of distantly related proteins are included because, in many cases, multiple substitutions might have occurred in the same position. Ideally, one should construct the phylogenetic tree for each family, infer the ancestral sequence for each internal node, and then count the substitutions exactly. This is not practicable in most cases, and various shortcuts need to be taken.
Several solutions to these problems have been proposed, each resulting in a different set of substitution scores. The first substitution matrix, constructed by Dayhoff and Eck in 1968, was based on an alignment of closely related proteins, so that the ancestral sequence could be deduced and all the amino acid replacements could be considered occurring just once. This model was then extrapolated to account for more distant relationships (we will not discuss here the mathematics of this extrapolation and the underlying evolutionary model), which resulted in the PAM series of substitution matrices (Figure 4.4). PAM (Accepted Point Mutation) is a unit of evolutionary divergence of protein sequences, corresponding to one amino acid change per 100 residues. Thus, for example, the PAM30 matrix is supposed to apply to proteins that differ, on average, by 0.3 change per aligned residue, whereas PAM250 should reflect evolution of sequences with an average of 2.5 substitutions per position. Accordingly, the former matrix should be employed for constructing alignments of closely related sequences, whereas the latter is useful in database searches aimed at detection of distant relationships. Using an approach similar to that of Dayhoff, combined with rapid algorithms for protein sequence clustering and alignment, Jones, Taylor, and Thornton produced the series of the so-called JTT matrices, which are essentially an update of the PAMs.
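The extrapolation step behind the PAM series can be illustrated with a toy example: if M is the matrix of substitution probabilities at a distance of 1 PAM, the probabilities at n PAM are given by the n-th matrix power of M. The 3-letter "alphabet" and all numbers below are invented solely to show the mechanics, not taken from the Dayhoff data:

```python
import numpy as np

# Toy 1-PAM transition-probability matrix for a 3-letter "alphabet"
# (each row sums to 1); the values are invented for illustration.
M1 = np.array([[0.990, 0.008, 0.002],
               [0.008, 0.990, 0.002],
               [0.004, 0.004, 0.992]])

# Extrapolation to a larger distance: PAM-n probabilities = M1 ** n
M250 = np.linalg.matrix_power(M1, 250)

# At 1 PAM the matrix is nearly diagonal; by 250 PAM much of the
# probability mass has leaked into the off-diagonal (substitution) cells.
print(np.round(M250, 2))
```

The actual PAM scores are then obtained by converting such extrapolated probabilities into log-odds values via equation 4.1; it is exactly this extrapolation from short distances that the text identifies as the main weakness of the approach.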
The PAM and JTT matrices, however, have obvious limitations because they have been derived from alignments of closely related sequences and then extrapolated to distantly related ones. This extrapolation may not be fully valid because the underlying evolutionary model might not be adequate, and the trends that determine sequence divergence of closely related sequences might not apply to evolution at larger distances.
In 1992, Steven and Jorja Henikoff developed a different series of substitution matrices using conserved ungapped alignments of related proteins from the BLOCKS database (see 3.2.1). The use of these alignments offered three important advantages over the alignments used for constructing the PAM matrices. First, the BLOCKS collection obviously included a much larger number and, more importantly, a much greater diversity of protein families than the collection that was available to Dayhoff and coworkers in the 1970s. Second, coming from rather distantly related proteins, BLOCKS alignments better reflected the amino acid changes that occur over large phylogenetic distances and thus produced substitution scores that represented sequence divergence in distant homologs directly, rather than through extrapolation. Third, in these distantly related proteins, BLOCKS included only the most confidently aligned regions, which are likely to best represent the prevailing evolutionary trends. These substitution matrices, named the BLOSUM (= BLOcks SUbstitution Matrix) series, were tailored to particular evolutionary distances by ignoring the sequences that had more than a certain percent identity. In the BLOSUM62 matrix, for example, the substitution scores were derived from the alignments of sequences that had no more than 62% identity; the substitution scores of the BLOSUM45 matrix were calculated from the alignments that contained sequences with no more than 45% identity. Accordingly, BLOSUM matrices with high numbers, such as BLOSUM80, are best suited for comparisons of closely related sequences (it is also advisable to use BLOSUM80 for database searches with short sequences, see 4.4.2), whereas low-number BLOSUM matrices, such as BLOSUM45, are better for distant relationships.
In addition to the general-purpose PAM, JTT, and BLOSUM series, some specialized substitution matrices were developed, for example, for integral membrane proteins, but they never achieved comparable recognition.
Several early studies found that the PAM matrices based on empirical data consistently resulted in greater search sensitivity than any of the ab initio matrices. An extensive empirical comparison showed that: (i) BLOSUM matrices consistently outperformed PAMs in BLAST searches, and (ii) on average, BLOSUM62 (Figure 4.5) performed best in the series; this matrix is currently used as the default in most sequence database searches. It is remarkable that so far, throughout the 30-plus-year history of amino acid substitution matrices, empirical matrices have consistently outperformed those based on theory, either physico-chemical or evolutionary. This is not to say, of course, that theory is powerless in this field, but to point out that we currently do not have a truly adequate theory to describe protein evolution. Clearly, the last word has not been said on amino acid substitution matrices, and one can expect that eventually the BLOSUM series will be replaced by new matrices based on greater amounts of higher-quality alignment data and more realistic evolutionary models. A recently reported maximum-likelihood model for substitution frequency estimation has already been claimed to describe individual protein families better than the Dayhoff and JTT models. It remains to be seen how this and other new matrices perform in large-scale computational experiments on real databases. AAindex (http://www.genome.ad.jp/dbget/aaindex.html, see 3.6.3) lists 66 different substitution matrices, both ab initio and empirical, and there is no doubt that this list will continue to grow.
4.2.2. Statistics of protein sequence comparison
It is impossible to explain even the basic principles of statistical analysis of sequence similarities without invoking some mathematics. To introduce these concepts in the least painful way, let us consider the same protein sequence (E. coli RpsJ) as above
and check how many times segments of this sequence of different lengths are found in the database (we chose fragments starting from the second position in the sequence because nearly every protein in the database starts with a methionine). Not unexpectedly, we find that the larger the fragment, the smaller the number of exact matches in the database (Table 4.4).
Table 4.4. Dependence of the number of exact database matches on the length of the query word.
Perhaps somewhat counterintuitively, a 9-mer is already unique. With the decrease in the number of database hits, the likelihood that these hits are biologically relevant, i.e. belong to homologs of the query protein, increases. Thus, 13 of the 23 occurrences of the string KVRASV and all 8 occurrences of the string KVRASVK are from RpsJ orthologs.
The number of occurrences of a given string in the database can be roughly estimated as follows. The probability of matching one amino acid residue is 1/20 (assuming equal frequencies of all 20 amino acids in the database; this not being the case, the probability is slightly greater). The probability of matching two residues in a row is then (1/20)^2, and the probability of matching n residues is (1/20)^n. Given that the protein database currently contains N ~2 × 10^8 letters, one should expect a string of n letters to match approximately N × (1/20)^n times, which is fairly close to the numbers in Table 4.4.
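This back-of-the-envelope estimate is trivial to reproduce; the sketch below uses the database size and equal-frequency assumptions stated in the text:

```python
def expected_matches(n, db_size=2e8, alphabet=20):
    """Expected number of exact matches of an n-letter string in a
    database of db_size residues, assuming all residues are equally
    frequent: db_size * (1/alphabet)**n."""
    return db_size * (1.0 / alphabet) ** n

# A 6-mer is expected a few times by chance; a 9-mer, essentially never,
# consistent with the observation that a 9-mer is already unique.
for n in range(4, 10):
    print(n, expected_matches(n))
```
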
Searching for perfect matches is the simplest and, in itself, obviously insufficient form of sequence database search, although, as we shall see below, it is important as one of the basic steps in currently used search algorithms. As repeatedly mentioned above, the goal of a search is finding homologs, which can have drastically different sequences such that, in distant homologs, only a small fraction of the amino acid residues are identical or even similar. Even in close homologs, a region of high similarity is usually flanked by dissimilar regions like in the following alignment of E. coli RpmJ with its ortholog from Vibrio cholerae:
In this example, the region of highest similarity is in the middle of the alignment, but including the less conserved regions on both sides improves the overall score (taking into account the special treatment of gaps, which is introduced below). Further along the alignment, the similarity almost disappears so that inclusion of additional letters into the alignment would not increase the overall score or would even decrease it. Such fragments of the alignment of two sequences whose similarity score cannot be improved by adding or trimming any letters, are referred to as high-scoring segment pairs (HSPs). For this approach to work, the expectation of the score for random sequences must be negative, and the scoring matrices used in database searches are scaled accordingly (see Figures 4.4 and 4.5).
So, instead of looking for perfect matches, sequence comparison programs actually search for HSPs. Once a set of HSPs is found, different methods, such as Smith-Waterman, FASTA, or BLAST, deal with them in different fashions (see below). However, the principal issue that any database search method needs to address is identifying those HSPs that are unlikely to occur by chance and, by inference, are likely to belong to homologs and to be biologically relevant. This problem was solved by Samuel Karlin and Stephen Altschul, who showed that maximal HSP scores follow the extreme value distribution. Accordingly, if the lengths of the query sequence (m) and the database (n) are sufficiently large, the expected number of HSPs with a score of at least S is given by the formula

E = Kmn × e^(−λS)    (4.2)
Here, S is the so-called raw score calculated under a given scoring system, and K and λ are natural scaling parameters for the search space size and the scoring system, respectively. Normalizing the score according to the formula

S′ = (λS − ln K)/ln 2    (4.3)
gives the bit score, which is expressed in bits, a standard unit accepted in information theory and computer science. Then,

E = mn × 2^(−S′)    (4.4)
and, since it can be shown that the number of random HSPs with score ≥ S′ follows a Poisson distribution, the probability of finding at least one HSP with bit score ≥ S′ is

P = 1 − e^(−E)    (4.5)
Equation 4.5 links two commonly used measures of sequence similarity, the probability (P-value) and expectation (E-value). For example, if the score S is such that three HSPs with this score (or greater) are expected to be found by chance, the probability of finding at least one such HSP is (1 − e^(−3)), ~0.95. By definition, P-values vary from 0 to 1, whereas E-values can be much greater than 1. The BLAST programs (see below) report E-values, rather than P-values, because E-values of, for example, 5 and 10 are much easier to comprehend than P-values of 0.993 and 0.99995. However, for E < 0.01, P-value and E-value are nearly identical.
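The conversion between the two measures (equation 4.5) can be checked directly; note how the P-value approaches the E-value for small E and saturates near 1 for large E:

```python
import math

def p_from_e(e_value):
    """Probability of finding at least one HSP by chance, given the
    expected number of chance HSPs with that score (equation 4.5)."""
    return 1.0 - math.exp(-e_value)

for e in (10, 3, 0.01, 0.001):
    print(e, p_from_e(e))
# E = 3 gives P ~0.95, as in the example in the text;
# for E < 0.01, P and E are nearly identical.
```
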
The product mn defines the search space, a critically important parameter of any database search. Equations 4.2 and 4.4 codify the intuitively obvious notion that the larger the search space, the higher the expectation of finding an HSP with a score greater than any given value. There are two corollaries of this that might take some getting used to: (i) the same HSP may come out statistically significant in a small database and not significant in a large database; with the natural growth of the database, any given alignment becomes less and less significant (but by no means less important because of that) and (ii) the same HSP may be statistically significant in a small protein (used as a query) and not significant in a large protein.
Clearly, one can easily decrease the E-value and the P-value associated with the alignment of the given two sequences by lowering n in equation 4.2, i.e. by searching a smaller database. However, the resulting increase in significance is false, although such a trick can be useful for detecting initial hints of subtle relationships that should be subsequently verified using other approaches. It is the experience of the authors that the simple notion of E(P)-value is often misunderstood and interpreted as if these values applied just to a single pairwise comparison (i.e., if an E-value of 0.001 for an HSP with score S is reported, then, in a database of just a few thousand sequences, one expects to find a score >S by chance). It is critical to realize that the size of the search space is already factored in these E-values, and the reported value corresponds to the database size at the time of search (thus, it is certainly necessary to indicate, in all reports of sequence analysis, which database was searched, and desirably, also on what exact date).
Speaking more philosophically (or futuristically), one could imagine that, should the genomes of all species that inhabit this planet be sequenced, it would become almost impossible to demonstrate statistical significance for any but very close homologs in standard database searches. Thus, other approaches to homology detection are required that counter the problems created by database growth by taking advantage of the simultaneously increasing sequence diversity, and as discussed below, some have already been developed.
The Karlin-Altschul statistics has been rigorously proved to apply only to sequence alignments that do not contain gaps, whereas statistical theory for the more realistic gapped alignments remains an open problem. However, extensive computer simulations have shown that these alignments also follow the extreme value distribution to a high precision; therefore, at least for all practical purposes, the same statistical formalism is applicable [19,22].
Those looking for a detailed mathematical description of the sequence comparison statistics can find it in the “Further Reading” list at the end of the chapter. A brief explanation of the statistical principles behind the BLAST program, written by Stephen Altschul, can be found online at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html.
4.2.3. Protein sequence complexity: Compositional bias
The existence of a robust statistical theory of sequence comparison, in principle, should allow one to easily sort search results by statistical significance and accordingly assign a level of confidence to any homology identification. However, a major aspect of protein molecule organization substantially complicates database search interpretation and may lead to gross errors in sequence analysis. Many proteins, especially in eukaryotes, contain low (compositional) complexity regions, in which the distribution of amino acid residues is non-random, i.e. deviates from the standard statistical model [919,921]. In other words, these regions typically have biased amino acid composition, e.g. are rich in glycine or proline, or in acidic or basic amino acid residues. The ultimate form of low complexity is, of course, a homopolymer, such as a Q-linker. Other low-complexity sequences have a certain amino acid periodicity, sometimes subtle, such as, for example, in coiled-coil and other non-globular proteins (e.g. collagen or keratin).
The notion of compositional complexity was encapsulated in the SEG algorithm and the corresponding program, which partitions protein sequences into segments of low and high (normal) complexity. An important finding made by John Wootton is that low-complexity sequences correspond to non-globular portions of proteins. In other words, a certain minimal level of complexity is required for a sequence to fold into a globular structure. Low-complexity regions in proteins, although devoid of enzymatic activity, have important biological functions, most often mediating protein-protein interactions or the adhesion of cells to various surfaces and to each other.
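As a rough illustration of what compositional complexity means, the sketch below scores a sequence window by the Shannon entropy of its residue composition; this is a simplified proxy for, not a reimplementation of, the complexity measure used by SEG:

```python
import math
from collections import Counter

def window_entropy(seq):
    """Shannon entropy (in bits) of the residue composition of a window:
    0 for a homopolymer, higher for more diverse (complex) sequences."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(window_entropy("QQQQQQQQQQQQ"))   # homopolymer: 0.0
print(window_entropy("QNQGNQGGNQNG"))   # three-residue bias: low complexity
print(window_entropy("MQNVRIRLKAFD"))   # diverse window: higher complexity
```

A window made of only Gln, Asn, and Gly (as in the mastermind protein discussed below) scores well below a typical globular-sequence window, which is the intuition behind SEG-style partitioning.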
In a detailed empirical study, a set of parameters of the SEG program was identified that allowed reasonably accurate partitioning of a protein sequence into predicted globular and non-globular parts. A striking example is the mastermind protein of Drosophila, a component of the Notch-dependent signaling pathway that plays an important role in the development of the nervous system of the fruit fly [755,785]. In spite of this critical biological function, this protein consists mostly of stretches of only three amino acid residues, Gln, Asn, and Gly, and is predicted to have a predominantly non-globular structure (Figure 4.6). Recently discovered human homologs of mastermind are also involved in Notch-dependent transcriptional regulation and similarly appear to be almost entirely non-globular [442,923].
Figure 4.6. Sequence of the Drosophila mastermind protein: partitioning into predicted non-globular (left column) and globular (right column) regions. The SEG program was run with the parameters optimized for detection of non-globular regions (window length 45).
Low-complexity regions represent a major problem for database searches. Since the λ parameter of equation 4.2 is calculated for the entire database, Karlin-Altschul statistics breaks down when the composition of the query or a database sequence (or both) significantly deviates from the average composition of the database. The result is that low-complexity regions with similar composition (e.g. acidic or basic) often produce “statistically significant” alignments that have nothing to do with homology and are completely irrelevant. The SEG program can be used to overcome this problem in a somewhat crude manner: the query sequence, the database, or both can be partitioned into normal-complexity and low-complexity regions, and the latter are masked (i.e. amino acid symbols are replaced with the corresponding number of X's). For the purpose of a database search, such filtering is usually done using short windows so that only the segments with a strong compositional bias are masked. Low-complexity filtering has been indispensable for making database search methods, in particular BLAST, into reliable tools. Without masking low-complexity regions, false results would have been produced for a substantial fraction of proteins, especially eukaryotic ones (an early estimate held that low-complexity regions comprise ~15% of the protein sequences in the SWISS-PROT database). These false results would have badly polluted any large-scale database search, and the respective proteins would have been refractory to any meaningful sequence analysis. For these reasons, for several years, SEG filtering had been used as the default for BLAST searches to mask low-complexity segments in the query sequence. However, this procedure is not without its drawbacks. Not all low-complexity sequences are captured, and false positives still occur in database searches.
The opposite problem also hampers database searches for some proteins: when short low-complexity sequences are parts of conserved regions, statistical significance of an alignment may be underestimated, sometimes grossly.
In a recent work of Alejandro Schäffer and colleagues, a different, less arbitrary approach for dealing with compositionally biased sequences was introduced. This method, called composition-based statistics, recalculates the λ parameter and, accordingly, the E-values (see equation 4.2) for each query and each database sequence, thus correcting the inordinately low (“significant”) E-values for sequences with similarly biased amino acid composition. This improves the accuracy of the reported E-values and eliminates most false positives. Composition-based statistics is currently used as the default for the NCBI BLAST. In 4.4.2, we will discuss the effect of this procedure on database search outcome in greater detail and using specific examples.
4.3. Algorithms for Sequence Alignment and Similarity Search
4.3.1. The basic alignment concepts and principal algorithms
As discussed in the previous sections, similarity searches aim at identifying the homologs of the given query protein (or nucleotide) sequence among all the protein (or nucleotide) sequences in the database. Even in this general discussion, we repeatedly mentioned and, on some occasions, showed sequence alignments. An alignment of homologous protein sequences reveals their common features that are ostensibly important for the structure and function of each of these proteins; it also reveals poorly conserved regions that are less important for the common function but might define the specificity of each of the homologs. In principle, the only way to identify homologs is by aligning the query sequence against all the sequences in the database (below we will discuss some important heuristics that allow an algorithm to skip sequences that are obviously unrelated to the query), sorting these hits based on the degree of similarity, and assessing their statistical significance that is likely to be indicative of homology. Thus, before considering algorithms and programs used to search sequence databases, we must briefly discuss alignment methods themselves.
It is important to make a distinction between a global (i.e. full-length) alignment and a local alignment, which includes only parts of the analyzed sequences (subsequences). Although, in theory, a global alignment is best for describing relationships between sequences, in practice, local alignments are of more general use for two reasons. Firstly, it is common for only parts of the compared proteins to be homologous (e.g. they share one conserved domain, whereas other domains are unique). Secondly, on many occasions, only a portion of the sequence is conserved enough to carry a detectable signal, whereas the rest has diverged beyond recognition. Optimal global alignment of two sequences was first achieved by the Needleman-Wunsch algorithm, which employs dynamic programming. The notion of an optimal local alignment (the best possible alignment of two subsequences from the compared sequences) and the corresponding dynamic programming algorithm were introduced by Smith and Waterman. Both of these are O(n^2) algorithms, i.e. the time and memory required to generate an optimal alignment are proportional to the product of the lengths of the compared sequences (for convenience, the sequences are assumed to be of equal length n in this notation). Optimal alignment algorithms for multiple sequences have O(n^k) complexity (where k is the number of compared sequences). Such algorithms are not feasible for k > 3 on any existing computers; therefore, all available methods for multiple sequence alignment produce only approximations and do not guarantee the optimal alignment.
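The dynamic programming idea behind these algorithms can be sketched in a few lines. The following is a toy Smith-Waterman implementation (score only, no traceback), which uses a simple match/mismatch/linear-gap scoring scheme rather than an amino acid substitution matrix; it is meant to show where the O(n^2) time and memory cost comes from, not to replace real alignment software:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman: returns the optimal local alignment score.
    The matrix H has (len(a)+1) * (len(b)+1) cells and each cell is
    filled once -- hence the O(n^2) time and memory discussed in the text.
    Scoring is a toy scheme, not a real substitution matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # The max with 0 is what makes the alignment local: a poorly
            # scoring prefix is simply discarded (Needleman-Wunsch, the
            # global variant, omits this clamp).
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("XXGATTACAXX", "YYGATTACAYY"))  # 14: only GATTACA aligns
```

Because the flanking X/Y regions share nothing, the local score equals that of the conserved GATTACA block alone, illustrating why local alignment detects a shared domain embedded in otherwise unrelated sequences.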
It might be useful, at this point, to clarify the notion of optimal alignment. Algorithms like Needleman-Wunsch and Smith-Waterman guarantee the optimal alignment (global and local, respectively) for any two compared sequences. It is important to keep in mind, however, that this optimality is a purely formal notion, which means that, given a scoring function, the algorithm outputs the alignment with the highest possible score. This has nothing to do with the statistical significance of the alignment, which has to be estimated separately (e.g. using the Karlin-Altschul statistics as outlined above), let alone the biological relevance of the alignment.
For better or worse, alignment algorithms treat protein or DNA as simple strings of letters, without recourse to any specific properties of biological macromolecules. Therefore, it might be useful to illustrate the principles of local alignment using a text free of biological context. Below is the text of stanzas I and IV of one of the most famous poems of all time; we shall compare them line by line, observing along the way various problems involved in sequence alignment (the alignable regions are shown in bold):
It is easy to see that, in the first two lines of the two stanzas, the longest common string consists of only five letters, with one mismatch:
The second lines align better, with two similar blocks separated by spacers of variable lengths, which requires gaps to be introduced, in order to combine them in one alignment:
In the third lines, there are common words of seven, four, and six letters, again separated by gaps:
The fourth lines align very well, with a long string of near identity at the end:
In contrast, there is no reasonable alignment between the fifth lines, except for the identical word ‘door’. Obviously, however, the fourth line of the second stanza may be aligned not only with the fourth (IV), but also with the fifth line of the first stanza:
Alignments (IV) and (IV′) can thus be combined to produce a multiple alignment:
Finally, the sixth lines of the two stanzas can be aligned at their ends:
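The simplest operation underlying such comparisons, finding the longest exactly matching block shared by two strings, can be sketched with Python's standard difflib; the example strings below are arbitrary English phrases, not the poem lines themselves:

```python
from difflib import SequenceMatcher

def longest_common_block(s, t):
    """Longest contiguous block shared by two strings (exact match only,
    no mismatches or gaps) -- the most primitive 'local alignment'."""
    m = SequenceMatcher(None, s, t).find_longest_match(0, len(s), 0, len(t))
    return s[m.a:m.a + m.size]

print(longest_common_block("the quick brown fox jumps",
                           "a slick brown fox jumped"))
```

Real local alignment generalizes this in two ways that the stanza comparison illustrates: it tolerates mismatches within a block, and it joins several blocks by introducing gaps.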
This simple example seems to capture several important issues that emerge in sequence alignment analysis. Firstly, remembering that an optimal alignment can be obtained for any two sequences, we should ask: Which alignments actually reflect homology of the respective lines? The alignments III, IV, IV′ (and the derivative IV′′), and V seem to be relevant beyond reasonable doubt. However, are they really correct? In particular, aligning en-ly/ently in III and ntly/ntly in IV requires introducing gaps into both sequences. Is this justified? We cannot answer this simple question without a statistical theory for assessing the significance of an alignment, including a way to introduce some reasonable gap penalties.
The treatment of gaps is one of the hardest and still unsolved problems of alignment analysis. There is no theoretical basis for assigning gap penalties relative to substitution penalties (scores). Deriving these penalties empirically is a much more complicated task than deriving substitution penalties as in the PAM and BLOSUM series because, unlike the alignment of residues in highly conserved blocks, the number and positions of gaps in alignments tend to be highly uncertain (see, for example, alignment IV: Is it correct to place gaps both before and after ‘so’ in the second line?). Thus, gap penalties are typically assigned on the basis of two notions that stem both from the existing understanding of protein structure and from empirical examinations of protein family alignments: (i) a deletion or insertion resulting in a gap is much less likely to occur than even the most radical amino acid substitution and should be heavily penalized, and (ii) once a deletion (or insertion) has occurred in a given position, deletion or insertion of additional residues (gap extension) becomes much more likely. Therefore, a linear function

G(x) = a + bx,
where a is the gap opening penalty, b is the gap extension penalty, and x is the length of the gap, is used to deal with gaps in most alignment methods. Typically, a = 10 and b = 1 is a reasonable choice of gap penalties for use in conjunction with the BLOSUM62 matrix. Using these values, the reader should be able to find out whether gaps should have been introduced in alignments III and IV above. In principle, objective gap penalties could be produced through analysis of the distributions of gaps in structural alignments, and one such study suggested using convex functions for gap penalties. However, this makes alignment algorithms much more computationally expensive, and the practical advantages remain uncertain, so linear gap penalties are still universally employed.
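As a minimal illustration of this scheme with the values suggested above (a = 10, b = 1), note how one long gap is penalized far less than several short ones; the function name is of course arbitrary:

```python
def gap_penalty(x, a=10, b=1):
    """Linear (affine) gap penalty a + b*x: a is charged once for opening
    the gap, b for each gapped position (a = 10, b = 1, the values
    suggested in the text for use with BLOSUM62)."""
    return a + b * x if x > 0 else 0

print(gap_penalty(3))      # one 3-residue gap: 13
print(3 * gap_penalty(1))  # three separate 1-residue gaps: 33
```

This asymmetry encodes notion (ii) above: a single insertion or deletion event of several residues is far more plausible than several independent single-residue events.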
The feasibility of alignments (IV) and (IV′) creates a problem of choice: Which of these is the correct alignment? Alignment (IV) wins because it clearly has a longer conserved region. What, then, is the origin of line 5 in the first stanza and, accordingly, of alignment (IV′)? It is not too difficult to figure out that this line is a repeat, a result of duplication of line 4 (this is what we have to conclude, given that line 4 is more similar to the homologous line in the second stanza). Such duplications are common in protein sequences, too, and often create major problems for alignment methods.