
Biologically significant sites in a protein may be identified by contrasting the rates of synonymous (Ks) and non-synonymous (Ka) substitutions. This enables the inference of site-specific positive Darwinian selection and purifying selection. We present here Selecton version 2.2 (http://selecton.bioinfo.tau.ac.il), a web server which automatically calculates the ratio between Ka and Ks (ω) at each site of the protein. This ratio is graphically displayed on each site using a color-coding scheme, indicating either positive selection, purifying selection or lack of selection. Selecton implements an assembly of different evolutionary models, which allow for statistical testing of the hypothesis that a protein has undergone positive selection. Specifically, the recently developed mechanistic-empirical model is introduced, which takes into account the physicochemical properties of amino acids. Advanced options were introduced to allow maximal fine tuning of the server to the user's specific needs, including calculation of statistical support of the ω values, an advanced graphic display of the protein's 3-dimensional structure, use of different genetic codes and inputting of a pre-built phylogenetic tree. Selecton version 2.2 is an effective, user-friendly and freely available web server which implements up-to-date methods for computing site-specific selection forces, and the visualization of these forces on the protein's sequence and structure.


INTRODUCTION
Current protein sequences have been shaped by a prolonged and extensive evolutionary process. Thus, studying a protein's evolutionary history may contribute to the inference of the protein's properties. Evolutionarily conserved sites may be indicative of active sites or of protein-protein interaction domains, while highly variable sites may represent sites subjected to positive Darwinian selection (1,2). Such positively selected sites may be interpreted as being a consequence of molecular adaptation, which confers an evolutionary advantage to the organism. Detecting the level of selection operating on a given protein is enabled by computing ω, the ratio between the rates of non-synonymous (Ka) and synonymous (Ks) substitutions.
Sites showing ω values significantly higher than one are indicative of positive Darwinian selection, while sites showing ω values significantly lower than one are indicative of purifying selection (2,3).
Selecton was first introduced in 2005 (4), and implemented an evolutionary codon model (5) which enabled calculating ω at each codon site using a maximum-likelihood (ML) approach. With the advent of sequenced genomes, the comparison of DNA sequences in order to infer meaningful information has become a basic procedure for many researchers. The study of the selection forces operating on a protein, and specifically the attempt to identify positive selection in proteins, has been on the rise, and with this rise the number of users of Selecton has increased dramatically. This has motivated us to upgrade the server to include the most recent state-of-the-art methods for detecting positive selection. We have thus implemented five evolutionary models (6-8) that enable studying the selection forces operating on a protein. Specifically, these models enable testing whether positive selection has operated on the gene under study. This is achieved by comparing a null model assuming no positive selection with a model which allows positive selection. We note that positive selection is not a common phenomenon and is not apparent in most examined datasets (9,10). In these cases, Selecton can be used to accurately infer sites undergoing purifying selection. Both positive and purifying selection are detected by inferring ω at each codon site. All calculations explicitly take into account the phylogenetic relations among the sequences and the underlying stochastic process of evolution. The value of ω at each site is then translated to a discrete color scale, and projected onto one of the homologous sequences specified by the user. If a 3-dimensional (3D) structure of the protein is available, the scores will also be projected onto the Van der Waals surface of the protein.
Various other web servers are available for analyzing the evolutionary forces operating on a gene, including servers which perform analyses only on an amino-acid alignment (e.g. 11-14), and servers which perform analyses of positive selection in codon-based alignments (15-18). For instance, the SNAP server (15) is based on counting methods which estimate the number of synonymous and non-synonymous substitutions between all pairs of sequences (19,20), and the Datamonkey server (16) is based on a combination of model hypothesis testing and codon ancestral sequence reconstruction. The advantage of Selecton over these alternative web servers is that it combines state-of-the-art methods for detecting positive selection with a highly user-friendly interface which requires no previous expert knowledge. This enables the vast community of molecular biology researchers to study positive and purifying selection operating on a protein. Furthermore, expert users may use the advanced options to fine-tune the server to their exact requirements.
We provide here a brief review of Selecton, emphasizing the novel features added in version 2.2 of the server. We further present the results of the server when applied to the TRIM5α protein, a viral host-defense factor that has recently reached the spotlight of positive selection studies (3,21). We delineate how the use of the Selecton server enables the effortless detection of the primate species-specific retroviral restriction domain of TRIM5α found previously (21,22).

Evolutionary models implemented
Several different evolutionary models were implemented in Selecton version 2.2, each emphasizing different biological phenomena:
M7 and M8: These two models were first introduced by Yang et al. (6). In brief, the M8 model assumes that ω values come from a mixture of a discrete beta distribution and an additional category ω_s ≥ 1, which allows for positive selection. M8 is the default model for Selecton runs. The M7 model, nested within M8, does not include the additional ω_s category. Since the beta distribution is defined only on the interval [0,1], it follows that M7 does not allow for positive selection in the protein.
M8a: This model is a variation on the M8 model. In M8a, the additional category ω_s is set to 1. Thus, this model allows only for purifying and neutral selection (7).
M5: This model assumes a gamma distribution over ω (6).
MEC: This model (8) is the only model here which takes into account the differences between amino-acid replacement rates. In brief, this model expands a 20 by 20 amino-acid replacement rate matrix [such as the commonly used JTT matrix (23)] into a 61 by 61 sense-codon rate matrix. Hence, when the non-synonymous rate Ka is inferred, the different replacement probabilities between amino acids with distinct properties are taken into account. For instance, all other models in Selecton assume that the evolutionary rates of leucine (UUG) being replaced by either tryptophan (UGG) or phenylalanine (UUU) are equal, since both require one transversion. However, according to the JTT matrix the latter is five times more likely than the former. Thus, under the MEC model, a position with radical replacements will obtain a higher Ka value than a position with more moderate replacements. It should be noted that in the MEC model, ω values are not directly equivalent to the Ka/Ks ratios. The ω values are used to calculate these ratios, which are later color-coded onto the results. For a more elaborate explanation, refer to the Selecton FAQ section (http://selecton.bioinfo.tau.ac.il/faq.html).
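The leucine example above can be checked mechanically. The following is a minimal Python sketch, not part of Selecton: it builds the standard genetic code (an assumption; Selecton supports other codes as well) and reports, for a single-nucleotide codon change, the amino acids involved and whether the change is a transition or a transversion. The function name `describe_change` is hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
PURINES = {"A", "G"}


def describe_change(codon_from, codon_to):
    """Classify a codon change that differs at exactly one nucleotide."""
    diffs = [(a, b) for a, b in zip(codon_from, codon_to) if a != b]
    if len(diffs) != 1:
        raise ValueError("codons must differ at exactly one position")
    old, new = diffs[0]
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    kind = "transition" if (old in PURINES) == (new in PURINES) else "transversion"
    aa_from, aa_to = CODON_TABLE[codon_from], CODON_TABLE[codon_to]
    return {"from": aa_from, "to": aa_to, "kind": kind,
            "synonymous": aa_from == aa_to}


print(describe_change("UUG", "UGG"))  # Leu -> Trp
print(describe_change("UUG", "UUU"))  # Leu -> Phe
```

Both changes are non-synonymous single transversions, so a model that only distinguishes transitions from transversions treats them identically; weighting them differently requires an amino-acid replacement matrix, as in MEC.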
Comparison of these models allows for statistical testing of the hypothesis that there is positive selection operating on the protein (H1), by contrasting this hypothesis to a null model (H0). Part of the Selecton output is the likelihood of each model, allowing for comparison using either a likelihood ratio test (LRT) if the models are nested, or by comparing the second-order Akaike Information Criterion (AICc) (24) scores if they are not. In brief, an LRT consists of comparing twice the log-likelihood difference between the two models to a χ² distribution. An alternative approach is to compare the AICc scores, defined by $\mathrm{AIC_c} = -2 \log L + 2p \cdot \frac{N}{N - p - 1}$, where L represents the likelihood of the model given the data, p represents the number of free parameters and N represents the sequence length. The lower the AICc score, the better the fit of the model to the data, and hence the model is considered more justified.
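The two comparisons described above can be sketched in a few lines of Python. This is an illustrative sketch rather than Selecton's own code; the function names are hypothetical, and the χ² tail probability is given only for one and two degrees of freedom, where exact closed forms exist.

```python
import math


def aicc(log_likelihood, n_params, n_sites):
    """Second-order AIC: -2 log L + 2p * N / (N - p - 1)."""
    if n_sites - n_params - 1 <= 0:
        raise ValueError("sequence length must exceed n_params + 1")
    return -2.0 * log_likelihood + 2.0 * n_params * n_sites / (n_sites - n_params - 1)


def lrt_statistic(log_l_null, log_l_alt):
    """Twice the log-likelihood difference between nested models."""
    return 2.0 * (log_l_alt - log_l_null)


def chi2_upper_tail(x, df):
    """P(X > x) for a chi-square variable with 1 or 2 degrees of freedom."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")


# Hypothetical log-likelihoods for a null and an alternative model.
stat = lrt_statistic(log_l_null=-1000.0, log_l_alt=-996.0)
print(stat, chi2_upper_tail(stat, df=1))
```

For non-nested models, `aicc` is evaluated for each model separately and the model with the lower score is preferred.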
We recommend performing one of the following comparisons:
(i) M8 against M8a (nested models with one degree of freedom). This is the default comparison performed by Selecton.
(ii) M8 against M7 (nested models with two degrees of freedom).
(iii) MEC against M8a (non-nested models: five free parameters and four free parameters, respectively).
For a comparison of the advantages and disadvantages of each model, please refer to the Selecton FAQ section (http://selecton.bioinfo.tau.ac.il/faq.html).
To enable easier use of Selecton, at the end of a run with a model which allows for positive selection, the user may click on a button labeled 'Test Statistical Significance'. This will run Selecton with the appropriate null model, perform the likelihood comparison (LRT or AICc) and output the significance level of this comparison. It should be noted that in the first two nested comparisons, nesting is achieved by fixing one of the parameters on the boundary of the parameter space (6,25), requiring caution when comparing the models with an LRT. Nevertheless, it has been previously shown that using a χ² distribution with one degree of freedom to approximate the distribution of the LRT statistic when comparing M8 with M8a leads to a conservative test (25). This approach was adopted by Selecton.

Parameter estimation
Parameters common to all models used in Selecton are the codon equilibrium frequencies π_i, the transition/transversion ratio κ and the phylogenetic tree branch lengths. The π_i are calculated as in (6,26) using the products of the observed nucleotide frequencies at the three codon positions (also known as F3X4). κ and the branch lengths are all ML estimates. We use an expectation maximization approach (27) to solve the problem of multivariate optimization in the case of branch lengths [a similar approach is described in detail in (28,29)].
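The F3X4 calculation of codon equilibrium frequencies can be illustrated as follows. This is a minimal sketch under stated assumptions, not Selecton's implementation: it assumes in-frame DNA sequences, the standard genetic code's stop codons, and that codons containing gaps or ambiguous characters are simply skipped. The function name `f3x4_frequencies` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def f3x4_frequencies(sequences):
    """Codon frequencies as normalized products of position-specific
    nucleotide frequencies, over the 61 sense codons."""
    counts = [Counter(), Counter(), Counter()]
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set(BASES):  # skip gaps and ambiguous codons
                for pos, nuc in enumerate(codon):
                    counts[pos][nuc] += 1
    freqs = []
    for counter in counts:
        total = sum(counter.values())
        if total == 0:
            raise ValueError("no unambiguous codons found")
        freqs.append({b: counter[b] / total for b in BASES})
    raw = {}
    for codon in map("".join, product(BASES, repeat=3)):
        if codon not in STOP_CODONS:
            raw[codon] = (freqs[0][codon[0]] * freqs[1][codon[1]]
                          * freqs[2][codon[2]])
    norm = sum(raw.values())  # renormalize after excluding stop codons
    return {codon: value / norm for codon, value in raw.items()}


pi = f3x4_frequencies(["ATGGCCAAA", "ATGGCTAAG"])
print(len(pi), round(sum(pi.values()), 6))
```

Because only nine nucleotide frequencies (three free per position) are estimated from the data, the scheme needs far fewer parameters than estimating all 61 codon frequencies directly.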

An empirical Bayesian method for calculating ω values
The heart of the Selecton server is the calculation of the ω values at each codon position. In the previous version of the server, the ML method (4) was implemented as the sole method of calculation. Recently, we have shown that an empirical Bayesian method can significantly improve the accuracy of inference of conservation scores (30). While the ML method was found to have a relatively high level of false positives, the Bayesian method showed an improved specificity that reduced the level of false positives. The empirical Bayesian method is particularly superior to the ML method when the number of homologous sequences analyzed is small (30). Thus, an empirical Bayesian method of calculating ω values was implemented in the server (6). Following Yang (31), the distributions are approximated using eight discrete categories (the user may define a different number if desired) and the ω values are computed by calculating the expectation of the posterior ω distribution. It should be noted that although more reliable than the ML method, the empirical Bayesian method also suffers from inaccuracy in small data sets, mostly due to sampling errors in the estimation of parameters (such as the distribution shape parameters). Recently two alternatives have been proposed: the full Bayesian estimation (32) and the hierarchical Bayesian estimation (33). However, both alternatives are much more computationally intensive, and hence were not implemented in Selecton.
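The posterior expectation described above can be written out for a single site. The sketch below is illustrative only: it assumes that the discrete ω categories, their prior probabilities and the per-category site likelihoods have already been computed elsewhere (in practice from the fitted model and the phylogenetic tree), and all numbers and names are hypothetical.

```python
def posterior_mean_omega(omega_categories, prior_probs, site_likelihoods):
    """Return (posterior mean of omega, posterior probabilities) for one site.

    omega_categories: representative omega value of each discrete category
    prior_probs:      prior probability of each category
    site_likelihoods: likelihood of the site's data given each category
    """
    if not (len(omega_categories) == len(prior_probs) == len(site_likelihoods)):
        raise ValueError("all inputs must have the same length")
    joint = [p * l for p, l in zip(prior_probs, site_likelihoods)]
    evidence = sum(joint)
    if evidence <= 0:
        raise ValueError("site likelihood is zero under every category")
    posterior = [j / evidence for j in joint]
    mean = sum(w * q for w, q in zip(omega_categories, posterior))
    return mean, posterior


# Three hypothetical categories instead of the default eight, for brevity.
mean, posterior = posterior_mean_omega(
    omega_categories=[0.1, 0.5, 2.0],
    prior_probs=[0.5, 0.3, 0.2],
    site_likelihoods=[1e-5, 2e-5, 8e-5],
)
print(mean, posterior)
```

The posterior probabilities returned alongside the mean also indicate how much weight the data place on categories with ω above one, which is one natural way to express statistical support for a site.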

Genetic code
Eleven different genetic codes were implemented, including four different nuclear code variants and seven different mitochondrial code variants. This allows the analysis of genes from organisms and organelles which use nonstandard genetic codes.

Phylogenetic tree
By default, Selecton runs are carried out using phylogenetic trees that the server computes using the neighbor-joining algorithm (34). As input for the neighbor-joining algorithm, pairwise distances are computed applying the ML criterion under a codon model (35) which assumes no selection (ω = 1 for all sites and κ = 2). We note that in general, more accurate tree topologies lead to better estimates of parameters (36,37). However, it was previously shown that the detection of positive selection is in general robust to tree topology inaccuracies (6,38,39). Hence, the strategy we adopted in Selecton was to avoid the computationally intense search for the precise tree topology, yet to allow the user to provide a pre-computed phylogenetic tree as an additional input. This new feature enables users to supply a more accurate tree, if available. Additionally, users can supply the tree topology and have the server optimize its branch lengths.

Precision level
The precision level of the computations is defined by setting the cutoff (ε), which defines when two likelihood values have converged. Selecton allows the user to choose between three levels of precision, which also directly affect the speed of calculation: low (ε = 1), intermediate (ε = 0.1) and high (ε = 0.01). The default level of precision for Selecton runs is intermediate.
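The role of the cutoff can be illustrated with a small Python sketch. It assumes, as a simplification, that convergence is declared when two successive log-likelihood values differ by less than ε; the exact criterion used inside Selecton is not specified here, and the sequence of values below is invented for illustration.

```python
# Cutoff values for the three precision levels named in the text.
PRECISION = {"low": 1.0, "intermediate": 0.1, "high": 0.01}


def converged_log_likelihood(log_likelihoods, epsilon):
    """Walk through successive log-likelihood values and return
    (value, number of values consumed) once two consecutive values
    differ by less than epsilon."""
    previous = None
    for count, value in enumerate(log_likelihoods, start=1):
        if previous is not None and abs(value - previous) < epsilon:
            return value, count
        previous = value
    raise RuntimeError("log-likelihood did not converge")


# Hypothetical log-likelihoods from successive optimization rounds.
trace = [-120.0, -105.0, -101.0, -100.5, -100.45, -100.449]
for level, eps in PRECISION.items():
    print(level, converged_log_likelihood(trace, eps))
```

A smaller ε requires more optimization rounds before stopping, which is why the higher precision levels are slower.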

BIOLOGICAL EXAMPLE
We illustrate the power of Selecton to detect site-specific selection forces by analyzing the evolution of the TRIM5α protein, a protein that has recently been shown to have undergone extensive positive selection during the course of primate evolution (21,22). Furthermore, positively selected regions were found to correlate with the species-specificity determinants of the protein. Here, we wish to exemplify the ease with which Selecton enables detecting the species-specific viral restriction domains of TRIM5α.

Study of TRIM5α
TRIM5 is a member of the large tripartite motif family in primate genomes, characterized by having RING finger, B-box and coiled-coil domains, as well as an additional SPRY domain found in the α isoform (40). TRIM5α was found to account for HIV-1 resistance observed in rhesus cells (41,42). It is not yet known how TRIM5α mediates viral restriction, although a shorter, alternate transcript of the TRIM5 gene has been shown to be a ubiquitin ligase (43). TRIM5α restriction probably acts on the viral capsid (44), although direct physical interaction between TRIM5α and the capsid proteins has not yet been demonstrated.
TRIM5α variants from humans, rhesus monkeys and African green monkeys (AGM) display different but overlapping restriction specificities, which all have the following common property: each TRIM5α is unable to restrict retroviruses isolated from the same species, yet is able to restrict most retroviruses from other species (41). This indicates that TRIM5α is an important natural barrier to cross-species retrovirus transmission. This type of interaction between a host protein and a parasite protein leads to genetic conflict between the two proteins. Such a conflict may lead to rapid fixation of mutations that alter amino acids at the protein-protein interface, which is the hallmark of positive selection (6). Thus, it has been hypothesized that TRIM5α is in an antagonistic conflict with the retroviral capsid proteins. Sawyer et al. (21) analyzed the selection forces acting on TRIM5α and identified a patch of positively selected residues in the SPRY domain. This patch was identified as the species-specific determinant, which is sufficient and necessary for HIV restriction in rhesus monkey cells. Substitution of this patch from the human TRIM5α with the rhesus patch, and vice versa, conferred or abolished HIV-1 restriction, respectively (21). In fact, the region determining the species-specificity of the HIV-1 restriction was eventually mapped to two alternative positions in the rhesus SPRY domain (21). A single arginine to proline replacement at residue 332 of the human TRIM5α, or conversely the exchange of the six residues at positions 335-340 for the eight residues of the rhesus sequence, conferred on the human TRIM5α an enhanced ability to restrict HIV-1 (22).
To test the use of Selecton, 20 primate TRIM5α sequences (21) were used as input for the Selecton server. The server was run with the MEC model (log-likelihood = -6716; AICc score = 13 442) and compared with the M8a null model (log-likelihood = -6779; AICc score = 13 564). Since the AICc score of the MEC model is lower, we infer that the MEC model, which allows for positive selection, fits the TRIM5α data better than a model which does not. The results of the MEC analysis were projected by the server onto the primary sequence of the human TRIM5α (Figure 1). The full results of the run are available in the Gallery section of Selecton (http://selecton.bioinfo.tau.ac.il/gallery.html). The results show an abundance of yellow-colored sites, indicating that TRIM5α has undergone extensive positive selection. Specifically, the two determinants conferring species-specific HIV-1 restriction showed exceptionally high levels of positive selection (Figure 1; positions boxed in black), indicating that these sites have undergone excess amino-acid fixations during the course of primate evolution. In fact, the entire SPRY domain (sites 281-493) displays extensive positive selection, as opposed to the RING finger domain (sites 15-59), the B-box domain (sites 90-132) and the coiled-coil domain (sites 130-241), which display mostly purifying selection with some dispersed positively selected sites.

CONCLUSIONS
We describe here Selecton version 2.2, a web-based bioinformatics tool for the identification of site-specific positive selection and purifying selection in a protein.
The minimal input for the server consists of a file of homologous coding sequences. The server performs a codon-based alignment of the sequences, calculates ω values at each site, translates these ratios into selection scores and projects them onto the primary sequence or tertiary structure of the protein, allowing visual identification of blocks or patches of sites with similar ω values. Advanced options of the server include choosing the method of calculation, inputting a phylogenetic tree of the homologous sequences and choosing from amongst a number of evolutionary models implemented in the server. To demonstrate the effectiveness of Selecton, the server was run on a dataset of homologous TRIM5α primate sequences. Selecton correctly identified the species-specific restriction determinants of the protein.
Thus, this analysis emphasizes the power of Selecton to accurately identify sites undergoing positive selection, and to present these results in a clear and user-friendly way.

Figure 1. [...] (21) with the MEC model (8). Positive selection is colored in shades of yellow, and purifying selection is colored in shades of magenta. The two species-specific restriction determinants are indicated in boxes. Replacement of these positions with their rhesus equivalent positions leads to a reversal of restriction characteristics. Both determinants show a significantly high level of positive selection.