
We present BLAST on Orthologous groups (BLASTO), a modified BLAST tool for searching orthologous group data. It treats each orthologous group as a unit and outputs a ranked list of orthologous groups instead of single sequences. By filtering out redundancy and putative paralogs, sequence comparisons to orthologous groups, instead of to single sequences in the database, can improve both functional prediction and phylogenetic inference. BLASTO computes the significance score of each orthologous group based on the individual BLAST hits in the orthologous group, using the number of taxa in the group as an optional weight. This allows users to control the species diversity of the orthologous groups. BLASTO incorporates the best-known multispecies ortholog databases, including NCBI Clusters of Orthologous Groups, NCBI euKaryotic Orthologous Group database, OrthoMCL, MultiParanoid and TIGR Eukaryotic Gene Orthologues database, and offers a useful platform to integrate orthology information into functional inference and evolutionary studies of individual sequences. BLASTO is accessible online at http://oxytricha.princeton.edu/BlastO.


INTRODUCTION
With the accelerating discovery of new sequences from increasingly diverse species and environmental sources, sequence searches and comparisons have become common practice for obtaining clues about a sequence's function and its evolutionary relationships to others in the databases. Consistent information about orthology provides the basis for inferring phylogenetic relationships (1) and is increasingly critical to sequence annotation in the absence of direct functional studies for every gene (2)(3)(4)(5)(6).
In the post-genomic era, many studies have classified sets of orthologous sequences among different species, and therefore many multispecies databases of orthologous groups are available: NCBI Clusters of Orthologous Groups (COG) database for unicellular organisms (2) contains putative orthologous groups among 66 genomes that include mostly prokaryotes and three eukaryotes. NCBI euKaryotic Orthologous Group database (KOG) (7) contains seven fully sequenced eukaryotic genomes. OrthoMCL DB (8) includes 55 genomes with more evenly distributed prokaryotic and eukaryotic species. MultiParanoid (9) contains four animals: human, fly, worm and sea squirt. TIGR Eukaryotic Gene Orthologues database (EGO) (10) is built on the consensus EST cluster sequences from 88 eukaryotic species.
However, most of the currently available tools for searching these ortholog databases either only permit searches by gene/group names, which limits the search to the sequences and species that are in the ortholog databases, or only allow conventional BLAST-based sequence comparison, which outputs results as individual sequences without taking advantage of the orthologous group information contained in the databases. Comparisons at the level of orthologous groups, on the other hand, can provide a list of significant similarities to orthologs, by filtering out paralogs and xenologs as well as other redundancy and random noise in the individual sequence hits. Such comparisons would yield more consistent conjectures regarding the function of the query sequence, since orthologous sequences are more likely to conserve protein function than paralogous and xenologous sequences (2)(3)(4)(5)(6)(11). Although tools exist for protein molecular function prediction based on phylogenomic methods, such tools either require the whole proteome data, such as COG (2), or require knowledge of the phylogenetic position of the queried species on a reference tree to infer ortholog information independently from the available ortholog databases (5,6,11). These methods are not applicable to individual sequences recovered from unknown species, such as those from environmental samples, or from species whose phylogenetic positions are still ambiguous. Currently, only KOGnitor (http://www.ncbi.nlm.nih.gov/COG/grace/kognitor.html) incorporates orthologous group information into its search results and can be used on individual sequences without knowledge of the phylogenetic placement of the query species; however, KOGnitor only searches NCBI KOG.
As it is often a challenge in phylogenetic reconstruction to identify the orthologous sequences and to avoid other types of homologs, sequence comparison at the level of orthologous groups can also facilitate accurate phylogenetic reconstruction by using the well-established orthology relations in the ortholog databases.
We created a BLAST on Orthologous groups (BLASTO) web server, a modified BLAST tool designed to search orthologous group data. It compares individual sequences to the sequences in various ortholog databases and, without knowledge of the query species or sequence, allows the user to identify a group of sequences that are putatively orthologous to the query sequence.
A regular BLAST search produces a list of single sequences ranked by the statistical significance of their similarity to the query sequence. BLASTO, however, uses the information in the corresponding ortholog databases to group orthologous sequences together as a unit, computes a single significance score for each orthologous group based on the individual BLAST hits within the group, and outputs a ranked list of orthologous groups, as well as single sequences from the database that are not part of any orthologous group. The significance score for each orthologous group can be further weighted in a user-defined fashion by the number of taxa in the group. To facilitate functional inference of the query sequence, BLASTO lists the annotation of orthologous groups and individual genes, when such information is available. The server also incorporates ClustalW (12) to allow one to infer the putative evolutionary relationships among the orthologous group sequences and the query sequence via multiple sequence alignment and distance tree construction.
By incorporating the best-known multispecies ortholog databases, including NCBI COG (2), NCBI KOG (7), TIGR EGO (10), OrthoMCL DB (8) and MultiParanoid (9), BLASTO also serves as a hub for all these ortholog databases. Users can select any of these databases for their query.

RATIONALE AND METHODS
Currently, for sequences from unknown species or those that lack a completely sequenced genome, orthology prediction depends mainly on one-way best hits to completed genomes with precomputed information about orthology. However, best hits to different completed genomes can be inconsistent with each other, especially when the evolutionary distance is large. This could be due to the increase in variance in measurable molecular evolutionary distance over time, gene fusion events in the query species, or ancient duplications in the genomes used for comparison (which make the query sequence orthologous to more than one orthologous group in the database, as a result of duplications that preceded speciation events). Indeed, when we compared 9934 randomly selected sequences from Tetrahymena thermophila to the seven completed genomes in the NCBI KOG database, more than half (5261) of the Tetrahymena sequences had single best hits in different species that yielded inconsistent orthologous group assignments. BLASTO reports the best-scored orthologous group, instead of individual sequences. By averaging the significance scores for single sequences in the orthologous groups (see below for details), BLASTO reduces the noise in the inferred genetic distances (assuming independent evolution after species' divergence). Among the 9934 randomly selected Tetrahymena sequences, 1232 had as their single best hit an individual sequence in a different orthologous group than the BLASTO top-ranked orthologous group. By reporting all orthologous groups with comparable scores, BLASTO can also help reveal ancestral duplications or gene fusion events.
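The inconsistency described above can be illustrated with a minimal sketch. It assumes the BLAST hits for one query are available as (species, group, E-value) tuples; this layout and the function names are hypothetical and are not part of BLASTO.

```python
def best_hit_groups(hits):
    """Map each species to the group of its lowest-E-value hit.

    `hits` is an iterable of (species, group_id, e_value) tuples.
    This flat layout is an assumption made for illustration.
    """
    best = {}
    for species, group_id, e_value in hits:
        if species not in best or e_value < best[species][1]:
            best[species] = (group_id, e_value)
    return {species: group for species, (group, _) in best.items()}


def is_consistent(hits):
    """True when the per-species best hits all point to the same group."""
    return len(set(best_hit_groups(hits).values())) <= 1


# Toy query whose best hits in three genomes disagree on the group.
toy_hits = [
    ("human", "KOG_A", 1e-30),
    ("human", "KOG_B", 1e-10),
    ("yeast", "KOG_B", 1e-25),
    ("fly", "KOG_A", 1e-20),
]
```

Here `is_consistent(toy_hits)` is false because the best yeast hit falls in a different group than the best human and fly hits, which is the situation a group-level score is meant to resolve.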
Tetrahymena represents a highly divergent lineage: for about 30% of the tested Tetrahymena sequences, the scores of the two top-ranked NCBI KOG orthologous groups differ by less than 10%, reflecting either gene fusion events in Tetrahymena (whether authentic evolutionary events or artifacts of annotation) or ancient duplication events in the genomes in the NCBI KOG database. Like any computational algorithm for orthology prediction, the results from BLASTO need experimental confirmation, but we expect BLASTO to become a general tool for inferring information about orthology, particularly for sequences of unknown species origin or from partially sequenced genomes.
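A small sketch of how groups with comparable scores might be flagged follows. It assumes the 10% difference is measured relative to the top score, which the text does not specify; the function name and the input dictionary are hypothetical.

```python
def comparable_groups(group_scores, tolerance=0.10):
    """Return group ids whose score is within `tolerance` of the best score.

    `group_scores` maps group_id -> score (higher is better). Measuring the
    difference relative to the top score is an assumption of this sketch.
    """
    if not group_scores:
        return []
    ranked = sorted(group_scores.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1]
    if top <= 0:
        return [ranked[0][0]]
    return [gid for gid, score in ranked if (top - score) / top < tolerance]


toy_scores = {"KOG_A": 100.0, "KOG_B": 95.0, "KOG_C": 50.0}
```

With these toy scores the function returns the two leading groups, marking the query as a candidate for a fusion or an ancient duplication that deserves manual inspection.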
Similar to regular BLAST, for each query sequence, BLASTO begins by performing a BLAST search against all the sequences in the selected ortholog database (Figure 1A). At this point, a regular BLAST search would output a list of single sequences ranked by their statistical significance. BLASTO, however, uses the information in the corresponding ortholog databases to group orthologous sequences together as a unit (Figure 1B), and then computes a single significance score for each orthologous group, based on the individual BLAST hits within the group (Figure 1C), as follows. The score of each orthologous group (S_g) is computed as the average negative log-likelihood of the BLAST hits among all the sequences in the orthologous group:

$$S_g = -\frac{1}{N} \sum_{i=1}^{N} \log P_i$$

where N is the number of sequences in the orthologous group, and

$$P_i = 1 - e^{-E_i}$$

where E_i is the E-value of the BLAST hit (13). P_i = 1 for sequences without a significant BLAST hit.
The orthologous group scores computed as described above tend to be biased against orthologous groups with wider taxonomic span (and therefore more diverged orthologous sequences), but the user may prefer orthologous groups with different levels of species diversity. Orthologous groups with wider taxonomic span are likely to be more ancient and universal, while orthologous groups with narrower taxonomic span are likely to be more recent or confined to a subgroup of a species or a lineage. To let users control their preference for orthologous groups with different taxonomic spans, the orthologous group scores should ideally be further weighted by the degree of species diversity in the group measured on a consensus species tree. Unfortunately, there is currently no such stable consensus tree available (14)(15)(16)(17)(18), nor do trees based on ribosomal RNA sequences necessarily correlate well with the species tree (19). Alternatively, we have considered measuring the species diversity represented in each orthologous group by the number of steps to reach a last common ancestor in the NCBI taxonomy database, but this method will bias in favor of clades that contain many closely related species, subspecies or strains (20). Therefore, as an approximation, we currently weight the orthologous group scores by the number of taxa in the group (T_g) divided by the total number of species included in the ortholog database (T). The weighted score is:

$$S'_g = S_g \left( \frac{T_g}{T} \right)^{W}$$

where W is the user-selected weight. When W is +1, the orthologous groups are ranked by their number of taxa in a descending manner; when W is -1, the orthologous groups are ranked by their number of taxa in an ascending manner.
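The scoring scheme can be sketched in a few lines of Python. Several details here are assumptions rather than the server's actual code: the conversion P = 1 - exp(-E) (the standard BLAST relation between E-value and P-value), the natural logarithm, the form S_g * (T_g / T)^W for the weighted score, and the small floor that guards against E-values reported as 0. The function names are hypothetical.

```python
import math


def hit_p_value(e_value):
    """P-value for one sequence; None means no significant hit (P = 1)."""
    if e_value is None:
        return 1.0
    # -expm1(-E) equals 1 - exp(-E) but keeps precision for tiny E-values.
    return -math.expm1(-e_value)


def group_score(e_values, floor=1e-300):
    """Average negative log P over all N sequences in the group.

    `e_values` has one entry per group member: an E-value, or None when the
    member has no significant hit. `floor` avoids log(0) when BLAST reports
    an E-value of 0 (an assumption of this sketch).
    """
    total = 0.0
    for e_value in e_values:
        total += -math.log(max(hit_p_value(e_value), floor))
    return total / len(e_values)


def weighted_score(s_g, taxa_in_group, taxa_in_db, w=0.0):
    """Weight a group score by (T_g / T) ** W; W = 0 leaves it unchanged."""
    return s_g * (taxa_in_group / taxa_in_db) ** w
```

For example, a group of two sequences in which only one member hits the query at E = 1e-50 receives half the negative log P-value of that single hit, because the member without a hit contributes zero to the sum but still counts toward N.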
The orthologous groups are then ranked by their weighted group score and reported to the user as a list, together with the significant individual hits within each group (a query sequence need not have a significant hit to every sequence in a group). The significant BLAST hits among individual sequences that are not part of any orthologous groups are also ranked by their score and displayed as a separate list. In the future, we plan to offer users different optional weights, including taxa number, the spanning distance on a tree constructed with ribosomal RNA sequences, or a user-defined species tree.
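The grouping and ranking step might look like the following sketch. The input dictionaries, the function name, the P = 1 - exp(-E) conversion and the (T_g / T)^W weighting are assumptions made for illustration; only the overall flow (group the hits, score each group, rank, list ungrouped hits separately) comes from the description above.

```python
import math


def rank_groups(hits, groups, species_of, total_taxa, w=0.0):
    """Rank orthologous groups and list ungrouped hits separately.

    hits:       {seq_id: e_value} for significant BLAST hits only
    groups:     {group_id: [seq_id, ...]} full group membership
    species_of: {seq_id: species} for every grouped sequence
    """
    ranked, grouped = [], set()
    for group_id, members in groups.items():
        grouped.update(members)
        matched = [s for s in members if s in hits]
        if not matched:
            continue  # the group shares no significant hit with the query
        neg_logs = [-math.log(max(-math.expm1(-hits[s]), 1e-300))
                    for s in matched]
        # Members without a hit add 0 to the sum but still count in N.
        s_g = sum(neg_logs) / len(members)
        taxa = len({species_of[s] for s in members})
        score = s_g * (taxa / total_taxa) ** w
        ranked.append((group_id, score, sorted(matched, key=hits.get)))
    ranked.sort(key=lambda entry: entry[1], reverse=True)
    ungrouped = sorted((s for s in hits if s not in grouped), key=hits.get)
    return ranked, ungrouped


toy_hits = {"a1": 1e-40, "a2": 1e-20, "b1": 1e-30, "x": 1e-5}
toy_groups = {"A": ["a1", "a2"], "B": ["b1", "b2", "b3"]}
toy_species = {"a1": "human", "a2": "yeast",
               "b1": "human", "b2": "yeast", "b3": "fly"}
```

Calling `rank_groups(toy_hits, toy_groups, toy_species, total_taxa=3)` places group A first, lists its hits in order of increasing E-value, and returns sequence `x` in the separate list of ungrouped hits.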

BLASTO SERVER
BLASTO is built upon a MySQL relational database and a BLAST search engine on a Unix server. The orthologous group data and annotations were downloaded from the corresponding websites, imported into the MySQL database and will be updated every 6 months. The current server includes data from NCBI COG (March, 2003
The BLASTO interface is very similar to the standard BLAST input interface that users are familiar with. It allows the user to submit either a DNA or a protein sequence with many of the common BLAST parameter choices, such as filtering, genetic codes, substitution matrices and frame-shift penalties. The user can select any of the following orthologous group databases for their query: NCBI COG, NCBI KOG, OrthoMCL DB, MultiParanoid and TIGR EGO. The user can also select different values for the BLASTO-specific parameter, 'weight by group taxa number,' to control the preference for orthologous groups containing different levels of species diversity. In general, higher (positive) weight favors orthologous groups that contain more taxa; these groups tend to have greater levels of species diversity. Lower (negative) weight favors orthologous groups with fewer taxa; these groups tend to have less species diversity.
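As an illustration of how group membership and taxa counts could be held in a relational store, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. The schema, table names, column names and toy rows are all hypothetical; the actual BLASTO schema is not described here.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE ortho_group (
    group_id   TEXT PRIMARY KEY,
    source_db  TEXT NOT NULL,
    annotation TEXT
);
CREATE TABLE sequence (
    seq_id   TEXT PRIMARY KEY,
    species  TEXT NOT NULL,
    group_id TEXT REFERENCES ortho_group(group_id)  -- NULL if ungrouped
);
"""


def build_demo_db():
    """Create an in-memory database filled with a few toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO ortho_group VALUES (?, ?, ?)",
        [("G1", "KOG", "toy group one"), ("G2", "KOG", "toy group two")],
    )
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?, ?)",
        [("s1", "human", "G1"), ("s2", "yeast", "G1"), ("s3", "fly", "G1"),
         ("s4", "human", "G2"), ("s5", "worm", None)],
    )
    return conn


def group_stats(conn):
    """Return {group_id: (number of sequences N, number of taxa T_g)}."""
    rows = conn.execute(
        "SELECT group_id, COUNT(*), COUNT(DISTINCT species) "
        "FROM sequence WHERE group_id IS NOT NULL GROUP BY group_id"
    )
    return {gid: (n_seq, n_taxa) for gid, n_seq, n_taxa in rows}
```

A single aggregate query of this kind yields both the group size and the taxa count that the group score and the taxa weight depend on.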
The BLASTO output is composed of three parts (Figure 2A). The first part is a list of orthologous groups containing sequences with significant BLAST hits to the query sequence. The groups are sorted by the weighted group score, and the individual sequences in each group are sorted by their BLAST score. The second part is a separate list of individual sequences that are not part of any orthologous group, sorted by their BLAST score. The third part, as in many other BLAST family programs, displays the alignments of the query sequence with individual sequences. Descriptions of the orthologous groups and individual sequences are listed when available. External links are provided for the orthologous groups to their original ortholog databases and for the individual sequence entries to their associated database resources.
Users can retrieve a dynamic summary of any of the listed orthologous groups by clicking on the orthologous group names (Figure 2B). The orthologous group summary page lists all the information on the orthologous group and the included sequences, and indicates whether each specific sequence shares significant BLAST hits with the query sequence. Although individual sequences can be retrieved by clicking on their names at any time, one can also retrieve in batches, from the orthologous group summary page, either all sequences in the group or only those with significant BLAST hits, making it convenient for users to perform further local analyses among the sequences. The user can also choose to run ClustalW (12) on our server with the query sequence and sequences (all or those with BLAST hits) from the orthologous group when the total number of sequences is smaller than 50. The ClustalW result page (Figure 2C) displays the multiple sequence alignment and an unrooted neighbor-joining distance tree from which putative phylogenetic relationships between the query and the orthologous group sequences can be inferred. When the total number of sequences is greater than 50, users are advised to batch download the sequences and perform the analysis locally.
We evaluated the performance of the BLASTO server using a list of 2000 sequences randomly chosen from the orthologous group databases: We first masked all the sequences from the same species as the test sequence in the corresponding ortholog database, which effectively decreased the number of species in the ortholog database by one. We then ran BLASTO on the masked database and compared the highest-ranked orthologous group in the BLASTO result with the original orthologous group assignment of the test sequence. Assuming the original orthologous group assignments in the orthologous group databases reflect the correct orthology relations, the results show that BLASTO can infer the correct orthology information in 85% of the cases.
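The masking procedure can be expressed as a short leave-one-species-out loop. This is a sketch under stated assumptions: the database is reduced to a dictionary of (species, group) records, and `predict_group` is a hypothetical callable standing in for a BLASTO run against the masked database.

```python
def mask_species(db, species):
    """Return a copy of `db` without any sequence from `species`.

    `db` maps seq_id -> (species, group_id); this layout is an assumption.
    """
    return {sid: rec for sid, rec in db.items() if rec[0] != species}


def evaluate(db, test_ids, predict_group):
    """Fraction of test sequences whose top predicted group is correct.

    `predict_group(seq_id, masked_db)` stands in for a search of the masked
    database and must return the top-ranked group id (or None).
    """
    correct = 0
    for seq_id in test_ids:
        species, true_group = db[seq_id]
        masked_db = mask_species(db, species)
        if predict_group(seq_id, masked_db) == true_group:
            correct += 1
    return correct / len(test_ids)
```

Masking every sequence from the test sequence's own species prevents the test sequence, and its within-species paralogs, from being retrieved as trivial hits, so the prediction has to rely on sequences from other species.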

SUMMARY
BLASTO integrates orthology information into BLAST searches, and reports orthologous groups that share significant similarity with the query sequence. Because it is built upon existing and well-received orthologous group databases that use only completed genomes, its orthology predictions rest on the established orthology assignments in those databases. BLASTO is designed for individual sequence searches, requires no local computational effort from the user, and provides a simple platform to use orthologous group information to infer the possible function of a query sequence and its putative phylogenetic relationships to other sequences in the databases. With the discovery of more and more novel sequences, we expect this web server to become a valuable search engine and an integrated ortholog data hub for the molecular evolution and comparative genomics community.
This work is supported by National Institute of General Medical Sciences Grant GM59708 and National Science Foundation Grant 0622112. Funding to pay the Open Access publication charges for this article was provided by the National Science Foundation.