Sequence Harmony: Detecting Functional Specificity from Alignments

Multiple sequence alignments are often used for the identification of key specificity-determining residues within protein families. We present a web server implementation of the Sequence Harmony (SH) method previously introduced. SH accurately detects subfamily specific positions from a multiple alignment by scoring compositional differences between subfamilies, without imposing conservation. The SH web server allows a quick selection of subtype specific sites from a multiple alignment given a subfamily grouping. In addition, it allows the predicted sites to be directly mapped onto a protein structure and displayed. We demonstrate the use of the SH server using the family of plant mitochondrial alternative oxidases (AOX). In addition, we illustrate the usefulness of combining sequence and structural information by showing that the predicted sites are clustered into a few distinct regions in an AOX homology model. The SH web server can be accessed at www.ibi.vu.nl/programs/ seqharmwww.


INTRODUCTION
During the past years there has been a wide interest in studies of specificity-determining residues within protein subfamilies (1). Consequently, an increasing number of methods and web applications has become available that offer functional analyses of subtype specificity from multiple alignments (2)(3)(4)(5).
Previously we evaluated advantages and limitations of several of the state-of-the-art methods and introduced a new method called Sequence Harmony (SH) (6). SH accurately detects positions within an alignment that are responsible for functional differences between two protein subfamilies. To further facilitate the use of SH for a broad audience, we have implemented a comprehensive web server. The SH web server offers a fast, one-step analysis with a number of options, yielding results that can be interpreted easily.
In this article we will guide the user through all the steps of the SH web-application by means of a biologically relevant example. We will look for subtype specific sites for the two subfamilies of the alternative oxidases (AOX) protein family of plant alternative oxidases. The subtype specific sites found are the best candidates to explain the functional differences. Other relevant applications of this method include pathway specificity, ligand specificity, host-specific viral infection, viral disease progression differences and viral drug resistance.

Implementation
The SH method, as described previously (6), is currently implemented as an awk program. The main steps are as follows. Sequences are read from the alignment and separated into two user-specified groups. For each group separately and combined entropies (E) are calculated. SH values are calculated as SH ¼ 1 i.e. using the sum of the normalized frequencies of groups A and B. SH values range from zero for completely nonoverlapping residue compositions, to one for identical compositions. Next, sites are selected that have a SH score below a cutoff. Stretches of neighbouring selected sites are identified and the size of each of these stretches is assigned to the sites as the 'Rank'. Finally, selected sites are sorted on i) increasing SH, ii) decreasing Rank and iii) increasing entropy. This sorted list of selected sites is the primary result of the SH algorithm. Currently alignments of up to about 5 million residues (including gaps) can be processed with runtimes on the order of tens of seconds, excluding the generation of the high-quality output PyMol image.
The separate chains of the optional protein structure (PDB file) are aligned with the input alignment using the 'profile' option of Muscle (7). Subsequently, the chains with the highest average SH score with respect to the input *To whom correspondence should be addressed. Tel: þ31 20 598 7649; Fax: þ31 20 598 7653; Email: heringa@few.vu.nl ß 2007 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. alignment are selected for graphical display as an image in an interactive Jmol applet.

Use of the Server
The SH web server contains basic and advanced options. Sanity of the input options is checked, e.g. whether an input alignment is uploaded or pasted, and not both, whether it is proper FASTA format. Problems are reported with an error message and offending input fields are highlighted. An example screenshot of the main page and input form is shown in Figure 1. The main input is a multiple sequence alignment of the protein family and a subdivision into two groups. Sequence data in FASTA format is case-insensitive and may be split over multiple lines; gaps should be represented with a dash ('-'); and lines beginning with a semi-colon (';') are ignored as comment. From the alignment the program automatically generates a list of sequence IDs. The user defines the two input groups by selecting the identifier of the first sequence of the second group. By default, a cutoff of 0.2 for the SH score is used.
Advanced features allow more control over the analysis and output, but the default settings usually suffice for a basic analysis. The SH cutoff can be adjusted between zero (allowing no compositional overlap) and one (allowing full overlap). A higher cutoff will lead to the selection of a larger number of sites. A reference sequence can be selected from the alignment and a starting position can be set to provide a reference numbering in the output tables. Additionally, a PDB identifier can be specified by its four-letter code, to visualize the selected sites in the protein structure. Alternatively, a PDB file can be uploaded. The PDB file is automatically aligned with the multiple alignment, or manually when the reference position is set.
The output is a sorted table and optional graphical representations of the selected sites as an image (generated by PyMol, www.pymol.org) and as an interactive Jmol applet (8). The Jmol applet includes buttons to set the SH cutoff for displaying selected sites, and to show or hide nonaligned chains in the PDB file. An example screenshot of the main output page is shown in Figure 2. An additional, non-sorted, table is available that lists all sites in the alignment. The tables are coloured from red (predicted subtype specific sites) to blue (sites not predicted).

The AOX family of alternative plant oxidases
The cyanide-insensitive plant alternative oxidase (AOX) is a mitochondrial non-haeme quinol oxidase (9). An established function of AOX is thermogenesis in the spadices of Aroid lilies, but little is known about AOX function in other tissues (10). AOX is encoded in two discrete gene subfamilies (11). AOX1 is found in monocot and dicot plants and is induced by stress stimuli (11,12). AOX2 is usually constitutive and has at present not been found in monocot plants (11). We retrieved 31 AOX sequences from the NCBI nonredundant database (www.ncbi.nih.gov); 24 of the AOX1 and 7 of the AOX2 subfamily. Sequences were aligned using PSI-Praline (13,14) using secondary structure prediction information from PSIPRED (15). The alignment is available as Supplementary Data. Lacking an experimental AOX structure, we constructed a homology model using of the catalytic iron-binding N-terminal domain. Swiss-Model (16) was used to align the Sauromatum guttatum AOX1 sequence with the template structure 1AFR (17), following the description of Andersson and Nordlund (18). The PDB file is provided as Supplementary Data.

Sample input
The multiple alignment obtained was uploaded in the appropriate box of the SH web server and the first AOX2 sequence (AOX2_Arab_th) was selected as the first sequence of the second group ( Figure 1). The default SH cutoff of 0.2 was used, meaning that partial overlap in residue composition is allowed. We chose AOX1 from Sauromatum guttatum as reference sequence so that the subtype specific sites obtained will be numbered accordingly (18). The PDB file of the model structure was uploaded to the SH web server.

Sample output
The results page of the SH server ( Figure 2) shows at the top a summary of the input parameters, and optionally the graphical representations of the protein structure. Below that is the table containing all sites with a SH score (SH) below the cutoff value (default 0.2). These are sorted first on increasing SH score, then on decreasing rank (Rnk, number of neighbouring sites below the cutoff) and finally on increasing total Entropy (AB) of the alignment position, as described previously (6). Position Ali and Ref show the sequence position in the alignment and reference sequence, respectively. The 'Consensus' columns give all residue types present in groups A and B, respectively, in order of decreasing frequency and in lowercase when the frequency is less than half of the highest. In addition, a link is provided to the raw table, that holds all information of the complete alignment without selection or ranking. The raw table is available as Supplementary Data.
For the AOX family we found nine sites with a SH value of zero (bright red in Figure 2). An additional eight sites have SH values between zero and 0.2 (from light red to white). In the 'Rnk' column, a value of 2 signifies a stretch of two consecutive sites with SH values below the cutoff (alignment positions 244-245 and 121-122). Interestingly, out of nine sites with zero harmony, four have non-zero entropy in at least one group, i.e. they are not totally conserved within both groups (alignment positions 136,245,269,354). This is also reflected in the 'Consensus' columns, which show multiple residue types occurring.
The  harmony and six of low harmony. Four low-harmony sites on the upper-left (Phe224, Val227, Ala228 and Asn284) are close to the Glu and His iron binding residues (18) (Figure 3), and could conceivably play a role in ligand binding or activation mechanisms. Three other lowharmony sites, Arg199, Tyr249 and Gly252 are located around a putative membrane-binding region (18), and could conceivably play a role in modulating the proteinmembrane interactions or substrate access. The functional prediction of SH from the multiple alignment seems to be consistent with the structural prediction using homology modelling, although a different structural model could lead to different conclusions.

CONCLUSION
Our SH web server identifies putative subtype specific sites based on an alignment and selection of two groups as input only. In addition, sites can be easily linked to structural information using a structure from the PDB directly, or, as shown in our example, using a homology model of the protein. Using this structural information, selected subtype specific sites can be grouped into spatial clusters of sites that are likely to share functional relationships.
The combination of the SH algorithm and protein structural information has yielded a useful tool to interpret multiple sequence alignments, and to guide subsequent selection of interesting sites for experimental investigation.

SUPPLEMENTARY DATA
The AOX input alignment, raw output table, 'coloured' PDB and pymol script are available as Supplementary Data. In addition, the output is available at www.ibi. vu.nl/programs/seqharmwww/showcase_aox.