Annotation and cross-indexing of array elements on multiple platforms.

On the surface, transcript profiling using microarrays seems to offer a way of looking at the global response of the cell to perturbation, with a focus on changes in gene expression. The difficulty, however, is that the response of a particular gene is actually measured on the array by an element that is a short, defined nucleic acid sequence. Sequences that map back to the same genetic locus may actually be given different names and descriptions when they are deposited in public sequence databases; when such sequences are used in microarray construction, elements that monitor the same genetic locus may have different names and descriptions. The algorithm described here uses a hierarchical approach to assign a single best annotation to the elements in a given microarray in such a fashion that elements from one microarray platform may be cross-indexed with those of another. The algorithm relies on the nucleic acid accession number for a given array element, and uses that to retrieve annotation from the most recent versions of LocusLink and UniGene. Both database resources are searched, with a priority being given to annotation derived from the curated LocusLink database. In lieu of annotation found in these databases, the default GenBank annotation is used. As a final outcome, a cross-chip identifier is generated that may be used to cross-index array elements. The program is available as a practical extraction and report language (Perl) script that can run under any Perl interpreter.

On the surface, microarrays and other genomic technologies offer the toxicologist a look at the transcript levels for hundreds to thousands of genes. However, although toxicologists and cell biologists think in terms of genes and pathways, these technologies actually measure nucleic acid sequences. Thus, the challenge is to clearly associate a given nucleic acid sequence with the most current and consistent information on the gene of which it is part. This association is complicated by the fact that the same sequence can be submitted to public databases from several sources that may assign it different names and descriptions. For example, the gene N-myc downstream regulated (Ndrg1) (LocusID 10397; http://www.ncbi.nih.gov/LocusLink/) was originally cloned and submitted by three laboratories as separate entries with different names: RTP (GenBank accession no. D87953; http://www.ncbi.nih.gov/GenBank), a homocysteine-respondent gene in vascular endothelial cells (Kokame et al. 1996); DRG1 (GenBank accession no. X92845), a gene upregulated during colon epithelial cell differentiation (Van et al. 1997); and CAP43 (GenBank accession no. AF004162), a gene specifically induced by Ni2+ compounds (Zhou et al. 1998). All three sequences are identical and represent the same gene. Microarrays are built using individual sequences or clones that are annotated in this fashion, and thus identifying the microarray elements (i.e., spots) on a single array or on different arrays that represent a certain gene can be a frustrating exercise.
Our approach to annotating microarray elements makes use of two public databases: UniGene (http://www.ncbi.nih.gov/UniGene/; Wheeler et al. 2000) and LocusLink (http://www.ncbi.nih.gov/LocusLink/; Pruitt and Maglott 2001). Whereas UniGene is an experimental system for grouping GenBank sequences (http://www.ncbi.nih.gov/GenBank/) into gene-oriented clusters, LocusLink is a database of curated sequence and descriptive information about genetic loci. Together these resources allow us to map a given microarray element to a certain gene, using UniGene and the GenBank accession number of the element, and to annotate that gene using LocusLink information. Furthermore, the process for doing so is automated with a computer script that can be run on a regular basis to make use of current database information. Although our approach appears to be similar to that taken by the DRAGON database (http://pevsnerlab.kennedykrieger.org/dragon.htm; Bouton and Pevsner 2000) and the DAVID software (http://apps1.niaid.nih.gov/david/upload.asp; Dennis et al. 2003), ours seeks to create a single best annotation for a sequence and, based upon this hierarchical process, to generate a cross-chip ID. Although there are caveats to this approach, the results show that it generally allows for intra- and interplatform identification of microarray elements representing a single gene. This approach has been applied to comparing results generated in the multilaboratory genomics research program coordinated by the International Life Sciences Institute (ILSI) Health and Environmental Sciences Institute (HESI) Committee on the Application of Genomics to Mechanism-Based Risk Assessment.

Materials and Methods
Algorithm rationale. Most developers of microarrays, either private or commercial (e.g., Affymetrix, Inc., Santa Clara, CA), will provide for each array element (i.e., probe) a GenBank accession number indicating the sequence or clone that the element represents or is derived from. On the other hand, the descriptive information for such GenBank entries, or for the locus with which they are associated, may change as new information is deposited in the public databases, especially UniGene and LocusLink. Furthermore, UniGene and LocusLink can serve as sequence "Rosetta stones" where a) UniGene serves to collate accession numbers, b) UniGene integrates with LocusLink, c) LocusLink serves as a curated annotation database with canonical gene names and curated gene information, and d) LocusLink integrates with other information such as OMIM (Online Mendelian Inheritance in Man). To represent the best information for a particular microarray element, a cross-chip ID (XChipID) can be created based upon UniGene and LocusLink information, as described below.
Algorithm and logic flow. The logic flow of annotation is illustrated in Figure 1. Essentially, the program searches the UniGene database for the accession number in question. If the accession number is referenced in UniGene, the next step is to seek information in LocusLink, using the UniGene Cluster ID. If the accession number is not referenced in UniGene, then the LocusLink database is checked for the accession number. (Some accession numbers are referenced in LocusLink but not in UniGene.) If the accession number is not referenced in either the UniGene or LocusLink databases, then the annotation in GenBank associated with that accession number is used. As noted, an XChipID is constructed on the basis of the best ID available, a LocusID being preferred to a UniGene ID and, if neither is found, a GenBank accession number. The prefix to the XChipID indicates the origin of the identifier (LL., LocusLink; Rn., rat UniGene; Ac., GenBank).
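The logic flow described above can be sketched in a few lines. The following Python fragment is an illustration only and is not the authors' Perl code: the function name make_xchip_id, the three lookup tables, and all identifiers in the example data are hypothetical, and the exact string layout of the XChipID (prefix joined directly to the identifier, with UniGene cluster IDs such as "Rn.2" used unchanged) is an assumption, since the text specifies only the prefixes and the order of preference.

```python
from typing import Dict, Optional


def make_xchip_id(
    accession: str,
    acc_to_cluster: Dict[str, str],    # GenBank accession -> UniGene cluster ID
    cluster_to_locus: Dict[str, str],  # UniGene cluster ID -> LocusID
    acc_to_locus: Dict[str, str],      # GenBank accession -> LocusID (direct LocusLink reference)
) -> str:
    """Return a cross-chip ID, preferring LocusLink, then UniGene, then GenBank."""
    cluster: Optional[str] = acc_to_cluster.get(accession)
    if cluster is not None:
        # Accession is referenced in UniGene: look up LocusLink via the cluster ID.
        locus = cluster_to_locus.get(cluster)
        if locus is not None:
            return "LL." + locus
        # Assumption: the cluster ID already carries the species prefix (e.g. "Rn.").
        return cluster
    # Not referenced in UniGene: check LocusLink directly for the accession number.
    locus = acc_to_locus.get(accession)
    if locus is not None:
        return "LL." + locus
    # Referenced in neither database: fall back to the GenBank accession number.
    return "Ac." + accession


# Hypothetical lookup tables; none of these identifiers are real database entries.
acc_to_cluster = {"ACC001": "Rn.1", "ACC002": "Rn.1", "ACC003": "Rn.2"}
cluster_to_locus = {"Rn.1": "11111"}
acc_to_locus = {"ACC004": "22222"}

for acc in ["ACC001", "ACC002", "ACC003", "ACC004", "ACC005"]:
    print(acc, make_xchip_id(acc, acc_to_cluster, cluster_to_locus, acc_to_locus))
```

In this sketch ACC001 and ACC002 receive the same XChipID because they fall in the same UniGene cluster, which is the property that allows elements to be cross-indexed.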

Input files and software programs. The files obtained from the National Center for Biotechnology Information (NCBI) are listed in Table 1, along with the key value and cross-indexed values obtained from each file. Data reported here made use of Rattus norvegicus UniGene Build no. 117 and LocusLink data current to 27 May 2003. Scripts (i.e., program code) were written in Perl, version 5.6.1, a programming language developed in 1987 by Larry Wall as Open Source software (http://www.perl.org). Perl scripts are text-based programs run by an interpreter program, which has been developed for almost every operating system (e.g., Mac, PC, UNIX). Six Perl scripts were developed: UgXRef.pl to extract data from the Rn.data UniGene file; UgDupe.pl to examine UniGene data for duplicate entries; ChipXAnno.pl to collate data from the LocusLink, UniGene, and microarray definition files and carry out the annotation; ChipDupe2.pl to examine the microarray annotation file for multiple entries based on the XChipID; ChipCompare8.pl to compare two different microarray annotation files for overlapping entries; and XChipData.pl to merge data sets from two different microarray platforms. The outputs of all programs are simple text files, most of which are tab delimited, that can be imported into analysis programs such as Microsoft's Excel and Access (Microsoft Corp., Redmond, WA) and Spotfire DecisionSite. All these programs have been run in a disk operating system (DOS) command line window using ActivePerl (binary build 629; http://www.activestate.com), although after conversion of the end-of-line sequence they also run under UNIX. UgXRef.pl processes UniGene files and as such is memory intensive: for large UniGene files (e.g., for mouse and human), this script must be run on either DOS or UNIX systems with > 1 GB RAM.
The scripts are small and are available from the web site for the HESI Committee on the Application of Genomics to Mechanism-Based Risk Assessment (http://hesi.ilsi.org/publications/index.cfm?pubentityid=120).
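The extraction step performed on the UniGene data file can be illustrated as follows. This Python sketch is not UgXRef.pl; it assumes a flat-file layout in which each cluster record begins with an ID line, may carry a LOCUSLINK line, lists its member sequences on SEQUENCE lines containing an ACC= field, and ends with a "//" line. The function name parse_unigene and the sample records are hypothetical, and the layout should be checked against the UniGene build actually in use.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Captures the accession in an "ACC=" field; a trailing version such as ".1" is dropped.
ACC_RE = re.compile(r"ACC=([A-Za-z0-9_]+)")


def parse_unigene(lines: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Build accession -> cluster ID and cluster ID -> LocusID tables from record lines."""
    acc_to_cluster: Dict[str, str] = {}
    cluster_to_locus: Dict[str, str] = {}
    cluster: Optional[str] = None
    for line in lines:
        if line.startswith("//"):       # end of the current cluster record
            cluster = None
            continue
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) != 2:
            continue
        tag, value = parts
        if tag == "ID":
            cluster = value.strip()
        elif cluster is None:
            continue                    # ignore lines outside a record
        elif tag == "LOCUSLINK":
            # Assumption: if several IDs are listed, keep the first one.
            cluster_to_locus[cluster] = value.split(";")[0].strip()
        elif tag == "SEQUENCE":
            match = ACC_RE.search(value)
            if match:
                acc_to_cluster[match.group(1)] = cluster
    return acc_to_cluster, cluster_to_locus


# Hypothetical records in the assumed layout; not real UniGene data.
sample = """ID          Rn.1
TITLE       hypothetical example cluster
LOCUSLINK   11111
SEQUENCE    ACC=ACC001.1; NID=g1
SEQUENCE    ACC=ACC002; NID=g2
//
ID          Rn.2
SEQUENCE    ACC=ACC003.2; NID=g3
//
""".splitlines()

acc_to_cluster, cluster_to_locus = parse_unigene(sample)
print(acc_to_cluster)
print(cluster_to_locus)
```

Because the function consumes its input line by line, it can be passed an open file handle, so that only the two lookup tables, rather than the whole file, are held in memory.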
Microarray definition files listing each microarray element and its associated accession number and description were obtained from individual vendors through the ILSI consortium.
The Blast2 program (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html) was used to investigate the similarity and identity of various sequences at the protein level.

Results
Array annotation. The algorithm replaces sequence descriptions that are frequently minimal with biologically meaningful annotation. For example, elements originally annotated only as expressed sequence tags (ESTs) are identified as corresponding to Gstm2 and Lgals1 (Table 2). Importantly, in doing so the algorithm identifies multiple elements, including ESTs, that query the same locus. Examples given in Table 2 include cytochrome P450 1b1 (Cyp1b1), phosphodiesterase 4B (Pde4b), Cyp4a10, and endothelin receptor (Ednrb). Conversely, the algorithm can highlight elements that are incorrectly annotated. Thus, U39571, an element described as phosphatidylinositol ... Occasionally microarray elements are incorrectly annotated and grouped. Although both accession number X81395 and accession number U10697 were annotated as carboxylesterase 1 (Ces1) (presumably because of DNA sequence homology), the amino acid sequences are divergent enough to suggest that these are indeed two different proteins (data not shown). However, as UniGene clusters and LocusLink information are updated, incorrect groupings can be resolved. Thus, when UniGene and LocusLink information from September 2001 was used, X14552 (alpha-2µ globulin, type 1) and M83298 (phosphatase 2A 55-kD regulatory subunit alpha) were annotated as caldesmon (LocusID 25687), based on short sequence overlaps. When February 2002 UniGene and LocusLink data were used, the sequences identified by these GenBank accession numbers were distinguished from caldesmon (data not shown). As with any system using these resources, the annotation is only as current as the UniGene and LocusLink files used for input.
The XChipID represents the best available identifier for a given sequence element and as such offers a means to a) group elements that actually represent the same gene and b) estimate the number of unique genes queried by the microarray. Thus, the 8,740 elements on the Affymetrix RG_U34a array are estimated to query a total of 6,385 unique genes (Table 3). Of course, the actual sequence queried by each element is different, and as such, these sequences may have different hybridization characteristics and give rise to quantitatively different signals.
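Grouping elements by XChipID and counting the resulting groups can be sketched as below. The Python code is an illustration under stated assumptions, not ChipDupe2.pl: the function name group_by_xchip_id, the element names, and the XChipIDs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_xchip_id(annotation: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each XChipID to the list of array elements annotated with it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for element_id, xchip_id in annotation:
        groups[xchip_id].append(element_id)
    return dict(groups)


# Hypothetical (element, XChipID) pairs for one array.
annotation = [
    ("probe_01", "LL.11111"),
    ("probe_02", "LL.11111"),
    ("probe_03", "Rn.2"),
    ("probe_04", "Ac.ACC005"),
    ("probe_05", "Rn.2"),
]

groups = group_by_xchip_id(annotation)

# a) Elements sharing an XChipID are taken to query the same gene.
redundant = {xid: elements for xid, elements in groups.items() if len(elements) > 1}

# b) The number of distinct XChipIDs estimates the number of unique genes queried.
unique_genes = len(groups)

print(unique_genes, "unique genes from", len(annotation), "elements")
for xid, elements in sorted(redundant.items()):
    print(xid, elements)
```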

Identification of homologous targets across array platforms. Using the XChipID, one can determine genes queried in common by two different microarray platforms and compare results at a relatively simple level. Cross-array comparisons of the Affymetrix RG_U34a, the NIEHS 7K array (National Institute of Environmental Health Sciences, Research Triangle Park, NC), and the Clontech Atlas Tox2 arrays (Clontech, Palo Alto, CA, USA) indicate overlaps, as well as a substantial number of genes uniquely queried by each array (Figure 2). In fact, the three arrays query only 209 genes in common, and even the Clontech array queries a significant number of genes not queried by the other two arrays. On a case-by-case basis, the results for a given gene on one platform can be compared with those for the same gene on a different platform, using the XChipID (Thompson et al. 2004), taking into account that each platform may query the same gene more than once. It is critical to reiterate, however, that the quality and intensity of the signal from any given microarray element querying a given gene will depend on the sequence of that element, preparation of the target hybridization material, and technical aspects of the hybridization and signal processing. Furthermore, comparing platforms based on the XChipIDs depends on these platforms being annotated from the same input UniGene and LocusLink files. When these files are updated, the annotation process must be repeated for all platforms to be compared. Finally, comparing data from one array platform to another on a whole-array level is not a trivial effort, as the redundancy of genes queried on each platform creates what is called in database terminology a "many-to-many" relationship. XChipData.pl was designed to merge such data, and an example of the output from this program is given in Table 4.
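The overlap determination and the "many-to-many" merge can be illustrated with a short sketch. This Python code is not ChipCompare8.pl or XChipData.pl and does not reproduce the layout of Table 4; the function names, platform contents, and signal values are hypothetical. It assumes that each platform's data have already been keyed by XChipID using the same UniGene and LocusLink input files.

```python
from typing import Dict, List, Tuple

# XChipID -> list of (element ID, measured value) for one platform.
Platform = Dict[str, List[Tuple[str, float]]]


def compare_platforms(a: Platform, b: Platform) -> Tuple[List[str], List[str], List[str]]:
    """Return XChipIDs queried by both platforms, only by a, and only by b."""
    ids_a, ids_b = set(a), set(b)
    return sorted(ids_a & ids_b), sorted(ids_a - ids_b), sorted(ids_b - ids_a)


def merge_platforms(a: Platform, b: Platform) -> List[Tuple[str, str, float, str, float]]:
    """Pair every element of a with every element of b that shares an XChipID."""
    shared, _, _ = compare_platforms(a, b)
    rows = []
    for xid in shared:
        for elem_a, value_a in a[xid]:
            for elem_b, value_b in b[xid]:
                rows.append((xid, elem_a, value_a, elem_b, value_b))
    return rows


# Hypothetical data: platform A queries LL.11111 twice, platform B queries it three times.
platform_a: Platform = {
    "LL.11111": [("A_01", 2.1), ("A_02", 1.8)],
    "Rn.2": [("A_03", 0.9)],
}
platform_b: Platform = {
    "LL.11111": [("B_07", 2.4), ("B_08", 1.1), ("B_09", 2.0)],
    "Ac.ACC005": [("B_10", 1.0)],
}

shared, only_a, only_b = compare_platforms(platform_a, platform_b)
print("shared:", shared, "only A:", only_a, "only B:", only_b)
for row in merge_platforms(platform_a, platform_b):
    print("\t".join(str(field) for field in row))
```

A gene queried m times on one platform and n times on the other yields m x n rows in the merged output (here 2 x 3 = 6), which is why whole-array comparisons grow quickly with redundancy.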

Discussion
As microarrays are used more and more to investigate questions of biology and toxicology, a key technical issue becomes increasingly problematic: that of associating the signals from each microarray sequence element with the known literature and biological context associated with that sequence. This issue is complicated because element descriptions are current only at the time of array construction and must be updated to reflect evolving information on the gene associated with the element. Such information can include an updated description, a standard gene/locus name (White et al. 1999), and gene ontology information (Ashburner et al. 2000). Several automated annotation systems have been described, including the DRAGON system (Bouton and Pevsner 2000), the DAVID software (Dennis et al. 2003), and the NetAffx resource specifically for Affymetrix arrays (http://www.affymetrix.com; Liu et al. 2003). Information from this latter resource can be automatically retrieved using the ChipInfo software (http://biosun1.harvard.edu/complab/chipinfo/; Zhong et al. 2003). The XChipAnno script described here differs in that it is designed to create a single best annotation and an XChipID. Although conceptually simple, the XChipID does group elements that, by annotation, should be querying the same gene, and in doing so allows for comparison of data across a microarray, between different versions of a microarray, and between different microarray platforms. This annotation can be carried out on a regular basis as public database information is updated. In addition, this annotation procedure requires only the GenBank accession number for a microarray element, not the actual sequence, and does not require extensive computer resources. The RESOURCERER database (http://pga.tigr.org/tigr-scripts/nhgi_scripts/resourcerer.pl; Tsai et al. 2001) carries out a similar annotation approach using the TIGR Gene Indices and extends this cross-indexing across species. In contrast to XChipAnno, RESOURCERER focuses on a number of selected common microarray platforms and is accessible by a web interface.
A limitation of this approach, and any approach that groups accession numbers on the basis of UniGene clusters, is that any given build of UniGene may incorrectly cluster certain sequences. Sequence homology can cause closely related but nonidentical genes to cluster together and hence be given the same annotation by this approach. Thus, discordant results for microarray elements having the same annotation (i.e., XChipID) are best resolved by a rigorous BLAST comparison of element sequences with each other and with the target gene sequence. Although a BLAST comparison of each microarray element sequence with the entire sequence database is technically daunting, a pairwise comparison of such a sequence with a target sequence is straightforward using the LALIGN program (part of the FASTA package; ftp://ftp.virginia.edu/pub/fasta/; Chao et al. 1992) and could be automated as a quality control check for the annotation of the entire microarray.
Another serious limitation in comparing different microarray platforms is encountered if one array uses sequences from several species, for example, a rat cDNA-based microarray that includes mouse sequences. Although these sequences may hybridize with a rat transcript, annotation by this method is not feasible, as individual species are clustered in UniGene separately. Such cross-species comparisons are desirable but may be best handled by large public database resources that link individual sequences with genomic information (Mattes et al. 2004).
Although any automated procedure to group and annotate DNA sequences is inherently flawed by the absence of human wisdom, such an automated approach is simply required to handle the vast amount of information contained within and generated by microarray technology. The approaches described in this article do help reduce the complexity and redundancy of microarray annotation in a straightforward fashion. The files required by this approach are readily available, and the output files generated may be directly used and manipulated with a variety of software packages such as Excel, Access, or Spotfire. Although microarray results are always best considered on a sequence-by-sequence basis, global annotation procedures can offer a way to provide an initial sift and analysis of the data with biological context.

Figure 2. Overlaps of genes queried by different platforms, determined by ChipCompare.pl using the XChipID.