GeneHub-GEPIS is a web application that performs digital expression analysis in human and mouse tissues based on an integrated gene database. Using aggregated expressed sequence tag (EST) library information and EST counts, the application calculates the normalized gene expression levels across a large panel of normal and tumor tissues, thus providing rapid expression profiling for a given gene. The backend GeneHub component of the application contains pre-defined gene structures derived from mRNA transcript sequences from major databases and includes extensive cross references for commonly used gene identifiers. ESTs are then linked to genes based on their precise genomic locations as determined by GMAP. This genome-based approach reduces incorrect matches between ESTs and genes, thus minimizing the noise seen with previous tools. In addition, the gene-centric design makes it possible to add several important features, including text searching capabilities, the ability to accept diverse input values, expression analysis for microRNAs, basic gene annotation, batch analysis and linking between mouse and human genes. GeneHub-GEPIS is available at http://www.cgl.ucsf.edu/Research/genentech/genehub-gepis/ or http://


INTRODUCTION
Analysis of gene expression profiles across various tissue types is essential for understanding gene functions. Among the available expression data sources, expressed sequence tags (ESTs) have been valuable for rapid expression profiling. Based on the premise that EST clone frequency is proportional to the corresponding gene's expression level (1), we and others have developed algorithms and tools to perform expression analysis based on EST data (2)(3)(4)(5)(6)(7). Meanwhile, EST data continue to accumulate at a rapid pace, and there are a growing number of databases that organize general or species-specific EST information, including EST data for sea bass, wheat, chicken, pig and tomato (8)(9)(10)(11)(12). Despite the surge of recent progress in other species, the number of public EST entries for human and mouse still far exceeds that for any other species, based on the January 2007 summary from dbEST (http://www.ncbi.nlm.nih.gov/dbEST/). Since the reliability of EST-based expression analysis depends on the size of EST libraries, the human and mouse data remain an attractive source for expression analysis, and the tools built for analyzing these data will likely benefit expression analysis for other species.
We previously developed the GEPIS server that utilizes EST abundance information to calculate gene expression levels in a panel of normal and cancerous human tissues for a given input DNA sequence (7). We showed that such EST-based (or 'digitally' derived) expression units exhibit a linear correlation with TaqMan-determined expression levels. Since its release, the GEPIS server has provided expression results for over 30 000 requests by researchers from >60 countries. Despite its usefulness, GEPIS suffers from several limitations. The method relied on the BLAST program to assign EST sequences to a given input mRNA sequence. However, BLAST often erroneously links ESTs to input sequences due to high-percentage regional matches. As a result, a given EST could be matched to multiple genes, thus leading to miscalculated expression data. In addition, there were insufficient data for performing reliable analysis for mouse genes. The design of the system also did not allow easy development of new functionality commonly requested by users, such as URL linking to the expression results, text searching and display of detailed results.
*To whom correspondence should be addressed. Tel: +1 650 225 4293; Fax: 650 225 5389; Email: zemin@gene.com
© 2007 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
We have now developed a new web server, named GeneHub-GEPIS, which performs digital expression analysis based on an integrated database for human and mouse genes. One distinguishing characteristic of this tool is that ESTs are mapped to pre-defined gene structures along the genome. The GeneHub component of the application is designed to define gene boundaries based on mRNA transcript sequences from major databases and to establish extensive cross references for commonly used gene identifiers. Based on the precise genomic locations of ESTs, as determined by the GMAP algorithm (13), we link ESTs to genes for subsequent expression analysis. The new design offers several major advantages. First, this genome-based approach increases the accuracy of EST mapping to genes, thus enhancing the overall reliability of the EST-based expression values. Second, the new gene-centric design makes the system more extensible, so that we could easily add a new collection of genes, such as microRNAs, to the system. Third, the integrated gene database accepts text-based searches and diverse input values. In addition to DNA sequences, the input values can be identifiers from common gene databases or commercial microarray platforms. Fourth, the orthologous relationships stored in the GeneHub database allow easy navigation between mouse and human genes. Finally, the program provides basic information about input genes and allows direct linking to expression results from any web site. Meanwhile, we retained the useful features from the previous GEPIS application, such as the capability to draw a regional expression atlas for a given genomic region. Here, we present GeneHub-GEPIS as a new and useful tool for performing gene expression analysis across many normal and cancer tissues for both mouse and human genes.

Genome-guided gene definition and cross references
The genomic structures of protein-coding genes were first defined using transcripts from several reliable sources. The collection of such high-quality transcripts, which we also call the core gene set, contains mRNA sequences from RefSeq, the Known gene set of Ensembl genes, Proteome and FANTOM (mouse only) (see Supplementary Table 1 for details). Each of the core gene set sequences was mapped to its respective genome (human NCBI Build 36 and mouse NCBI Build 35) using GMAP (13), and only the genomic match with the highest percent identity and percent coverage was chosen. GMAP has been shown to provide very accurate mapping and alignment results for both mRNA transcripts and ESTs (13), but occasionally matches with a lower matching percentage can be found due to low-quality sequences. As a conservative precaution, we removed transcripts (and ESTs for a later step) with <90% coverage of the entire transcript or with <90% identity as measured by GMAP. For instance, about 0.4% of RefSeq sequences were filtered out during this step. For any two transcripts to be clustered into one gene, we required that their exon sequences overlap, be in the same orientation and share at least one exact exon boundary or splice site. The requirement for a shared exon boundary was used to limit the inclusion of antisense transcripts, since the orientation of these transcripts could occasionally be mis-annotated (14,15). Each group (or cluster) of transcripts was considered a 'GeneHub gene'. Using this approach, we defined 31 999 non-redundant genes for human and 34 794 for mouse. Overall, these GeneHub genes can be considered a superset of known genes from the above data sources.
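The filtering and clustering rules above can be illustrated with a short Python sketch. This is not the authors' code: the `Transcript` record, the function names and the use of union-find to merge transcripts transitively (single linkage) are assumptions made for illustration, and exon coordinates are taken to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one aligned transcript (best genomic match only)."""
    name: str
    chrom: str
    strand: str      # "+" or "-"
    exons: tuple     # ((start, end), ...), 1-based inclusive
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the transcript covered by the alignment


def passes_filter(t, min_identity=90.0, min_coverage=90.0):
    # Transcripts with <90% identity or <90% coverage are removed.
    return t.identity >= min_identity and t.coverage >= min_coverage


def same_gene(a, b):
    """Pairwise rule: overlapping exons, same orientation, one shared exon boundary."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return False
    exons_overlap = any(s1 <= e2 and s2 <= e1
                        for s1, e1 in a.exons for s2, e2 in b.exons)
    shared_start = {s for s, _ in a.exons} & {s for s, _ in b.exons}
    shared_end = {e for _, e in a.exons} & {e for _, e in b.exons}
    return exons_overlap and bool(shared_start or shared_end)


def cluster_transcripts(transcripts):
    """Group transcripts into gene clusters (assumed transitive merging)."""
    kept = [t for t in transcripts if passes_filter(t)]
    parent = list(range(len(kept)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if same_gene(kept[i], kept[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, t in enumerate(kept):
        groups.setdefault(find(i), []).append(t.name)
    return sorted(sorted(names) for names in groups.values())


example = [
    Transcript("t1", "chr1", "+", ((100, 200), (300, 400)), 99.0, 98.0),
    Transcript("t2", "chr1", "+", ((150, 200), (300, 450)), 99.5, 97.0),
    Transcript("t3", "chr1", "-", ((100, 200), (300, 400)), 99.0, 99.0),  # antisense
    Transcript("t4", "chr1", "+", ((120, 180),), 99.0, 99.0),  # no shared boundary
    Transcript("t5", "chr1", "+", ((100, 200), (300, 400)), 99.0, 80.0),  # low coverage
]
clusters = cluster_transcripts(example)
```

In this toy input, only `t1` and `t2` satisfy all three pairwise conditions, so they form one cluster, while the antisense transcript and the transcript without a shared boundary each stay separate. The pairwise loop is quadratic and is meant only to make the rule explicit.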
After the initial collection of human and mouse genes was obtained, additional sequences were mapped to the GeneHub genes. Transcripts from GenBank and the Ensembl Novel collection were mapped to the genomes with GMAP (13) and then compared with the GeneHub genes derived above. For a transcript to be linked to a GeneHub gene, at least one of its splicing junctions was required to match perfectly with those of the GeneHub gene. We next assigned microarray probe sequences (see Supplementary Table 1 for the full list) from commonly used commercial array platforms to GeneHub genes. For Affymetrix expression arrays, we used the target sequences obtained from Affymetrix to link to known genes. For Agilent oligo-based expression microarrays, we directly used the 60-mer oligonucleotide sequences for gene linking. Using GMAP (13), we determined whether the array probe sequences overlap with the exon sequences collected above. Next, for sequences that did not overlap with any exons, we examined whether they were located in the vicinity of any GeneHub genes. We assigned a probe sequence to the closest gene in the same orientation if the probe sequence was located within 5 kb of the 3′ end or 2 kb of the 5′ end of the gene.
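The proximity rule for probes that overlap no exon can be sketched as follows. The `Gene` tuple and function names are hypothetical, and two details are assumptions not stated in the text: distances are measured from the nearest probe edge to the gene end, and a probe lying inside a gene span (for example in an intron) is treated as distance zero.

```python
from typing import NamedTuple, Optional


class Gene(NamedTuple):
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int   # lowest genomic coordinate
    end: int     # highest genomic coordinate


def flank_distance(gene, p_start, p_end):
    """Return (distance, side), where side is '5p', '3p' or 'inside'."""
    if p_end < gene.start:  # probe lies at lower coordinates than the gene
        return gene.start - p_end, ("5p" if gene.strand == "+" else "3p")
    if p_start > gene.end:  # probe lies at higher coordinates than the gene
        return p_start - gene.end, ("3p" if gene.strand == "+" else "5p")
    return 0, "inside"


def assign_probe(genes, chrom, strand, p_start, p_end,
                 max_3p=5000, max_5p=2000) -> Optional[str]:
    """Pick the closest same-strand gene within 5 kb (3' side) or 2 kb (5' side)."""
    best = None
    for g in genes:
        if g.chrom != chrom or g.strand != strand:
            continue
        dist, side = flank_distance(g, p_start, p_end)
        limit = {"3p": max_3p, "5p": max_5p, "inside": 0}[side]
        if dist <= limit and (best is None or dist < best[0]):
            best = (dist, g.name)
    return best[1] if best else None


genes = [Gene("A", "chr1", "+", 10000, 20000),
         Gene("B", "chr1", "-", 30000, 40000)]
```

For a minus-strand gene the 3′ end is the lower coordinate, so the allowed window is mirrored; the strand check before measuring distance reflects the "same orientation" requirement.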
Additional methods were used for both human and mouse data to link other protein or DNA identifiers to GeneHub genes. First, for the roughly 1% of all transcript sequences that failed to align to the genome, we compared their sequences to the core transcript set using BLASTn, and required a perfect match over at least a 60-bp region with a transcript member of the GeneHub gene. Second, protein sequences from UniProt and PDB were compared with the GeneHub DNA sequences using the BLASTx program. For a protein record to be linked to a GeneHub gene, we required >98% identity over a region at least 35 amino acids long. The BLAST cutoffs were empirically determined so that false linking was minimized without over-sacrificing true signal. Third, we added links that were based on existing annotations. For example, the gene2refseq file downloaded from NCBI contains relationships between Entrez Gene records and RefSeq and GenBank accessions, so an Entrez Gene record could be linked to a GeneHub gene if a RefSeq or GenBank accession was already part of the GeneHub gene.

Gene annotation and ortholog linking
Once we built the GeneHub gene collection associated with gene and protein identifiers from various databases, it became straightforward to collect and integrate gene annotation information from multiple databases. Useful information such as gene description, accession, name and synonyms was extracted for each GeneHub gene and stored in a common database field for text searching and gene characterization purposes (Supplementary Figure 1).
The ortholog links between human and mouse GeneHub genes were based on the hmlg_ftp.txt file from HomoloGene (www.ncbi.nlm.nih.gov/HomoloGene/) Release 50.1. We used the orthologous Entrez Gene pairs of human and mouse if they were established by reciprocal best match between three or more organisms, or reciprocal best match, or sequence similarity with match identity >70%. We were able to link 15 868 human GeneHub genes to their mouse counterparts.

EST data collection and cleansing
EST data were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/dbEST/) and processed following a previous protocol to retain only non-normalized, non-subtracted usable cancer and normal libraries (7).

EST-based expression analysis
We first mapped EST sequences to their genomic locations using GMAP (13), followed by a secondary filtering step as described earlier (>90% identity and >90% coverage). To avoid ambiguity, we discarded the ESTs (1.3% of total) that were mapped to multiple genomic loci with identical percent identity and coverage. We also eliminated those ESTs (5.5% of total) with near-identical matches (with <2% difference in percent identity or coverage) to multiple genomic loci. The genomic coordinates were compared with the gene structure information (intron/exon boundaries) of GeneHub genes. An EST was considered to be the product of a gene if the two entities mapped to the same locus and shared at least one exon (with a minimum overlap of 30 bp). Multiple EST reads from the same clone were reduced to a single read if clone information was available. The digital expression unit (DEU) for a given gene in each tissue category is defined as the number of matching EST clones from a normalized library size of 1 million. The DEU values were calculated iteratively for each tissue type to profile expression levels across all tissues. The Z-test was applied to determine whether DEUs in two samples were statistically different using a method described previously (7). Comparisons were made between normal and cancer samples from the same tissue, and across different types of tissues. To improve the efficiency of the GeneHub-GEPIS program, the expression result for each GeneHub gene was pre-computed for fast access.
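A minimal sketch of the exon-overlap rule, clone collapsing and DEU normalization is given below. All function names are hypothetical. The exact form of the Z-test is described in reference (7) and is not reproduced in this section, so the pooled two-proportion z statistic used here is an assumption chosen for illustration, not necessarily the published formula.

```python
import math


def est_matches_gene(est_blocks, gene_exons, min_overlap=30):
    """True if any aligned EST block overlaps any exon by >= min_overlap bp.

    Coordinates are assumed 1-based inclusive, and the caller is assumed to
    have already restricted the comparison to the same locus.
    """
    return any(min(e1, e2) - max(s1, s2) + 1 >= min_overlap
               for s1, e1 in est_blocks for s2, e2 in gene_exons)


def count_clones(ests):
    """Count ESTs, collapsing reads that share a clone ID.

    `ests` is an iterable of (est_id, clone_id or None); reads without clone
    information are counted individually.
    """
    clones, unlabelled = set(), 0
    for _est_id, clone_id in ests:
        if clone_id is None:
            unlabelled += 1
        else:
            clones.add(clone_id)
    return len(clones) + unlabelled


def deu(clone_count, library_size, scale=1_000_000):
    """Digital expression unit: matching clones per million library clones."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return clone_count * scale / library_size


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic (an assumed stand-in for the Z-test)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0
    return (x1 / n1 - x2 / n2) / se


# Toy numbers: 25 matching clones in a pooled tissue library of 500 000 clones.
normal_deu = deu(25, 500_000)
z = two_proportion_z(50, 100_000, 10, 100_000)
```

Under these toy counts, the normalization simply rescales 25 clones in 500 000 to a per-million value, and the z statistic compares the two raw proportions rather than the rescaled DEUs, so library size enters the standard error directly.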
For 1000 randomly selected human genes, we compared how ESTs were mapped to genes by BLAST (98% identity over a >60 bp region) or by the above method. For 74% of tested genes, BLAST alone would identify at least one incorrect EST, as judged by its genomic location. By design, the new method disallows this type of erroneous mapping. It is worth noting that the median DEU level for human genes in normal tissues is 50.0 based on our new method, as compared to a median level of 70.0 derived from the BLAST-based EST-mapping approach (7). The average number of ESTs mapped to a gene also fell from 128.7 to 107.5 (a 16% reduction). The significant reduction of ESTs mapped to genes reflects the high accuracy of EST mapping by GMAP (13) and the rejection of ESTs with promiscuous matches. Manual review of EST mapping results for randomly sampled genes also confirmed the much improved mapping quality. As a confirmation step, we compared our results with a set of EST-gene mapping data independently generated by the UCSC Genome Browser team, and we found a concordance of >97%. The UCSC data are based on BLAT (16) and a series of filtering steps, and are used for the EST alignment track in the UCSC genome browser.
MicroRNA expression analysis is based on the observation that miRNA precursor sequences can be found among ESTs (17,18), and that pre-miRNA expression levels correlate with mature miRNA expression levels (19). Given this relationship, we could use EST data to approximate miRNA expression levels in various tissues. We first collected genomic locations of the miRNA stem-loop sequences from version 9.0 of miRBase (20), and then obtained all EST sequences that had any overlap with the miRNA stem-loops for expression analysis as described earlier.
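The "any overlap" criterion for miRNA stem-loops amounts to a simple interval test, sketched below. The tuple layouts and names are hypothetical, coordinates are assumed 1-based inclusive, and strand is ignored here because the text does not say whether orientation was checked.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def ests_for_stem_loop(stem_loop, est_alignments):
    """Return names of ESTs overlapping a stem-loop on the same chromosome.

    stem_loop: (chrom, start, end)
    est_alignments: iterable of (est_name, chrom, start, end)
    """
    chrom, start, end = stem_loop
    return [name for name, c, s, e in est_alignments
            if c == chrom and intervals_overlap(s, e, start, end)]


# Toy coordinates, not taken from miRBase.
stem_loop = ("chr9", 1000, 1085)
alignments = [
    ("est1", "chr9", 900, 1000),   # touches the first base of the stem-loop
    ("est2", "chr9", 1086, 1400),  # starts one base past the stem-loop
    ("est3", "chr8", 1000, 1085),  # same coordinates, different chromosome
    ("est4", "chr9", 1020, 1060),  # fully contained
]
hits = ests_for_stem_loop(stem_loop, alignments)
```

The selected ESTs would then be counted per tissue library and normalized in the same way as for protein-coding genes.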
The Regional GEPIS Atlas, which depicts the expression level of all genes in selected tissues for a given genomic window, was created in a similar fashion to that described previously (7), except that we stored the genomic coordinates for all genes in a MySQL database instead of in plain text files.

Application implementation
The web front-end was written in HTML, JavaScript, CSS and Perl CGI. The Perl template module (HTML::Template) was used to achieve a consistent look across web pages. Ajax was used to make the application more interactive, so that text searches could be performed without leaving the input web page. We used the Prototype JavaScript library (http://www.prototypejs.org/) to implement Ajax calls. This library supports Ajax interactions and provides utility functions for accessing page components and manipulating the DOM. A MySQL database was used for data storage and retrieval (Supplementary Figure 1). For text-based searching, the query string can be a record identifier (e.g. accession) or gene name. The program queries the DBXREF, GENE and GENE_SYNONYMS tables in sequential order to find the best match from the selected target species for the given query, regardless of the species of the input record. The text search is case-insensitive, and a begins-with (prefix) search is automatically performed if no exact match is found. All of the source code is available upon request.
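The lookup order described above can be sketched in Python with in-memory stand-ins for the three tables. Only the table names come from the text; the row layout, the gene identifiers and the choice to try exact matches in all three tables before falling back to prefix matches are assumptions, and the target-species filter is omitted for brevity.

```python
def text_search(query, tables):
    """Case-insensitive lookup: exact match first, then begins-with match.

    `tables` is an ordered list of (table_name, rows), where each row is a
    hypothetical (search_key, gene_id) pair. Returns (table_name, gene_ids)
    for the first table that yields a hit, or (None, []) if nothing matches.
    """
    q = query.strip().lower()
    matchers = (
        lambda key: key == q,           # pass 1: exact match
        lambda key: key.startswith(q),  # pass 2: begins-with match
    )
    for matches in matchers:
        for table_name, rows in tables:
            hits = sorted({gene_id for key, gene_id in rows
                           if matches(key.lower())})
            if hits:
                return table_name, hits
    return None, []


# Toy rows with made-up accessions and gene identifiers.
tables = [
    ("DBXREF", [("ACC0001", "GH1"), ("ACC0002", "GH2")]),
    ("GENE", [("MET", "GH1"), ("METAP1", "GH2")]),
    ("GENE_SYNONYMS", [("HGFR", "GH1"), ("c-Met", "GH1")]),
]
```

With these rows, the query `met` resolves by exact match in the GENE table, whereas `meta` has no exact match anywhere and falls through to the prefix pass. In the actual server these lookups would be SQL queries against MySQL; the pure-Python version only makes the ordering explicit.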

PROGRAM DESCRIPTION
GeneHub-GEPIS is a tool for inferring human and mouse gene expression patterns based on normalized EST abundance in various normal and cancerous tissues. The design of the system is depicted in Figure 1. The application is composed of two parts: a front-end web interface for user input, data retrieval, display and download, and a backend engine to perform GeneHub-GEPIS analysis and data storage. The backend expression analysis relies on an integrated gene database we constructed that stores gene definitions and cross-references. Much of the documentation of this application is provided in the form of FAQs so that answers to commonly asked questions are provided on the spot.

Data input
GeneHub-GEPIS supports both text- and cDNA sequence-based data retrieval (Figure 2). For text, the application allows diverse input values ranging from gene symbols to various accession identifiers. It currently supports identifiers from common gene databases (GenBank, RefSeq, Ensembl, FANTOM, Entrez Gene, UniGene, miRBase), protein databases (PDB, UniProt) and commercial microarray platforms (Affymetrix and Agilent). For sequence-based retrieval, users can either upload a single-sequence FASTA file to the web server or paste a nucleotide sequence in a text box. Users can also search for either human or mouse genes regardless of the origin species of the input text and sequences. If the target species is different from the species of the input value, an ortholog search is automatically performed. The application also allows users to limit their search to a specified chromosome to avoid possible multiple matches to the input text.
It is worth noting that GeneHub-GEPIS allows direct URL access from any web page, with a gene symbol or accession as the argument. For example, when retrieving mouse c-Met results, the URL is: http://www.cgl.ucsf.edu/cgi-bin/genentech/genehub-gepis/web_search.pl?intype=1&xrefid=cMet&species=mouse.
To obtain results for RefSeq identifier NM_001260 (human ERBB2), the URL is: http://www.cgl.ucsf.edu/cgi-bin/genentech/genehub-gepis/web_search.pl?intype=1&xrefid=NM_001260&species=human. Using this feature, some of the gene-based web servers, such as the widely accessed UCSC Known Genes web pages, have created links to GeneHub-GEPIS results. This is important as it can greatly increase the use of this server.
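Links of this form can be generated programmatically. The sketch below uses only the Python standard library; the base URL and the parameter names `intype`, `xrefid` and `species` are taken from the examples above, while the function name and the species check are illustrative additions. The meaning of `intype=1` is not documented in this section, so it is passed through unchanged.

```python
from urllib.parse import urlencode

BASE_URL = ("http://www.cgl.ucsf.edu/cgi-bin/genentech/"
            "genehub-gepis/web_search.pl")


def gepis_url(xrefid, species, intype=1):
    """Build a direct-access URL for a gene symbol or accession."""
    if species not in ("human", "mouse"):
        raise ValueError("species must be 'human' or 'mouse'")
    # urlencode escapes any characters that are unsafe in a query string.
    query = urlencode({"intype": intype, "xrefid": xrefid, "species": species})
    return f"{BASE_URL}?{query}"


mouse_met = gepis_url("cMet", "mouse")
human_refseq = gepis_url("NM_001260", "human")
```

Encoding the query with `urlencode` rather than string concatenation keeps the link valid if an identifier contains characters such as spaces or ampersands.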

Program output
Once a unique gene match is found among the pre-defined GeneHub genes, the program directly retrieves the pre-computed EST-based gene expression results. The initial result page provides navigation links to download the result, display EST hits by libraries, and view additional graphic charts. Expression data are displayed in both a tabular format and a graphic chart (Figure 3A). The program also retrieves and displays information about input genes, such as gene names, synonyms, description and genomic locations, and provides links to the UCSC genome browser and other web resources such as GEO microarray results (Figure 3B). Since it is often desirable to examine the expression pattern of an orthologous gene, we provide links that allow users to quickly navigate between the result pages of human and mouse ortholog pairs. In addition, users can specify a genomic region and tissues of interest to get a Regional GEPIS Atlas view, which displays the expression values of all of the neighboring genes (Supplementary Figure 2). We also implemented a frequently requested feature that displays library information and the number of EST hits in each library (Supplementary Figure 3). The libraries are grouped by their tissue type. Due to space considerations, only libraries with matched ESTs are displayed on the web, but the full list is available for download. The EST detail page can also be bookmarked.
For text-based searching, when multiple gene matches are found, the server displays the list of all matched genes with basic information, such as gene names and genomic locations, so that users can click one of the genes for further analysis.
For sequence-based searching, the program first tries to match the input sequence with our pre-defined gene collection of the same species using BLAST (>60 bp match with >98% identity, and the top hit chosen). If a match is found in the same species, the program directly reports the matched gene's expression data. If the matched sequence is from a species different from that of the input sequence, the program queries a backend table to identify its orthologous gene for subsequent result display. In rare cases where the input sequence fails to match any pre-defined genes, the program resorts to a secondary BLAST search against the EST sequence database directly, and matched ESTs are used to compute expression results on the fly. In this case, we used the method described previously (7), and no gene annotation will be available for display.
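The hit-selection step and the fallback decision can be outlined as follows. This is a sketch under stated assumptions: each BLAST hit is represented as a hypothetical dictionary with `gene`, `length`, `identity` and `score` keys, and "top hit" is interpreted as the passing hit with the highest score, which the text does not specify.

```python
def pick_gene_hit(hits, min_length=60, min_identity=98.0):
    """Return the gene of the best hit with >60 bp and >98% identity, else None."""
    passing = [h for h in hits
               if h["length"] > min_length and h["identity"] > min_identity]
    if not passing:
        return None
    return max(passing, key=lambda h: h["score"])["gene"]


def route_sequence_query(hits):
    """Decide between pre-computed results and the on-the-fly EST search."""
    gene = pick_gene_hit(hits)
    if gene is not None:
        return ("precomputed", gene)
    # No pre-defined gene matched: fall back to a direct EST database search.
    return ("est_search", None)


# Toy hits with made-up gene identifiers.
hits = [
    {"gene": "GH1", "length": 450, "identity": 99.6, "score": 820.0},
    {"gene": "GH2", "length": 900, "identity": 97.5, "score": 1500.0},  # identity too low
    {"gene": "GH3", "length": 55, "identity": 100.0, "score": 110.0},   # match too short
]
```

In the toy input, the highest-scoring hit is rejected because its identity falls below the threshold, which shows why the filter is applied before ranking.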

Batch analysis and data download
To facilitate large-scale analysis, we implemented a batch analysis function and allow download of all backend data. For batch analysis, a text field is provided for pasting in a list of gene identifiers, which can be a mix of gene symbols and different types of accession numbers. For example, the input can be a list of Affymetrix microarray probe IDs supplemented with additional gene names. Upon submission, the program produces paginated result pages showing expression results for each of the input genes in a concatenated tabular format. Links to detailed information and graphical displays are provided for each gene. These tabular data, along with a detailed breakdown of EST library information, can be downloaded in text files for further study.
Figure 2. Screenshot of GeneHub-GEPIS web input interface. Users can perform either text- or sequence-based search for a targeted organism, with an option to limit the search to a specified chromosome.
Since the backend data can be potentially useful for other purposes, we provide a download page where all backend data can be retrieved. These include gene mapping and cross-reference data, exon and boundary definitions, EST mapping and associated library information, pre-computed expression results for all pre-defined genes, and detailed EST library distribution information for each gene. Such files can be used by power users for global surveys of expression across a large number of genes.

CONCLUDING REMARKS
Despite the increasing prevalence of microarray data, ESTs remain a significant source of data for expression analysis, and can provide benefits over microarrays in some cases (7). However, the value of EST data can only be fully realized with the availability of powerful and user-friendly tools that transform loosely organized EST information into meaningful expression results. Guided by input from the user community, we aimed to make GeneHub-GEPIS reliable, powerful, easy to use and widely accessible. At this point, GeneHub-GEPIS can report estimated expression levels in about 40 different types of normal and cancerous tissues for a given gene or a list of genes. As more EST data become available, it will be possible to analyze gene expression in more detailed tissue subtypes and for additional organisms. We have noticed a dramatic increase of traffic to our web server since the release of GeneHub-GEPIS, and we hope that GeneHub-GEPIS will stimulate greater usage of EST data and perhaps additional software development in this area.
Figure 3. (A) The bar chart displays the normal (blue) and tumor (yellow) DEU values of each type of tissue, and the table shows numeric data and statistics. The user can select tissues and specify a genomic range in this page to draw a Regional GEPIS Atlas. (B) The gene summary section provides a short description of gene function, species, genomic coordinates and synonyms. It provides a link to navigate to the result for a gene's ortholog, links to download results, view EST hits by libraries, and view additional graphic charts, and links to additional web resources such as the UCSC genome browser.