Computational gene prediction using multiple sources of evidence

Methods
    
    Computational Gene Prediction Using Multiple Sources of Evidence
    Jonathan E. Allen,1,2,3 Mihaela Pertea,1 and Steven L. Salzberg1,2
    1The Institute for Genomic Research, Rockville, Maryland 20850, USA; 2Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA

    This article describes a computational method to construct gene models by using evidence generated from a diverse set of sources, including those typical of a genome annotation pipeline. The program, called Combiner, takes as input a genomic sequence and the locations of gene predictions from ab initio gene finders, protein sequence alignments, expressed sequence tag and cDNA alignments, splice site predictions, and other evidence. Three different algorithms for combining evidence in the Combiner were implemented and tested on 1783 confirmed genes in Arabidopsis thaliana. Our results show that combining gene prediction evidence consistently outperforms even the best individual gene finder and, in some cases, can produce dramatic improvements in sensitivity and specificity.
    Computational identification of complete gene models in eukaryote genomes remains a challenging task (Zhang 2002). In the Arabidopsis genome project (The Arabidopsis Genome Initiative 2000), human experts integrated the output of different gene prediction programs with sequence homology data from searches of protein and transcript databases to construct the published gene models. Difficulties in creating accurate annotation arise for a variety of reasons. Sometimes the evidence for a gene is weak, consisting of just one gene prediction but no sequence homology, or just a single expressed sequence tag (EST) match. In other cases, the evidence is plentiful but contradictory: Different gene finders and protein sequence alignments may indicate many overlapping candidate genes, and more than one of these models may in fact be correct. Thus, the process of constructing final gene predictions, even with a human curator in the loop, is time-consuming and fraught with opportunities for errors. For these reasons, we have developed a computational method that evaluates much of the same information human annotators use, as a means of creating gene models that are both more accurate and more consistent than can be derived from any single computational gene prediction algorithm. This article describes the Combiner program, a statistical algorithm that uses the output from other annotation software to improve the accuracy of predicted genes while maintaining a good balance between sensitivity (the fraction of true genes detected) and specificity (the fraction of gene predictions that are correct). Other efforts to combine gene model evidence have primarily focused on integrating the output from gene prediction programs (Murakami and Takagi 1998; Pavlovic et al. 2002; Rogic et al. 2002).
The Combiner algorithm integrates multiple gene prediction programs plus much of the evidence available in a typical annotation pipeline, including evidence from proteins, ESTs, cDNAs, and splice site predictions. Other approaches to incorporating multiple evidence types can be found in the EuGène (Schiex et al. 2001) and GAZE (Howe et al. 2002) systems.

    3Corresponding author. E-MAIL jallen@tigr.org; FAX (301) 838-0208. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1562804.

    We tested three algorithms for use in the Combiner. The first algorithm is a simple Linear Combiner (LC1) that uses a voting function to combine multiple gene prediction programs. Each gene finder is given equal weight, that is, one vote, in LC1. The algorithm first identifies all possible disjoint intervals separated by signals such as splice sites or start codons, similar to signal-based ab initio gene prediction (Salzberg et al. 1998). A dynamic programming algorithm constructs gene models from candidate signals predicted by any of the gene prediction programs, systematically enumerating all possible combinations of signals (start, stop, donor, and acceptor sites). The sequence intervals between these signals are voted on by the gene finders. Each gene finder must vote for either coding or noncoding, and the highest-scoring combination of intervals is pieced together to form a gene model.

    The second Combiner (LC2) uses a dynamic programming algorithm similar to that of LC1, but with two significant enhancements. First, it adds sequence alignments (both DNA and protein) and splice site prediction programs to the inputs. Second, it uses different weights for the different forms of evidence. Aligning genomic sequence to a transcript or protein database produces matches with widely varying amounts of similarity. Clearly, the similarity of each alignment should be a factor in scoring the quality of a predicted exon. However, the degree of similarity of a match does not directly translate into the likelihood that a region encodes a protein, and the alignment algorithms themselves (e.g., BLAST; Altschul et al. 1990) do not produce such likelihoods. Further complicating matters, LC2 uses splice site predictions to indicate potential exon boundary signals, but because it scores intervals (not points between intervals), it does not include the splice site prediction in the linear weighting function.

    To address these limitations, we developed a statistical scoring method that uses decision trees (in particular, the randomized oblique decision trees of Murthy et al. 1994) to correlate evidence patterns with candidate gene models. This Statistical Combiner (SC) uses the confidence scores output by the gene finders themselves (when available), which were also used by Rogic et al. (2002) to combine outputs from two gene finders. Instead of a simple linear function combining all the inputs, SC builds a nonlinear model based on a decision tree. A description of each approach is given in the Methods section.
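As a rough illustration of the LC1 voting scheme outlined above, the following Python sketch links candidate signals by dynamic programming and scores each intervening interval by counting the bases on which the gene finders agree with the interval's label. It is a simplified sketch and not the Combiner's actual code: the names `LINKS`, `votes`, and `best_gene_model` are hypothetical, reading frames and strands are ignored, each gene finder is represented only as a set of positions it calls coding, and the optional `weights` argument merely hints at how unequal weights (as in LC2) could enter the sum.

```python
from typing import List, Optional, Sequence, Set, Tuple

Signal = Tuple[int, str]  # (position, signal type)

# Which signal may follow which, and how the interval between them is
# labelled. Reading frames and strands are ignored in this sketch.
LINKS = {
    ("start", "donor"): "coding",
    ("start", "stop"): "coding",
    ("acceptor", "donor"): "coding",
    ("acceptor", "stop"): "coding",
    ("donor", "acceptor"): "noncoding",
}


def votes(coding: Set[int], x: int, y: int, label: str) -> int:
    """Bases in [x, y) that one gene finder labels the same way as `label`."""
    n_coding = sum(1 for p in range(x, y) if p in coding)
    return n_coding if label == "coding" else (y - x) - n_coding


def best_gene_model(
    signals: Sequence[Signal],
    finders: Sequence[Set[int]],
    weights: Optional[Sequence[float]] = None,
) -> List[Signal]:
    """Return the highest-scoring start..stop chain of signals."""
    sigs = sorted(signals)
    w = list(weights) if weights is not None else [1.0] * len(finders)
    score = {}  # signal index -> best score of a partial model ending there
    back = {}   # signal index -> previous signal index in that partial model
    for j, (y, ty) in enumerate(sigs):
        if ty == "start":
            score[j], back[j] = 0.0, None
            continue
        for i in range(j):
            x, tx = sigs[i]
            label = LINKS.get((tx, ty))
            if label is None or i not in score:
                continue  # link not biologically meaningful or unreachable
            s = score[i] + sum(
                wk * votes(c, x, y, label) for wk, c in zip(w, finders)
            )
            if j not in score or s > score[j]:
                score[j], back[j] = s, i
    stops = [j for j, (_, t) in enumerate(sigs) if t == "stop" and j in score]
    if not stops:
        return []
    j = max(stops, key=lambda idx: score[idx])
    path = []
    while j is not None:  # trace the best parse back to its start codon
        path.append(sigs[j])
        j = back[j]
    return path[::-1]
```

With equal weights every gene finder casts one vote per base, so a splice site supported by two of three gene finders wins over one supported by a single program. The per-interval base counting here is deliberately naive; a real implementation would precompute cumulative counts.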
    
    RESULTS
    The three Combiners were tested on a data set of 1783 cDNA confirmed genes in Arabidopsis thaliana. These reference genes are based on full-length cDNA sequences that have been aligned to the genome and subsequently examined and defined by human annotators (Haas et al. 2002). This carefully curated data set provides a large and reliable source for evaluating the accuracy of
    
    
    Genome Research
    www.genome.org
    
    14:142–148 ©2004 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/04; www.genome.org
    
    
    our methods. The results are divided into two sets based on the type of evidence used. The first set consists of gene prediction programs trained prior to the availability of the test genes and includes GlimmerM (Pertea and Salzberg 2002), GeneMark.hmm (Lukashin and Borodovsky 1998), Genscan+ (Burge and Karlin 1997), and GeneSplicer (Pertea et al. 2001). The second test set includes recent TwinScan (Flicek et al. 2003) predictions available at http://genes.cs.wustl.edu/predictions/Arabidopsis/ contig 04 23 03 and a newer version of GlimmerM that includes improved modules to better detect translation start sites and polyadenylation sites. In both test sets, the Combiners using homology data take input from alignments between the genomic DNA and protein sequences from a non-redundant amino acid database aligned by using the dps and nap programs (Huang et al. 1997), as well as gene transcripts (including both EST and cDNA sequences) from the TIGR gene index database (Quackenbush et al. 2001) aligned using dds and gap (Huang et al. 1997). The alignment data was filtered to remove proteins, ESTs, and cDNAs that are included in the 1783 genes and might therefore bias the Combiner’s results. We removed all sequences that align with 100% identity regardless of the source of these sequences. All programs were run on 515 bacterial artificial chromosomes (BACs), each ∼100 kbp long, which collectively span roughly one third of the Arabidopsis genome. The three Combiners were run on each BAC, using the evidence from the other software programs as input, and the predictions were compared with the locations on the BAC corresponding to the 1783 cDNA confirmed genes. The test genes specify complete coding regions from the start codon to the stop codon (possibly including introns) and can occur on any portion of the BAC sequence.
Our primary purpose in these tests is to take an existing set of predictions from gene finders along with the output from other gene evidence and see if we can combine their output to produce a more accurate set of gene models. Because our test genes are already known from cDNA evidence, the tests may favor the Combiners using sequence alignment evidence, particularly if these genes are more frequently expressed. Therefore, in addition to considering the results of combining only gene finders in LC1, we also report results from the SC using only the gene finders and splice site prediction program as input.
    
    Figure 1 The number of correct and incorrect (number in parentheses) whole gene model predictions shared among the three prediction programs: GlimmerM (GA), Genscan+ (GS), and GeneMark.hmm (GM) from a test set of 1783 genes. Incorrect gene refers to cases in which all coding exons in the gene are in perfect agreement among the gene finders but not with the true gene.
    
    Test Set 1
    Our initial test used three single-organism gene finders as inputs: Genscan, GeneMark.hmm, and GlimmerM. Figure 1 shows the overlap among correctly predicted gene models from each of the gene finders, in which “correct” is defined to mean that all coding exons were in perfect agreement with the true gene. Only 178 (10%) of the genes were correctly predicted by all three methods.
    
    As the figure makes clear, each of the three gene finders has a set of genes for which it is the only correct method. The most accurate single program, GeneMark.hmm, predicts 799 of 1783 genes (45%), but if we could produce an oracle that always chose the best model from the three gene finders, then it would make 1205 (68%) correct predictions. Thus, if the Combiner can cleverly pick a gene model from among the three gene prediction programs, it should be able to improve on the performance of all three. Results for the four Combiners—LC1 and SC-g using only gene finders as input and LC2 and SC using gene finders plus homology data—are listed in Table 1. Because both SC and SC-g require a training set, we trained them by using a separate set of 380 genes (none included in the 1783 test set) from 120 BACs. All genes in both training and test sets were confirmed by full-length cDNA sequences that had been mapped to the genome (Haas et al. 2002), currently the gold standard for determining the correctness of a gene. For this initial test, the SC gets 66% of the test gene models exactly correct in contrast to the best individual gene finder,
    
    Table 1. Performance of the Gene Predictors on 1783 Genes

            CG            MG         CE            ME            WE    SN
    SC      1179 (66%)    31 (2%)    6625 (88%)    311 (4%)      27    97%
    SC-g    1114 (63%)    12 (1%)    6580 (88%)    220 (3%)      71    98%
    LC2     1078 (61%)    19 (1%)    6468 (86%)    298 (4%)      44    98%
    LC1      967 (54%)    21 (1%)    6323 (84%)    300 (4%)      57    97%
    GM       799 (45%)    31 (2%)    5947 (79%)    363 (5%)      95    96%
    GS       650 (37%)    43 (2%)    5680 (75%)    722 (10%)     41    92%
    GA       557 (31%)    39 (2%)    4610 (61%)   1183 (16%)    415    85%
    
    SC indicates Statistical Combiner; SC-g, SC combining gene prediction programs only; LC2, Linear Combiner using sequence alignments; LC1, Linear Combiner using gene prediction programs only; GA, GlimmerM; GM, GeneMark.hmm; and GS, Genscan+. The columns are the number of whole genes correctly predicted (CG), the number of genes completely missed (MG), the number of correctly predicted exons out of the 7510 total (CE), the number of exons completely missed (ME), the number of predicted exons that overlap a gene region but do not overlap a true exon (WE), and the percentage of protein coding nucleotides correctly detected (SN).
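To make the exon-level and nucleotide-level measures in Table 1 concrete, here is a small Python sketch that computes them for a single gene from exon coordinates. It is an illustration under stated assumptions rather than the evaluation code used in the study: the function names are hypothetical, exons are inclusive `(start, end)` pairs on one strand, the true gene is assumed to have at least one exon, and "wrong exons" are counted without the additional restriction to predicted exons that fall within a gene region.

```python
from typing import Dict, Iterable, Tuple

Exon = Tuple[int, int]  # inclusive (start, end) coordinates


def overlaps(a: Exon, b: Exon) -> bool:
    """True if the two exons share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def gene_metrics(true_exons: Iterable[Exon], pred_exons: Iterable[Exon]) -> Dict[str, float]:
    true_set, pred_set = set(true_exons), set(pred_exons)
    true_bases = {p for s, e in true_set for p in range(s, e + 1)}
    pred_bases = {p for s, e in pred_set for p in range(s, e + 1)}
    return {
        # whole gene correct: every exon boundary matches exactly
        "gene_correct": true_set == pred_set,
        # correct exon: both ends agree with a true exon
        "correct_exons": len(true_set & pred_set),
        # missed exon: no predicted exon overlaps it by even one base
        "missed_exons": sum(
            not any(overlaps(t, p) for p in pred_set) for t in true_set
        ),
        # wrong exon: predicted exon overlapping no true exon
        "wrong_exons": sum(
            not any(overlaps(p, t) for t in true_set) for p in pred_set
        ),
        # nucleotide sensitivity: fraction of true coding bases predicted coding
        "nucleotide_sn": len(true_bases & pred_bases) / len(true_bases),
    }
```

Summing the per-gene counts over all test genes would give totals comparable in spirit to the CE, ME, and WE columns.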
    
    
    Allen et al.
    
    Table 2. Breakdown of Combiner Predictions When Matching Exactly Three, Two, One, or Zero Gene Prediction Programs

    Combiner    No. of GP    CG     WG     CG/(CG + WG) (%)
    SC          3            178      5    97
    SC-g        3            178      5    97
    LC2         3            178      5    97
    LC1         3            178      5    97
    SC          2            440     87    84
    SC-g        2            417     70    86
    LC2         2            418     76    83
    LC1         2            401     94    81
    SC          1            419    229    65
    SC-g        1            395    308    56
    LC2         1            363    263    58
    LC1         1            307    286    52
    SC          0            142    253    36
    SC-g        0            124    282    31
    LC2         0            119    348    26
    LC1         0             81    412    16
    
    consensus. Interestingly, there are five genes for which all gene finders agree but for which the reference cDNA alignment indicates a different gene model.) But even when they agree with just one of the gene finders and disagree with the other two, the Combiners get 52% (LC1) to 65% (SC) of the predictions correct. All four Combiners also correctly predict some genes that are missed by all three gene prediction programs. This is possible because the Combiners are able to piece together parts of a gene model from the different inputs to make a new model. As Table 2 shows, LC2, LC1, and SC-g are competitive with SC when at least two gene prediction programs agree with one another, but SC has a distinct advantage when only one or none of the gene finders makes a correct prediction.
    
    Test Set 2
    We constructed a second set of comparisons by adding the TwinScan algorithm, which, unlike any of the other gene finders, uses sequence homology with a related species to inform its gene finding. By using alignments to Brassica oleracea, TwinScan is able to achieve substantial improvements over the best of the gene finders in our original set. Our hypothesis was that the Combiner should improve still further, using the better predictions from TwinScan to boost its performance over the first round of tests. For this test, we used up to five gene finders as inputs: the three from the first test, TwinScan, and a newer version of GlimmerM. Similar to what is shown in the Venn diagram in Figure 1, all five prediction programs predict a different set of correct gene models. Table 3 lists the number of gene models each gene prediction program exclusively identified correctly. In total, 1496 of the 1783 gene models are identified correctly by at least one of the five gene prediction programs. Therefore, each prediction source provides potentially useful information. Because both LC1 and LC2 combine each evidence source by using a linear sum of weights, we expected that they might have difficulty combining the outputs from the highly correlated versions of GlimmerM. Results are summarized in Table 4. Again, we include a version of SC, which combines only the gene finders and splice prediction program. TwinScan is the most accurate gene prediction program, with performance that compares favorably to the best Combiner results from the first experiment (Table 1). TwinScan predicts 67% of the gene models and 87% of the exons correctly, completely missing just 17 genes. With the advantage gained from having TwinScan as input, the SC jumps to 78% (1385) correct gene models and 93% correct exons. The SC using only gene finders (SC-5g) also improves, getting 75% of the gene models correct and 92% correct exons. 
Both LC1 and LC2 improve by using TwinScan, with 68% and 73% of the gene models correct, respectively. When one of the gene finders is clearly superior to the others, the Combiner should give it a higher weight. This happens automatically in the training of SC, but not in LC1 or LC2. Both LC1 and LC2 would likely improve substantially here if the weights were better tuned to reflect the relative performance
    
    The first column (Combiner) refers to the four Combiners. The second column refers to the number of matching gene prediction programs. The third and fourth columns count the number of times the Combiner prediction is correct (CG) and not entirely correct (WG). The fifth column is the percentage of correct predictions.
    
    GeneMark.hmm, which gets 45% correct. (Here “exactly correct” means that the entire coding region is correctly predicted, from start to stop, with all intron boundaries correctly identified.) LC1 and LC2 also improved over the individual gene finders, with 54% and 61% correct, respectively. SC-g gets 2% more test genes correct (63%) than does LC2. SC and SC-g identified 88% of the exons correctly (both the 5′ and 3′ ends were correct), whereas the individual gene finders’ accuracies ranged from 61%–79%. LC1 and LC2, at 84% and 86%, showed intermediate levels of performance gain. A missed gene (exon) occurs when no predicted exon overlaps the gene’s exons (exon) by one base or more. LC1, LC2, and SC missed small and roughly equal numbers of exons (4% of the 7510 total), whereas SC-g missed 3%. The LCs and SC-g also missed fewer gene models (only 1%) than SC and the best individual gene finder (2% each). Of the 31 whole genes missed by SC, 19 are single exon genes, indicating a possible area for improvement. At the nucleotide level, which measures the percentage of coding bases in the test genes correctly labeled as coding, all four Combiners achieve 97%–98% accuracy. The number of “wrong exons,” predicted exons that do not overlap a true exon at all, is substantially smaller for SC, with 27 wrong exons, than for any of the gene finders, which had from 41 to 415 wrong exons. The lower overall correct gene count for GlimmerM is due to a tendency to insert short exons (the average exon length is 54 nucleotides). An additional measure of Combiner performance is its accuracy relative to the agreement among the gene finders. If, for example, the Combiner gene model is only correct when it agrees with at least two of the ab initio gene finders, then a simple majority-vote rule might work just as well. Results comparing the performance among all four Combiners with respect to the gene prediction programs are summarized in Table 2.
The table shows results divided according to whether each Combiner agreed with zero, one, two, or three gene finders; for example, SC agreed with two of the gene finders on 527 genes, and those predictions were correct for 440 (84%) of them. The Combiners get 97% of the genes correct when all three gene finders agree; these are presumably “easy” genes for automated prediction. (Note that when all three gene finders agree, the Combiners never disagreed with the
    
    Table 3. The Number of Gene Models Each Gene Finder Exclusively Predicts Correctly in Test Set 2

    Program         Correct genes
    TwinScan        206
    GeneMark.hmm     59
    GlimmerM         41
    GlimmerM2        39
    Genscan+         31
    
    
    Table 4. Performance for Gene Predictors Including TwinScan and Retrained GlimmerM in Addition to the Programs Listed in Table 1

             CG            MG         CE            ME          WE     SN
    SC-5     1385 (78%)    17 (1%)    6952 (93%)    196 (3%)     22    98%
    SC-3     1345 (75%)    24 (1%)    6911 (92%)    194 (3%)     34    98%
    SC-5g    1320 (74%)    16 (1%)    6878 (92%)    173 (2%)     37    98%
    LC2-3    1293 (73%)    12 (1%)    6810 (91%)    156 (2%)     77    99%
    LC1-3    1206 (68%)    14 (1%)    6692 (89%)    207 (3%)     48    98%
    TS       1200 (67%)    17 (1%)    6569 (87%)    299 (4%)     66    96%
    GM2       563 (32%)    10 (1%)    5321 (71%)    673 (9%)    386    93%
    
    SC-5 indicates SC using all five gene prediction programs; SC-3, SC using three gene prediction programs; SC-5g, SC using five gene prediction programs and no alignment data; LC2-3, LC2 using three gene prediction programs; LC1-3, LC1 using three gene prediction programs; TS, TwinScan; and GM2, newer GlimmerM output. The three prediction programs used by SC-3, LC2-3, and LC1-3 are TwinScan, GeneMark.hmm, and GM2.
    
    among the gene finders. Unlike LC1 and LC2, SC performance improves further when adding one or two more gene prediction programs on top of the first three. Table 4 shows these results as SC-3 for three gene finders and SC-5 for all five.

    DISCUSSION
    Two of the leading resources for annotation of the human genome are Ensembl and the National Center for Biotechnology Information (NCBI), each of which applies a different collection of computational tools to construct its gene predictions (Birney et al. 2002). NCBI aligns all known genes from the Reference Sequence database and from GenBank mRNA sequences to the genomic sequence by using MegaBLAST (Zhang et al. 2000), retaining matches with 95% identity and covering 50% of the putative transcript (http://www.ncbi.nlm.nih.gov/genome/guide/build.html#gene). Alternate overlapping models are merged into a single gene. Additional genes are reported based on GenomeScan (Yeh et al. 2001) predictions, excluding those that overlap the set of genes identified by alignment. Ensembl takes a similar approach, first aligning known proteins to the genome and using GeneWise (Birney and Durbin 1997) to determine the gene models. Additional predictions come from exons predicted by Genscan, filtered according to whether they correspond to BLAST (Altschul et al. 1990) matches to a protein database. Ensembl also incorporates EST alignments to refine the predicted gene models. Each gene reported from the automated pipeline is supported by sequence alignment evidence, but it is not clear how many of these regions correctly identify each gene model boundary: translation starts, splice sites, and stop sites. Neither of these human genome pipelines incorporates an explicit method for combining multiple gene finders; because our Combiner is open source, it should be easy for these and other annotation providers to include it in their pipelines.

    For many organisms, multiple gene finding tools successfully identify protein coding regions in the genome. Our results show that, even in cases in which one program is clearly more accurate, other prediction tools provide useful information, correctly finding some exons and genes that the other programs miss. The difficulty lies in checking each protein coding region to decide when and how to use each piece of evidence. LC2 does surprisingly well considering it only requires that the user assign a weight to each evidence source (see Methods). SC, however, provides a more robust model for incorporating different types of evidence. It uses training data to build its own nonlinear model for combining the evidence. The SC also provides a way to make use of multiple overlapping gene models from a single prediction program, for example, those produced by different parameter settings. This allows the Combiner to decide which alternative model is best supported by sequence alignment evidence, rather than relying on the single best prediction.

    An important element of the Combiner approach is to treat each source of evidence as a black box, which enables the use of gene model evidence from any source, as long as the predictions are provided as sequence coordinates. Separating the Combiner from the evidence software allows us to apply the Combiner to each genome sequencing project by using sequence analysis software specific for that organism. The success of the Combiner depends on the accuracy of the underlying evidence, and continued improvements in gene prediction algorithms, as illustrated by the TwinScan results in this study, should improve future Combiner results.

    Figure 2 Partitioned output from three evidence types: splice predictions, gene predictions, and sequence alignments. The five sources of evidence (listed in order from top to bottom) are output from a splice prediction program (SP); a gene prediction program (GP1) with exon confidence scores 0.9 and 0.89; a gene prediction program (GP2) with no confidence scores; 89% and 45% identity alignments from a protein database, which make up a single evidence source; and 32% and 20% identity alignments from an EST database. The genome sequence is divided into intervals defined by each potential boundary x1,x2,…,x8. The non-overlapping intervals I1,…,I7 are used to score gene models. The predicted splice site at x5 is associated with I5.
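The interval partition described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the Combiner's implementation: the function names are hypothetical, features (predicted exons or alignments) are assumed to be inclusive `(start, end)` pairs, and each feature is assumed to contribute a boundary at its start and at the base just past its end, so that consecutive boundaries x_i and x_(i+1) delimit the interval from x_i to x_(i+1) − 1.

```python
from typing import Iterable, List, Tuple

Feature = Tuple[int, int]  # inclusive (start, end) of a prediction or alignment


def boundaries(features: Iterable[Feature], signals: Iterable[int] = ()) -> List[int]:
    """Collect every potential boundary position, sorted and de-duplicated."""
    xs = set(signals)
    for start, end in features:
        xs.add(start)    # an interval begins where a feature begins
        xs.add(end + 1)  # and a new interval begins just after it ends
    return sorted(xs)


def partition(xs: List[int]) -> List[Tuple[int, int]]:
    """Non-overlapping intervals: interval i spans xs[i] .. xs[i + 1] - 1."""
    return [(xs[i], xs[i + 1] - 1) for i in range(len(xs) - 1)]
```

For example, a predicted exon at 100–199, a protein alignment at 150–249, and a splice site at 200 yield the boundaries 100, 150, 200, and 250, and hence three intervals over which evidence can be scored separately.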
    
    
    METHODS
    Linear Combiner
    Four types of signals are considered for LC1: start codons, stop codons, splice donors, and splice acceptors. Processing the input sequence from left to right, LC1 computes partial gene models representing the optimal sequence of signals leading to any given signal in the sequence. Signals are linked together in a gene model only when such a link is biologically meaningful; for instance, a donor site (the end of an exon) can only be linked back to a previous acceptor site (the beginning of an exon) or a start codon. In either case, the sequence between the donor site and the previous signal is scored as a protein coding interval. An acceptor site can only be linked to a preceding donor site, and the intervening sequence is scored as a noncoding interval. We use a dynamic programming algorithm that scans the signals from left to right (5′ to 3′ in the DNA sequence) and, for each interval bounded by a pair of signals, computes a separate score for each of the three possible reading frames. (This ensures that all exons in a gene model are in the same frame.) The score is computed as a weighted sum of the evidence. More precisely, let S be an input sequence, Sx be the base at position x, Sx…y be the subsequence spanning the interval from x to y, and Sigx and Sigy be a signal at location x and y, respectively. We compute scores in a dynamic programming matrix D, as follows:

    $$D(\mathrm{Sig}_y) = \max_{x < y} \left[ D(\mathrm{Sig}_x) + \sum_{k=1}^{m} w_k \, h_k(S_{x \ldots y}) \right]$$

    where wk is the weight of gene finder k and hk is its score for the subsequence Sx…y. For protein coding intervals, we compute hk for gene finder k by simply counting the number of bases predicted to be coding by that gene finder; conversely, we count the number of bases predicted to be noncoding for noncoding intervals. Essentially, this formula says that for signal Sigy, we scan back and compute a score for each previous signal by adding together the previous signal’s score plus the weighted evidence for the intervening sequence. We then choose the best total score and store it at y. We construct a gene model by tracing the “parse” back through the matrix. Because computation is done only at positions containing one of the signal types, the computational complexity is O(mn²), where m is the number of gene finders and n is the number of signals detected.

    Both the SC and LC2 use a modified version of the LC1 algorithm to construct gene models from non-overlapping sequence intervals. The most significant difference is that these algorithms include sequence alignment information as additional evidence types. The other major difference is that rather than computing scores only between predefined signal positions, the algorithm computes scores for subsequence Sx…y, where x and y can be (1) the locations of signals or (2) boundaries of alignment regions. Figure 2 illustrates how a genome is split into sequence intervals where interval I1 = Sx1…(x2−1), I2 = Sx2…(x3−1), etc. Note that the intervals in Figure 2 can begin or end in the middle of an exon (or intron), because alignments are not constrained by exon boundaries. Figure 3 shows an example in which multiple overlapping gene models occur in a single reading frame. Interval I4 is part of four different candidate gene models: It is alternatively a complete exon, the interior of an exon, the 5′ end of an exon, or the 3′ end of an exon. If these predictions are all in the same reading frame, only the highest-scoring partial gene model will be stored at x5.

    The evidence for each subsequence Sx…y is captured in a vector v = [v(1), v(2), …, v(m)], for m different types of evidence. We compute the values v(k) by using the scores from the gene finders or the percentage of similarity from the alignment algorithms. For gene finders that do not score each exon, we use a value of one for predicted coding intervals and zero for noncoding intervals. Because many protein (respectively, EST) sequences can align to the same place, we choose the alignment […]

    Figure 3 An example of four overlapping candidate gene models G1 through G4. The exons are assumed to be part of the same coding frame. In this example, if the evidence only predicts G1 and G2, the combiner scores G3 or G4 if either model is optimal.

    Table 5. The Set of Labels That Describe Each Sequence Interval and Are Used to Construct Gene Models on the Positive Strand
    Interval labels: Acceptor (a), Start (r), Coding (c), Donor (d), Stop (t)

    […] 0.3 and a protein aligns with >50% identity, whereas Leaf2 considers only one condition: whether a prediction is made by GlimmerM. The criterion for splitting each node in an OC1 tree is nondeterministic in order to consider the wide range of possible solutions. As a result, a different OC1 tree is generated each time the training program is run. Because any one tree may not produce the best results, SC uses 10 decision trees for each of the five event models (acceptor, donor, coding, start, and stop). A single probability value is the average value from the 10 trees. Each decision tree is generated by using the default parameters for the OC1 software.
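The averaging step described above, in which one probability is obtained from several independently trained trees, might look like the following Python sketch. It does not use OC1 itself: the `stump` helper is a hypothetical stand-in that builds a one-split, axis-parallel tree (real OC1 trees are oblique and learned from training data), and the feature names in the usage note are invented for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List

Evidence = Dict[str, float]          # evidence vector for one interval
Tree = Callable[[Evidence], float]   # maps evidence to a leaf probability


def stump(feature: str, threshold: float, p_above: float, p_below: float) -> Tree:
    """Hypothetical one-split tree standing in for a trained decision tree."""
    def tree(ev: Evidence) -> float:
        return p_above if ev.get(feature, 0.0) > threshold else p_below
    return tree


def ensemble_probability(trees: List[Tree], ev: Evidence) -> float:
    """Average the leaf probabilities returned by every tree in the ensemble."""
    return mean(tree(ev) for tree in trees)
```

In use, one such ensemble would be kept per event model (acceptor, donor, coding, start, and stop), for instance in a dictionary keyed by event name, and `ensemble_probability` would be called with the evidence vector of the interval or signal being scored. Averaging over several randomized trees reduces the dependence of the score on any single nondeterministic training run.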
    
    Program Availability
    The original Linear Combiner (LC1) is implemented in Perl, and LC2 and SC are implemented in C++. The software is an open source package and freely available at http://www.tigr.org/software/combiner.
    
    ACKNOWLEDGMENTS
    This work was supported in part by the National Institutes of Health under grant R01-LM06845. We thank the two anonymous reviewers for their helpful comments. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
    
    WEB SITE REFERENCES
    http://www.ensembl.org; ENSEMBL. http://www.ncbi.nlm.nih.gov; NCBI. http://www.tigr.org; TIGR. http://genes.cs.wustl.edu; TWINSCAN.
    
    Received May 20, 2003; accepted in revised form November 4, 2003.
    