Archive for April, 2011

T1DBase: gene models

I recently wrote an overview of T1DBase, an online resource for the type 1 diabetes (T1D) research community (Hulbert et al. 2007; Burren et al. 2011). I shall now describe one of the more interesting of my contributions to the project, calculating and graphically displaying gene models. In the month since I first started writing this document, the ‘Gene Models’ section of T1DBase has changed, to become a ‘Gene Overview’, which looks slightly different (that’ll teach me to take so long in writing this up). In particular, summary and consensus gene models are no longer displayed, and nor are gene models for older builds; these are still viewable via the T1DBase archive site. Here, I’ll describe the work that I did, rather than what appears on the current site. For your reading convenience, this post is available as a pdf pdf.

Genes and Gene Models
The first, and probably most difficult, step is to decide what a gene is in the first place. I started this work about 5 years ago, before the importance of non-protein-coding RNA was widely recognised, so the discussion here relates only to protein-coding genes, rather than RNA genes. I’m currently working on topics related to the evolution of RNA genes, so I think it’d be sensible to add them, which shouldn’t actually be too difficult. But anyway, how do we define a protein-coding gene? One answer might be “a section of DNA that is responsible for the creation of a functional protein”. But what if there are splice variants; which one do you choose, or do you merge them? And where do you get your information from? If from multiple sources, how do you deal with any conflicts in the delimitation of gene boundaries, or intron or UTR structure? And what if one source defines the same gene in multiple locations, perhaps on different chromosomes? You can deal with many of these issues by working with “gene models” rather than genes, where a gene model is a collection of structures for a single gene, from a single source. Figure 1 shows two gene models for the gene CTLA4, based on data from Ensembl and UCSC.

It is useful to have a common point of reference for gene models from different sources, and in T1DBase this is the RefSeq data from the NCBI. This ties in with what T1DBase considers to be a gene: in practical terms, something with an Entrez Gene ID. In previous versions of T1DBase, the NCBI gene is represented alongside each gene model in T1DBase, as a green box (Figure 1), but this is not shown in the current version of the site.

Figure 1. Gene models for CTLA4, from Ensembl and UCSC. Exons and introns are shown by boxes and connecting lines, respectively, and UTRs are highlighted in red. The green box displays the gene according to NCBI RefSeq data.

Gene Models in T1DBase
When I started on the T1Dbase project, the site already displayed gene models based on data from a number of sources. My task was two-fold: to make the graphical display clearer and prettier, and to find a way to usefully summarise gene models, both within and between different sources. At this point it would be remiss not to acknowledge the importance of discussions with, and advice from, my colleagues on the project, i.e. my fellow authors on the Hulbert et al. (2007) paper. Also, I used libraries from the BioPerl project, chiefly Bio::DB::GFF, Bio::SeqFeature and Bio::Graphics::Panel.

Gene models in T1DBase are shown on each gene page (e.g. the CTLA4 gene page). I won’t talk much about the graphical aspect of the work I did – I link to the code later, but it’s rather tied into the T1DBase code base to be of general use (although by all means contact me if you would like some more information or assistance).

It’s probably worth pointing out that T1DBase is not restricted to genes that are linked to type 1 diabetes, so if you are just after a nice display of gene models for a gene of interest, you can still use the T1DBase website. All of the NCBI genes are available, since you never know when a gene will be linked to diabetes (genes which have been linked will tend to have additional gene models). So if you want to look at gene models for any gene, based on data from 4 useful sources (CCDS, Ensembl, UCSC, and Vega), go to the T1DBase home page and type the gene name or ID into the search box in the top-right corner.

The most interesting aspect of the gene models in T1DBase relates to how multiple sources of data can eliminate spurious transcripts, and how data can be effectively summarised across all gene models.

Eliminating Incorrect Transcript Predictions
There are a few preparatory steps required before you can start weighing up transcripts from different sources, all of which are automated by a set of scripts that I wrote (and which continue to be maintained by the current T1DBase staff). These download the raw data for a given range of sources, species, and builds, convert them to GFF format, and load them into a Bio::DB::GFF database. One issue that arises with Ensembl and UCSC transcripts is that these sources assign Entrez Gene IDs to transcripts based on sequence similarity, irrespective of whether the gene and the transcript are at (roughly) the same location. This results in assignments which are wrong, so in order to sort out which Entrez Gene IDs go with which Ensembl/UCSC IDs, we cross-reference with RefSeq. A transcript is disregarded if it is on a different chromosome to the RefSeq-defined gene (chromosomes X and Y are considered to be the same chromosome); if it is on the same chromosome but a different strand; or if it is on the same chromosome and strand, but not within 100kb of the gene. This value of 100kb is chosen to be big enough to allow some variation, but to ignore infeasible large discrepancies; it is rather arbitrary, but works well in practice. (See the script for details.)

Having dealt with conflicts between the positions of transcripts within a data source, we turn our attention to comparisons across all of the data sources. We want to ascertain when a gene model from one or more sources is on a different chromosome or strand, or more than 1Mb distant (again, rather an arbitrary value for the distance, but sufficiently high that we are very unlikely to dismiss an accurate transcript). To start with, we need to decide on which sources we trust more than others (the discussion here relates to 4 sources, but the arguments are applicable to any sources). Ensembl and UCSC are comprehensive, but are largely automated; CCDS provides reliable positions for CDS regions but, by definition, no UTRs; and Vega is high-quality, hand-curated data. Using this information, for the transcripts with conflicting positions we apply the following logic to all of the transcripts for that gene:

  • If there is a single CCDS or Vega transcript, consider that to be the most likely position (termed the ‘tentative’ position), unless there are both CCDS and Vega transcripts, and they have conflicting positions.
  • For all sources, across all transcripts, count how many support each conflicting position.
  • If there are three or more conflicting positions, add the counts for all but the most supported position (MSP) together, so that we have two counts for comparison.
  • If the MSP is only supported by one transcript, mark all of the transcripts as ‘undecided’, as there is insufficient evidence to automatically resolve the conflict.
  • If the MSP has multiple transcripts, then compare it to the number of transcripts supporting other positions; if the ratio of the numbers is above a certain value, mark the MSP as ‘accepted’ and the unsupported transcripts as ‘rejected’; otherwise, mark all transcripts as ‘undecided’.
  • The ratio in the previous step depends on whether a tentative position from CCDS or Vega exists. If not, the ratio is 2:1, i.e. the number of transcripts for the MSP must be at least twice the number of all other transcripts. If the tentative position exists, and agrees with the MSP, the ratio is 1:1 (a simple majority); in the case of disagreement, the ratio is 3:1 (a lot of support is required to outweigh a hand-curated position).

Once this process is complete, there are then three groups of transcripts, those that can be confidently used or ignored (i.e. the ones marked as ‘accepted’ or ‘rejected’, respectively), and those that require caution (the ‘undecided’ set). This information is loaded into the Bio::DB::GFF database, to enable useful queries and further data modification. The script automates the entire process of building gene models, from downloading the data from different sources, to evaluating when they do and do not agree.

Summary and Consensus Gene Models
As mentioned above, summary and consensus gene models are not shown on the current version of T1DBase, but can be viewed on the archive version of T1DBase, e.g. CTLA4. So, having collated data on transcripts from multiple sources that are (approximately) in agreement, it is then useful to examine how much variation in the detail of those transcripts, for example, which exons are best supported. In ‘summary gene models’, the number of transcripts that support each base are plotted as a bar chart (Figure 2, top panel). Transcripts from different sources that have almost identical CDS regions (i.e. within 1 base in either direction) are shown as ‘consensus gene models’ (Figure 2, bottom panel); the bounds of the UTRs are allow to vary somewhat, proportional to the length of the gene. Hand-curated sources are considered sufficiently reliable to warrant a consensus gene model of their own. The summary and consensus models are calculated with Perl modules (see next section) that require some other T1DBase modules and configuration files. But the code is well commented and fairly generic, so I think only minor tweaks would be necessary to use it elsewhere; please let me know if you would like help in doing so.

Figure 2. Summary and consensus gene models for CTLA4. Exons and introns are shown by boxes and connecting lines, respectively. Darker shades of blue indicate greater support than lighter shades. The green box displays the gene span, which shows the bounds of the gene according to NCBI RefSeq data.

Gene Models Code
All of the source code behind T1DBase is available under the GNU GPL (see the website for details), from a sourceforge subversion repository. The gene models Perl scripts are still under active development, but the associated GeneModel Perl modules that I wrote are no longer being used. I have included a copy of the GeneModel modules on my website, plus versions of the scripts that I have modified to work with a Windows installation of MySQL and a slightly different set of data sources than the current T1DBase site. As mentioned above, the GeneModel modules are integrated into the site, and won’t work out of the box; please let me know if you want to use any of the functionality, and I would be delighted to help out.


  • Burren OS, Adlem EC, Achuthan P, Christensen M, Coulson RMR, Todd JA. 2011. T1DBase: update 2011, organization and presentation of large-scale data sets for type 1 diabetes research. Nucleic Acids Research 39(Database issue): D997-D1001. PubMed: 20937630
  • Hulbert EM, Smink LJ, Adlem EC, Allen JE, Burdick DB, Burren OS, Cassen VM, Cavnor CC, Dolman GE, Flamez D et al. 2007. T1DBase: integration and presentation of complex data for type 1 diabetes research. Nucleic Acids Research 35(Database issue): D742-746. PubMed: 17169983


The “department-sized grouping of researchers” among whom I work, the Computational and Evolutionary Biology (CEB) group at the University of Manchester, have started a blog about our journal club (at which I have presented in the past).


Photographic evidence that I did, indeed, attend the latest MASAMB conference. James At MASAMB

MASAMB XXI, in Vienna

I was pleased to be invited to present some of my work at the MASAMB conference in Vienna recently (11-12th April 2011). My talk was titled “Quantifying the effect of evolution and genomic alignment on de novo RNA gene prediction”. I have a similarly themed manuscript in preparation, so won’t go into detail here. But if you have any questions about the presentation, then please get in touch by . The abstract is included below the nice pics.

I enjoyed both the conference and the time I spent in Vienna, where I appreciated the architecture and culture, as well as the beer and the wildlife:

Quantifying the effect of evolution and genomic alignment on de novo RNA gene prediction

James Allen and Simon Whelan
University of Manchester, Faculty of Life Sciences, UK

Non-coding RNA is biologically important and linked to a range of diseases, yet there are potentially many RNA genes of unknown function that do not belong to characterized RNA families. The de novo prediction of RNA genes is difficult because they are heterogeneous, and tend to have fewer restrictions than protein-coding genes. These problems are compounded by the lack of independent datasets for benchmarking predictions. We identify alignments of genes from 9 well-defined RNA families in UCSC genomic data for 32 vertebrate species, totaling 296 gene regions, and use randomization to create appropriate negative-control datasets. We evaluate three popular prediction programs (CMfinder, EvoFold, and RNAz), testing their ability to detect RNA genes and accurately determine gene boundaries. Our results show that RNA genes in genomic alignments have different evolutionary characteristics to structurally-aware RNA alignments, affecting gene prediction, and provide practical suggestions for the development and application of prediction programs.

Whelan Lab Journal Club: Seemann et al. (2011)

Members of Simon Whelan’s lab at the University of Manchester participate in a regular journal club, where a paper with an evolutionary/phylogenetic slant is discussed. Here, I briefly describe a paper that I recently presented, and summarize our conclusions about the methodology and results. (I have attempted to represent the discussion and consensus of the group, but any inaccuracies are my own.) For your reading convenience, this post is available as a pdf pdf.

PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Stefan E. Seemann, Andreas S. Richter, Tanja Gesell, Rolf Backofen and Jan Gorodkin (2011) Bioinformatics 27: 2, 211-219. PubMed: 21088024
(Presented by James Allen, 31st March 2011)

The paper in a sentence: Multiple sequence alignments and methods of RNA structure prediction can be combined to detect interactions between non-coding RNA molecules, in order to better understand their function.

Background: The biological importance of ribosomal and transfer RNA (rRNA and tRNA) has long been recognised, but in the last 10 years other non-coding RNA molecules (ncRNA) have been shown to have a variety of biological roles. As with proteins, the structure of, and interactions between, ncRNA define their functions, which are generally poorly understood.

The paper in detail: There are many methods available for predicting RNA secondary structure, one of which (PETfold: Seemann, et al. 2008) forms the basis for this paper. Briefly, the idea of PETfold is to combine information about the thermodynamic folding potential with explicit evolutionary models that rely on the covariance of base pairs in multiple sequence alignments. Interactions between RNA molecules can be viewed as a relatively simple extension of the structure prediction of a single RNA molecule, if it assumed that the interactions are primarily defined by the same process of canonical base pairing that creates the stems in RNA structures.

The PETcofold program concatenates two alignments of RNA that are assumed to interact, then attempts to predict both the structure of each RNA and their interactions in a two-stage process. (The authors term this hierarchical folding, which perhaps implies the potential for more than two stages, but this does not seem to be possible in the current implementation.) In the first stage, the structure of each RNA is predicted independently, and base pairs that are reliably predicted (i.e. above a particular threshold) are marked as constrained. In the second stage, further pairs are predicted for the unconstrained bases, both within and between the two RNA molecules. Such a process allows for the modelling of pseudoknots, which may have an important role in the function of RNA molecules. Some members of our group questioned whether this process was a biologically plausible scenario of RNA folding and interacting, but until we know more about the mechanisms involved it is probably reasonable. In any case, there is no theoretical reason why the same software cannot constrain and unconstrain base pairing in a more complex manner, to model biology more realistically.

The evaluation of PETcofold is somewhat problematic, as there is no well-defined dataset of RNA-RNA interactions against which to test the program. The authors gather a set of 32 interactions, of 13 different bacterial ncRNA molecules, based on experimental evidence, and use this to test different parameters of the model. It appears that by allowing only the most reliable base pairs in the first stage, the interacting sites are predicted with optimal accuracy, with a mean MCC of around 0.5. The MCC indicates the trade-off between sensitivity and positive predictive value (PPV), but these measures are not provided, so it is not clear in which area the program performs well (e.g. an MCC of 0.5 could arise from sensitivity=1 and PPV=0.25; from sensitivity=0.25 and PPV=1; or somewhere in between). Evaluating structure and interaction prediction was even trickier, with only four examples of interactions in the literature where the structure of two interacting molecules was also known. And one of these has to be discarded as being probably incorrect, so although PETcofold performs better than other programs, this is a rather small set on which to base general conclusions.

The authors find that very few interacting sites in their dataset show evidence of the covariance that their model is designed to detect, and thus use simulations to demonstrate that the model can take advantage of covariance if it does exist. However, the manner in which they introduce covariance, by multiplying an underlying tree by a factor of up to 200, is perhaps not ideal; this simulates a large amount of evolution across all sites in the RNA, which may affect the results.

It is rather optimistically stated in the discussion that PETcofold could be used to predict interactions between the results of genomic screens for ncRNA; however, given that such scans return thousands of results, the requisite pairwise combinations of predicted RNAs will be prohibitively computationally expensive.

Journal club conclusion: The program PETcofold is a potentially useful way to predict canonical base-pairing between RNA molecules, using covariance information. It is not yet clear whether there is, in practice, sufficient detectable covariance in such interactions, but if not, the thermodynamic aspect of the program, and the application of hierarchical folding, may result in predictions that are as good as, or better than, more complicated programs, and in less time.

Seemann, S.E., Gorodkin, J. and Backofen, R. (2008) Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Research 36, 6355-6362.

Book Battle 3: Akira Book 5 vs Adrian Mole and the Weapons of Mass Destruction

After allowing weapons in the previous battle, it only seems natural for these books to properly tool themselves up. I’ve written about the first 4 Akira books before (part one, part two); they’re written by Katsuhiro Otomo, and volume 5 is the penultimate book in the series. Adrian Mole and the WMD is, clearly, the latest in the Adrian Mole series by Sue Townsend; I read the first one when I was about 13¾ myself, so I remember very little of it, and I probably didn’t get most of the jokes. As in previous book battles, it is the books themselves, rather than the eponymous characters, that square up, and the rounds are decided using the ‘All Adjectives’ list from my random word generator.

Round 1: Defiant
Adrian Mole and the WMD pulls a bazooka from its pages and fires off a missile that perversely represents it’s defiant anti-war stance. But the anti-establishment tone of Akira 5, and Kaneda’s complete lack of compromise, allow the book to leap out of the way while pumping a few rounds from a machine gun into its opponent. Verdict: Akira 5.

Round 2: Gentle
The gentle humour at the start of Adrian Mole and the WMD allows it to rally, and it aims a rifle squarely at Akira 5; but Gielgud the swan and the steeliness of the novel cause the gun to jam. Equally, Akira 5 has little to offer in this round – the romantic(ish) scenes with Kei and Kaneda, and Kaori and Tetsuo provide some gentle relief from the action, but this a sci-fi, action-filled, brute of a book Verdict: Draw.

Round 3: Sharp
The nimble, perceptive narrative of Adrian Mole and the WMD allows it to attach a bayonet to its rifle, while Akira 5 fiddles with a bio-dart gun. Akira 5 sustains a few lacerations before it fires off a bio-dart, and the needle-sharp detail and precision of the drawing, and the book’s wonderfully honed characterisation ward off a groggy Adrian Mole and the WMD. Verdict: Draw.

Round 4: Elegant
The stylish Akira 5 capitalises on its advantage, and it bombards Adrian Mole and the WMD with elegantly depicted visions of destruction. The broad satire of the latter causes it to stumble, and Akira 5‘s graceful lines carry the round. Verdict: Akira 5.

Round 5: Outstanding
Akira 5 tugs a small trigger from a hidden recess, carefully points it at Adrian Mole and the WMD, and squeezes. In orbit, a satellite whirrs, its base glows phosphorus white, and a brilliant laser arcs down from the heavens. Akira 5 literally blows Adrian Mole and the WMD away. Verdict: Akira 5.

The winner: Akira 5. Adrian Mole and the WMD is very funny in places, but is ultimately no match for the remarkable Akira 5.

The Perils of Feline Beauty

<Sigh> When a cat is as handsome as me, sometimes it’s hard to avoid the glare of the paparazzi.