Archive for the ‘Scholarly Thoughts’ Category

CEB Journal Club: Price et al. (2012)

Members of the Computational and Evolutionary Biology (CEB) group at the University of Manchester participate in a monthly journal club, where a paper of broad interest is discussed. Here, I briefly describe the paper and its context, and summarize our conclusions about the methodology and results presented. (I have attempted to represent the discussion and consensus of the group, but any inaccuracies are my own.) For your reading convenience, this post is available as a pdf pdf.

Cyanophora paradoxa genome elucidates origin of photosynthesis in algae and plants. Dana C. Price et al. (2012) Science 335: 843-847. PubMed: 22344442
(Presented by James Allen. on Pi Day, 14th March 2012)

The paper in a sentence: Red algae, green algae and land plants, and glaucophytes (i.e. Plantae) are monophyletic; the photosynthetic ability of all plants derives from a single primary cyanobacterial endosymbiosis in their common ancestor.

Background: Photosynthesis is possible in plants due to the presence of a plastid, the remnant of an ancient endosymbiosis between a eukaryotic cell and a cyanobacteria. Until recently it was believed that a single, primary, endosymbiosis occurred in the ancestor of all plants and algae, and analysis of the plastid genome confirmed this. However, some phylogenomic analyses, enabled by increasing volumes of sequence data, provide either weak or no support for the monophyly of Plantae, suggesting the possibility of multiple endosymbioses. Resolving these issues is interesting because it sheds light (pun intended) on the nature of the first photosynthetic algae, and illuminates (OK, I’ll stop now) the fascinating, billion-year-old, events which gave rise to, ultimately, the daffodils in my front garden.

The paper in detail: There are three divisions in the Plantae, two of which (red and green algae) have been well studied; until now, relatively little information has been available for glaucophytes, a species-poor group of algal protists that constitute the third division. Price et al. sequence the genome of a glaucophyte, Cyanophora paradoxa in an attempt to gain sufficient data to convincingly confirm or refute the monophyly of Plantae. To move onto the more interesting stuff, I assume that the sequencing was done sufficiently well to provide reliable data (the authors give enough detail in the supplementary material to justify this assumption).

From a set of almost 28,000 predicted proteins, almost 4,500 had prokaryotic or eukaryotic homologs and were of passable quality. The authors state that they “generated 4,445 maximum likelihood trees from the C. paradoxa proteins and found that >60% support a sister-group relationship between glaucophytes and red and/or green algae with a bootstrap value ≥90%”. There are, I believe, a few problems with this sentence, if my interpretation of Figure 1B (reprinted below) is correct. In the first column of the figure, a total of 417 is given (subsequent columns are subsets of this first column, so we can ignore those for now); this is the number of maximum likelihood (ML) trees which contain 3 or more phyla, and in which the branch that places the glaucophytes in a monophyletic group has a bootstrap value ≥90%. So, >60% of “4,445” trees do not support Plantae monophyly; >60% of 417 trees do. But, in fact, only 44 of these trees contain all three divisions of Plantae; 118 pair glaucophytes with either red or green algae and a further 112 trees show evidence of endosymbiotic gene transfer (EGT), and are assumed to support monophyly. The evidence of these last two sets is certainly consistent with monophyly, but is weaker than cases where all three divisions are present.

The approach taken here, to discard a large amount of data that does not meet a relatively arbitrary bootstrap criteria, seems wasteful. There are established methods for combining information in multiple gene trees to generate a species tree (‘supertree’ methods), and although these are not always straightforward to use, I would have expected at least some discussion on why they were not applied. Another option for phylogenomic analyses is the supermatrix approach, in which protein sequences are concatenated before tree inference. Supermatrices may not be effective if the proteins do not share approximately the same evolutionary history, i.e. if EGT or HGT has occurred; but since the authors are able to fairly confidently detect these events (e.g. Figure 1C and 1D in the paper), these proteins could have been excluded from the analysis. Even if the authors’ approach is taken to be valid, >60% support for Plantae monophyly is not terribly convincing (incidentally, it is curious that the authors understate their case here, since the support value is actually 66% – why round it down?).

A final issue with the analysis of ML trees (before we move into more positive territory) is the lack of detail in the description of ML tree inference. The authors state that they are ‘using phylogenomics’ and reference a previous paper, in which a similar analysis is done; but the materials and methods section in that paper lacks detail, and some of it clearly does not apply. A concrete example of why this matters: RAxML can generate bootstrap replicates in different ways, which often does not affect any conclusions, but might be important here, where the analysis relies heavily on a particular cut-off value; it probably doesn’t make a difference, but the reader lacks the information needed to appropriately interpret the results.

I’ve gone into some detail about one aspect of the paper (the bit most pertinent to my area of research), and I am aware that I have been rather critical; often, in posts such as this, there is an understandable tendency to hem and haw, and obliquely imply that ‘perhaps the authors might have considered this or that’ and so on. But, I think my criticisms are fair, and are probably more interesting to read than the impending paragraph about the remainder of the paper, in which more convincing evidence of monophyly is presented…

The authors describe a number of proteins that are essential to eukaryotic photosynthesis, and demonstrate that these strongly suggest a common origin for red algae, green algae, and glaucophytes. The biological underpinning of these arguments, based on relatively few gene trees, is far more persuasive than the preceding phylogenomic analysis. There are also some interesting details on the gain and loss of fermentative enzymes, which would be clarified further with a greater number of species in the analysis.

Journal club conclusion: In places, the paper lacked sufficient, unequivocal, detail about the methods for us to wholly trust their conclusions. Some lines of evidence, however, were quite convincing, and we tend to believe that the Plantae are indeed monophyletic. Wider discussion of phylogenomics revealed a growing distrust of the results of such analyses; different researchers can arrive at contradictory answers to the same question, depending on how a dataset is selected and the exact nature of the analysis. Phylogenomics is necessarily complex, which makes it crucial for researchers to be meticulous when describing their data and methodology, so that we are able to decide to accept or reject their conclusions.

Formal Grammars

Throughout the course of my PhD I have found it useful to write summaries of the various aspects of phylogenetics and biology that I have learned. That these will be useful to others is perhaps a vain hope, in both senses of the word, but I thought I might as well publish some of them on my blog.

Formal grammars are an extremely useful tool for conceptualising and analysing nucelotide sequences, and I have briefly summarised the topic, with an example of their application in the modelling of RNA structure. The summary includes some figures which make for somewhat awkward reading in blog format, so I’ve made it available as a pdf pdf.

Citing the Document
[If referring to the document, please cite its location on the Monkeyshines website:]

LyX: LaTeX, the Easy(ish) Way


I’ve resisted learning TeX/LaTeX for years; I appreciate the principle, but I baulk at the idea of learning another language, particularly one which requires a bunch of opaque commands at the top of each file. I like a nice GUI, and Word, for all its faults, does give you lots of control if you use the styles correctly. And the EndNote Web plugin allows you to sort out referencing without too much pain.

And yet, and yet… LaTeX is the right way to do things, from a typesetting viewpoint, and also because it correctly places the emphasis on content. What’s a graphically minded chap to do? I recently discovered LyX, which claims to fill this gap in the market. I thought I’d see how easy it was to turn out a nice looking document with LyX by replicating a Word doc that I was intending to post on this blog (on tree comparison).

Citing References

To make life a bit easier for myself, I shelved sorting out a BibTeX library, and just copied the references from the Word doc. Getting the citations to appear correctly was trickier than I thought it should be; it was easy enough to change the document settings so that the bibliography was ‘natbib’ format. But then I had no control over the precise display of the citation, e.g. (Mackenzie, 2011) versus Mackenzie (2011). I resorted to writing the TeX tags directly, aided by an excellent Natbib reference sheet. But then I got some errors when compiling the LyX document in PDF format. After a bit of digging, I determined that the \usepackage{natbib} had disappeared from the LaTeX preamble – other than entering it manually, I couldn’t figure out how to get it back.

Inserting a Header

Getting a header was also somewhat convoluted, and again involved dipping a reluctant toe into grimy TeX-infested waters. I had an intermediate problem in that the help within LyX is not searchable, which seems odd; but the interwebs told me that I needed to switch on fancy headers in the page layout settings, and then amend the preamble again (grrr) with \lhead{}, \chead{}, and \rhead{Tree Comparison - James E. Allen}. You need the empty definitions to suppress default headings, which seems more annoying than useful.


I used Word’s built-in PDF-saving (it’s version 2007, btw), and created two versions, with and without automatic hyphenation. (I also switched on kerning in the hyphenated doc, but I couldn’t see that it had any effect…) For the LyX document, I used the XeTeX PDF exporter, since I wanted to use my favourite Windows font, Georgia.

Tree Comparison: Word

Tree Comparison: Word (Hyphenated)

Tree Comparison: LyX


LyX wasn’t as straightforward as I’d hoped, but now that I’ve sorted a few teething issues, I think I could use it in future with a minimum of fuss. As to the results, I think the LyX-based document looks the best; the Word version without hyphenation looks quite gappy, and the automatic hyphenation isn’t great. The LyX file has too much whitespace above the title for my liking, and I’d prefer gaps between the paragraphs rather than indentation, but I daresay I can find a different layout that I like better. The wider margins make the text more readable, but it is odd that “weighted” juts out into the right margin; I couldn’t figure out why, but I could fix it by adding {sloppypar} tags around that paragraph. (The LaTeX wikibook is excellent for figuring out stuff like that.)

Tree Comparison

Throughout the course of my degree I have found it useful to write summaries of the various aspects of phylogenetics and biology that I have learned. That these will be useful to others is perhaps a vain hope, in both senses of the word, but I thought I might as well publish some of them on my blog. (It also afforded me the chance to try my hand at LyX/LaTeX.) For your reading convenience, this post is available as a pdf pdf.

Tree Comparison
Phylogenetic trees have two properties that can usefully be compared, their topologies and their branch lengths. Usually, the desired outcome of a tree comparison is a single number, indicating how different the trees are from one another. Reducing multiple complex structures to a single interpretable digit is difficult, even when just comparing two trees; a range of methods have been developed, most of which use (sometimes implicitly) graph theoretical measures of distance. Note that this is different from the situation of tree evaluation, where the aim is to determine whether some trees are a better representation of evolutionary history than others. Tree comparison is often done after evaluation, to gauge how much credence and importance to give to the results of the evaluation (there is, as yet, no method to state formally that the difference between trees is significant).

Felsenstein (2004, pp.528-535) provides a historical overview of phylogenetic tree comparison, starting with the symmetric difference metric, also known as the Robinson-Foulds (RF) distance, which measures differences in topology between a pair of (possibly multifurcating) trees (Robinson and Foulds, 1981). The symmetric difference can be conceptualised as the minimum number of transformations that are required to convert one tree to the other, where a transformation corresponds to either removing a branch and merging the nodes it connected, or by splitting a node into two and inserting a branch between the new nodes. The symmetric difference is widely used, but can be highly sensitive; that is, it can have a high value for trees which are intuitively similar (Felsenstein, 2004).

Including information on branch lengths in tree comparisons is potentially useful, particularly when the tree has a relatively wide range of branch lengths. The weighted Robinson-Foulds distance (Robinson and Foulds, 1979) and the branch score (Kuhner and Felsenstein, 1994) are two metrics that use branch length information, and both are based on the symmetric difference. The weighted RF distance is the sum of the differences between corresponding branch lengths; a branch length is considered to be zero if it does not exist in one of the trees. The branch score is similar, but squares the differences before adding them, and the square root of this sum is named the branch-length distance (BLD) (Felsenstein, 2004).

The pair of trees being compared can be mapped to two points in tree space, which suggests another distance metric, the geodesic distance, defined as the shortest path between two points in tree space. In tree space, the weighted RF distance and the BLD correspond to Manhattan and Euclidean distances, respectively (Kupczok et al., 2008). Calculating the geodesic distance may be computationally prohibitive for large trees, but good approximations are available (Kupczok et al., 2008).

All of the distances that use branch lengths will produce relatively high values if the branches in one tree tend to be larger, even if the topologies are very similar; that is, if the evolutionary rate differs between the trees. This behaviour may or may not be desirable, so to prevent differences in rate from having a disproportionate effect, Kuhner and Felsenstein (1994) suggested using relative branch lengths, dividing each branch length by the sum of all branch lengths. As far as I am aware, this has not been implemented in any publicly available software. The K score is a modification of the BLD that scales one tree to have similar global divergence to the other before calculating the BLD, but the scaling means that the K score is no longer mathematically defined as a distance, and its use is not always appropriate (Soria-Carrasco et al., 2007).

Citing this Document
[If referring to this document, please cite its location on the Monkeyshines website:]


  • Felsenstein, J. (2004) Inferring Phylogenies. Sinauer, Sunderland, Massachusetts.
  • Kuhner,M.K. and Felsenstein, J. (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11, 459-468. Pubmed
  • Kupczok, A. et al. (2008) An exact algorithm for the geodesic distance between phylogenetic trees. Journal of Computational Biology, 15, 577-591. Pubmed
  • Robinson, D.F. and Foulds, L.R. (1979) Comparison of weighted labelled trees. Lecture Notes in Mathematics, 748, 119-126.
  • Robinson, D.F. and Foulds, L.R. (1981) Comparison of phylogenetic trees. Mathematical Biosciences, 53, 131-147.
  • Soria-Carrasco, V. et al. (2007) The Ktree score: quantification of differences in the relative branch length and topology of phylogenetic trees. Bioinformatics, 23, 2954-2956. Pubmed

The War-and-Peace-O-Meter

Some years ago I asked for War and Peace by Leo Tolstoy for Christmas; a new translation had just been published, and that seemed as good a reason as any to tackle this metaphorically and literally immense book. I do a lot of my reading on the bus, which has tended to put me off starting it, but I decided that I should stop procrastinating and get on with it. I’ve read the first few chapters, and I’m enjoying it – I was a bit worried that my historical ignorance might make the setting a bit confusing, but Tolstoy does a fine job of filling in the background, and there are a few judicious notes by the translator, Anthony Briggs (it’s the 2005 Penguin edition, by the way).

I thought it might be nice to visually track my progress, so, naturally, I built a War-and-Peace-O-Meter. I swiped the image from elsewhere, modified it a bit, then wrote some very simple CSS rules to do the “filling up”. For the War-and-Peace-O-Meter I only need a static level (and will be posting weekly updates), but to demonstrate how you can fill up the meter dynamically, you can change the percentage:


Geek Details

The image of the -o-meter is a gif with transparency in the middle. Behind that is a div with a white background, and behind that is a div with a red background. The height of the middle div is dynamically adjusted, revealing more or less of the red behind. This is a bit more complicated than absolutely necessary, but it’s easy to implement, and it’s flexible in that you don’t have to use a solid block of colour as the bottom layer, it could be an image that is gradually revealed. Like in pubs which have packets of nuts mounted on a bit of card, and which reveal a saucy lady as the packets are removed. No saucy ladies here though, this is a family-friendly website.


.ometer_coloured { background:crimson; width:90px; height:370px; } .ometer_mask { background:white; position:absolute; z-index:1; height:361px; width:90px; } #ometer { background:url(‘/img/ometer.gif’) no-repeat; position:absolute; z-index:10; width:90px; height:370px; }

<div class=’ometer_coloured’> <div id=’mask’ class=’ometer_mask’> <div id=’ometer’></div> </div> </div>

function updateMeter(ometer_mask_id, ometer_pc_id) { var pc = document.getElementById(ometer_pc_id).value; var numericExpression = /^[0-9\.]+$/; if (pc.match(numericExpression)){ var total_rows = 353; rows = Math.round(total_rows * (pc/100)); if (rows > total_rows) { rows = total_rows; } rows = 361 – rows; var mask = document.getElementById(ometer_mask_id) = rows+’px’; } else { alert(‘Numbers only, please.’); } }

T1DBase: gene models

I recently wrote an overview of T1DBase, an online resource for the type 1 diabetes (T1D) research community (Hulbert et al. 2007; Burren et al. 2011). I shall now describe one of the more interesting of my contributions to the project, calculating and graphically displaying gene models. In the month since I first started writing this document, the ‘Gene Models’ section of T1DBase has changed, to become a ‘Gene Overview’, which looks slightly different (that’ll teach me to take so long in writing this up). In particular, summary and consensus gene models are no longer displayed, and nor are gene models for older builds; these are still viewable via the T1DBase archive site. Here, I’ll describe the work that I did, rather than what appears on the current site. For your reading convenience, this post is available as a pdf pdf.

Genes and Gene Models
The first, and probably most difficult, step is to decide what a gene is in the first place. I started this work about 5 years ago, before the importance of non-protein-coding RNA was widely recognised, so the discussion here relates only to protein-coding genes, rather than RNA genes. I’m currently working on topics related to the evolution of RNA genes, so I think it’d be sensible to add them, which shouldn’t actually be too difficult. But anyway, how do we define a protein-coding gene? One answer might be “a section of DNA that is responsible for the creation of a functional protein”. But what if there are splice variants; which one do you choose, or do you merge them? And where do you get your information from? If from multiple sources, how do you deal with any conflicts in the delimitation of gene boundaries, or intron or UTR structure? And what if one source defines the same gene in multiple locations, perhaps on different chromosomes? You can deal with many of these issues by working with “gene models” rather than genes, where a gene model is a collection of structures for a single gene, from a single source. Figure 1 shows two gene models for the gene CTLA4, based on data from Ensembl and UCSC.

It is useful to have a common point of reference for gene models from different sources, and in T1DBase this is the RefSeq data from the NCBI. This ties in with what T1DBase considers to be a gene: in practical terms, something with an Entrez Gene ID. In previous versions of T1DBase, the NCBI gene is represented alongside each gene model in T1DBase, as a green box (Figure 1), but this is not shown in the current version of the site.

Figure 1. Gene models for CTLA4, from Ensembl and UCSC. Exons and introns are shown by boxes and connecting lines, respectively, and UTRs are highlighted in red. The green box displays the gene according to NCBI RefSeq data.

Gene Models in T1DBase
When I started on the T1Dbase project, the site already displayed gene models based on data from a number of sources. My task was two-fold: to make the graphical display clearer and prettier, and to find a way to usefully summarise gene models, both within and between different sources. At this point it would be remiss not to acknowledge the importance of discussions with, and advice from, my colleagues on the project, i.e. my fellow authors on the Hulbert et al. (2007) paper. Also, I used libraries from the BioPerl project, chiefly Bio::DB::GFF, Bio::SeqFeature and Bio::Graphics::Panel.

Gene models in T1DBase are shown on each gene page (e.g. the CTLA4 gene page). I won’t talk much about the graphical aspect of the work I did – I link to the code later, but it’s rather tied into the T1DBase code base to be of general use (although by all means contact me if you would like some more information or assistance).

It’s probably worth pointing out that T1DBase is not restricted to genes that are linked to type 1 diabetes, so if you are just after a nice display of gene models for a gene of interest, you can still use the T1DBase website. All of the NCBI genes are available, since you never know when a gene will be linked to diabetes (genes which have been linked will tend to have additional gene models). So if you want to look at gene models for any gene, based on data from 4 useful sources (CCDS, Ensembl, UCSC, and Vega), go to the T1DBase home page and type the gene name or ID into the search box in the top-right corner.

The most interesting aspect of the gene models in T1DBase relates to how multiple sources of data can eliminate spurious transcripts, and how data can be effectively summarised across all gene models.

Eliminating Incorrect Transcript Predictions
There are a few preparatory steps required before you can start weighing up transcripts from different sources, all of which are automated by a set of scripts that I wrote (and which continue to be maintained by the current T1DBase staff). These download the raw data for a given range of sources, species, and builds, convert them to GFF format, and load them into a Bio::DB::GFF database. One issue that arises with Ensembl and UCSC transcripts is that these sources assign Entrez Gene IDs to transcripts based on sequence similarity, irrespective of whether the gene and the transcript are at (roughly) the same location. This results in assignments which are wrong, so in order to sort out which Entrez Gene IDs go with which Ensembl/UCSC IDs, we cross-reference with RefSeq. A transcript is disregarded if it is on a different chromosome to the RefSeq-defined gene (chromosomes X and Y are considered to be the same chromosome); if it is on the same chromosome but a different strand; or if it is on the same chromosome and strand, but not within 100kb of the gene. This value of 100kb is chosen to be big enough to allow some variation, but to ignore infeasible large discrepancies; it is rather arbitrary, but works well in practice. (See the script for details.)

Having dealt with conflicts between the positions of transcripts within a data source, we turn our attention to comparisons across all of the data sources. We want to ascertain when a gene model from one or more sources is on a different chromosome or strand, or more than 1Mb distant (again, rather an arbitrary value for the distance, but sufficiently high that we are very unlikely to dismiss an accurate transcript). To start with, we need to decide on which sources we trust more than others (the discussion here relates to 4 sources, but the arguments are applicable to any sources). Ensembl and UCSC are comprehensive, but are largely automated; CCDS provides reliable positions for CDS regions but, by definition, no UTRs; and Vega is high-quality, hand-curated data. Using this information, for the transcripts with conflicting positions we apply the following logic to all of the transcripts for that gene:

  • If there is a single CCDS or Vega transcript, consider that to be the most likely position (termed the ‘tentative’ position), unless there are both CCDS and Vega transcripts, and they have conflicting positions.
  • For all sources, across all transcripts, count how many support each conflicting position.
  • If there are three or more conflicting positions, add the counts for all but the most supported position (MSP) together, so that we have two counts for comparison.
  • If the MSP is only supported by one transcript, mark all of the transcripts as ‘undecided’, as there is insufficient evidence to automatically resolve the conflict.
  • If the MSP has multiple transcripts, then compare it to the number of transcripts supporting other positions; if the ratio of the numbers is above a certain value, mark the MSP as ‘accepted’ and the unsupported transcripts as ‘rejected’; otherwise, mark all transcripts as ‘undecided’.
  • The ratio in the previous step depends on whether a tentative position from CCDS or Vega exists. If not, the ratio is 2:1, i.e. the number of transcripts for the MSP must be at least twice the number of all other transcripts. If the tentative position exists, and agrees with the MSP, the ratio is 1:1 (a simple majority); in the case of disagreement, the ratio is 3:1 (a lot of support is required to outweigh a hand-curated position).

Once this process is complete, there are then three groups of transcripts, those that can be confidently used or ignored (i.e. the ones marked as ‘accepted’ or ‘rejected’, respectively), and those that require caution (the ‘undecided’ set). This information is loaded into the Bio::DB::GFF database, to enable useful queries and further data modification. The script automates the entire process of building gene models, from downloading the data from different sources, to evaluating when they do and do not agree.

Summary and Consensus Gene Models
As mentioned above, summary and consensus gene models are not shown on the current version of T1DBase, but can be viewed on the archive version of T1DBase, e.g. CTLA4. So, having collated data on transcripts from multiple sources that are (approximately) in agreement, it is then useful to examine how much variation in the detail of those transcripts, for example, which exons are best supported. In ‘summary gene models’, the number of transcripts that support each base are plotted as a bar chart (Figure 2, top panel). Transcripts from different sources that have almost identical CDS regions (i.e. within 1 base in either direction) are shown as ‘consensus gene models’ (Figure 2, bottom panel); the bounds of the UTRs are allow to vary somewhat, proportional to the length of the gene. Hand-curated sources are considered sufficiently reliable to warrant a consensus gene model of their own. The summary and consensus models are calculated with Perl modules (see next section) that require some other T1DBase modules and configuration files. But the code is well commented and fairly generic, so I think only minor tweaks would be necessary to use it elsewhere; please let me know if you would like help in doing so.

Figure 2. Summary and consensus gene models for CTLA4. Exons and introns are shown by boxes and connecting lines, respectively. Darker shades of blue indicate greater support than lighter shades. The green box displays the gene span, which shows the bounds of the gene according to NCBI RefSeq data.

Gene Models Code
All of the source code behind T1DBase is available under the GNU GPL (see the website for details), from a sourceforge subversion repository. The gene models Perl scripts are still under active development, but the associated GeneModel Perl modules that I wrote are no longer being used. I have included a copy of the GeneModel modules on my website, plus versions of the scripts that I have modified to work with a Windows installation of MySQL and a slightly different set of data sources than the current T1DBase site. As mentioned above, the GeneModel modules are integrated into the site, and won’t work out of the box; please let me know if you want to use any of the functionality, and I would be delighted to help out.


  • Burren OS, Adlem EC, Achuthan P, Christensen M, Coulson RMR, Todd JA. 2011. T1DBase: update 2011, organization and presentation of large-scale data sets for type 1 diabetes research. Nucleic Acids Research 39(Database issue): D997-D1001. PubMed: 20937630
  • Hulbert EM, Smink LJ, Adlem EC, Allen JE, Burdick DB, Burren OS, Cassen VM, Cavnor CC, Dolman GE, Flamez D et al. 2007. T1DBase: integration and presentation of complex data for type 1 diabetes research. Nucleic Acids Research 35(Database issue): D742-746. PubMed: 17169983


The “department-sized grouping of researchers” among whom I work, the Computational and Evolutionary Biology (CEB) group at the University of Manchester, have started a blog about our journal club (at which I have presented in the past).


Photographic evidence that I did, indeed, attend the latest MASAMB conference. James At MASAMB

MASAMB XXI, in Vienna

I was pleased to be invited to present some of my work at the MASAMB conference in Vienna recently (11-12th April 2011). My talk was titled “Quantifying the effect of evolution and genomic alignment on de novo RNA gene prediction”. I have a similarly themed manuscript in preparation, so won’t go into detail here. But if you have any questions about the presentation, then please get in touch by . The abstract is included below the nice pics.

I enjoyed both the conference and the time I spent in Vienna, where I appreciated the architecture and culture, as well as the beer and the wildlife:

Quantifying the effect of evolution and genomic alignment on de novo RNA gene prediction

James Allen and Simon Whelan
University of Manchester, Faculty of Life Sciences, UK

Non-coding RNA is biologically important and linked to a range of diseases, yet there are potentially many RNA genes of unknown function that do not belong to characterized RNA families. The de novo prediction of RNA genes is difficult because they are heterogeneous, and tend to have fewer restrictions than protein-coding genes. These problems are compounded by the lack of independent datasets for benchmarking predictions. We identify alignments of genes from 9 well-defined RNA families in UCSC genomic data for 32 vertebrate species, totaling 296 gene regions, and use randomization to create appropriate negative-control datasets. We evaluate three popular prediction programs (CMfinder, EvoFold, and RNAz), testing their ability to detect RNA genes and accurately determine gene boundaries. Our results show that RNA genes in genomic alignments have different evolutionary characteristics to structurally-aware RNA alignments, affecting gene prediction, and provide practical suggestions for the development and application of prediction programs.

Whelan Lab Journal Club: Seemann et al. (2011)

Members of Simon Whelan’s lab at the University of Manchester participate in a regular journal club, where a paper with an evolutionary/phylogenetic slant is discussed. Here, I briefly describe a paper that I recently presented, and summarize our conclusions about the methodology and results. (I have attempted to represent the discussion and consensus of the group, but any inaccuracies are my own.) For your reading convenience, this post is available as a pdf pdf.

PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Stefan E. Seemann, Andreas S. Richter, Tanja Gesell, Rolf Backofen and Jan Gorodkin (2011) Bioinformatics 27: 2, 211-219. PubMed: 21088024
(Presented by James Allen, 31st March 2011)

The paper in a sentence: Multiple sequence alignments and methods of RNA structure prediction can be combined to detect interactions between non-coding RNA molecules, in order to better understand their function.

Background: The biological importance of ribosomal and transfer RNA (rRNA and tRNA) has long been recognised, but in the last 10 years other non-coding RNA molecules (ncRNA) have been shown to have a variety of biological roles. As with proteins, the structure of, and interactions between, ncRNA define their functions, which are generally poorly understood.

The paper in detail: There are many methods available for predicting RNA secondary structure, one of which (PETfold: Seemann, et al. 2008) forms the basis for this paper. Briefly, the idea of PETfold is to combine information about the thermodynamic folding potential with explicit evolutionary models that rely on the covariance of base pairs in multiple sequence alignments. Interactions between RNA molecules can be viewed as a relatively simple extension of the structure prediction of a single RNA molecule, if it assumed that the interactions are primarily defined by the same process of canonical base pairing that creates the stems in RNA structures.

The PETcofold program concatenates two alignments of RNA that are assumed to interact, then attempts to predict both the structure of each RNA and their interactions in a two-stage process. (The authors term this hierarchical folding, which perhaps implies the potential for more than two stages, but this does not seem to be possible in the current implementation.) In the first stage, the structure of each RNA is predicted independently, and base pairs that are reliably predicted (i.e. above a particular threshold) are marked as constrained. In the second stage, further pairs are predicted for the unconstrained bases, both within and between the two RNA molecules. Such a process allows for the modelling of pseudoknots, which may have an important role in the function of RNA molecules. Some members of our group questioned whether this process was a biologically plausible scenario of RNA folding and interacting, but until we know more about the mechanisms involved it is probably reasonable. In any case, there is no theoretical reason why the same software cannot constrain and unconstrain base pairing in a more complex manner, to model biology more realistically.

The evaluation of PETcofold is somewhat problematic, as there is no well-defined dataset of RNA-RNA interactions against which to test the program. The authors gather a set of 32 interactions, of 13 different bacterial ncRNA molecules, based on experimental evidence, and use this to test different parameters of the model. It appears that by allowing only the most reliable base pairs in the first stage, the interacting sites are predicted with optimal accuracy, with a mean MCC of around 0.5. The MCC indicates the trade-off between sensitivity and positive predictive value (PPV), but these measures are not provided, so it is not clear in which area the program performs well (e.g. an MCC of 0.5 could arise from sensitivity=1 and PPV=0.25; from sensitivity=0.25 and PPV=1; or somewhere in between). Evaluating structure and interaction prediction was even trickier, with only four examples of interactions in the literature where the structure of two interacting molecules was also known. And one of these has to be discarded as being probably incorrect, so although PETcofold performs better than other programs, this is a rather small set on which to base general conclusions.

The authors find that very few interacting sites in their dataset show evidence of the covariance that their model is designed to detect, and thus use simulations to demonstrate that the model can take advantage of covariance if it does exist. However, the manner in which they introduce covariance, by multiplying an underlying tree by a factor of up to 200, is perhaps not ideal; this simulates a large amount of evolution across all sites in the RNA, which may affect the results.

It is rather optimistically stated in the discussion that PETcofold could be used to predict interactions between the results of genomic screens for ncRNA; however, given that such scans return thousands of results, the requisite pairwise combinations of predicted RNAs will be prohibitively computationally expensive.

Journal club conclusion: The program PETcofold is a potentially useful way to predict canonical base-pairing between RNA molecules, using covariance information. It is not yet clear whether there is, in practice, sufficient detectable covariance in such interactions, but if not, the thermodynamic aspect of the program, and the application of hierarchical folding, may result in predictions that are as good as, or better than, more complicated programs, and in less time.

Seemann, S.E., Gorodkin, J. and Backofen, R. (2008) Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Research 36, 6355-6362.