CEB Journal Club: Price et al. (2012)

Members of the Computational and Evolutionary Biology (CEB) group at the University of Manchester participate in a monthly journal club, where a paper of broad interest is discussed. Here, I briefly describe the paper and its context, and summarize our conclusions about the methodology and results presented. (I have attempted to represent the discussion and consensus of the group, but any inaccuracies are my own.) For your reading convenience, this post is available as a pdf.

Cyanophora paradoxa genome elucidates origin of photosynthesis in algae and plants. Dana C. Price et al. (2012) Science 335: 843-847. PubMed: 22344442
(Presented by James Allen on Pi Day, 14th March 2012)

The paper in a sentence: Red algae, green algae and land plants, and glaucophytes (i.e. Plantae) are monophyletic; the photosynthetic ability of all plants derives from a single primary cyanobacterial endosymbiosis in their common ancestor.

Background: Photosynthesis is possible in plants due to the presence of a plastid, the remnant of an ancient endosymbiosis between a eukaryotic cell and a cyanobacterium. Until recently it was believed that a single, primary endosymbiosis occurred in the ancestor of all plants and algae, and analysis of the plastid genome confirmed this. However, some phylogenomic analyses, enabled by increasing volumes of sequence data, provide either weak or no support for the monophyly of Plantae, suggesting the possibility of multiple endosymbioses. Resolving these issues is interesting because it sheds light (pun intended) on the nature of the first photosynthetic algae, and illuminates (OK, I’ll stop now) the fascinating, billion-year-old events which gave rise to, ultimately, the daffodils in my front garden.

The paper in detail: There are three divisions in the Plantae, two of which (red and green algae) have been well studied; until now, relatively little information has been available for glaucophytes, a species-poor group of algal protists that constitute the third division. Price et al. sequence the genome of a glaucophyte, Cyanophora paradoxa, in an attempt to gain sufficient data to convincingly confirm or refute the monophyly of Plantae. To move onto the more interesting stuff, I assume that the sequencing was done sufficiently well to provide reliable data (the authors give enough detail in the supplementary material to justify this assumption).

From a set of almost 28,000 predicted proteins, almost 4,500 had prokaryotic or eukaryotic homologs and were of passable quality. The authors state that they “generated 4,445 maximum likelihood trees from the C. paradoxa proteins and found that >60% support a sister-group relationship between glaucophytes and red and/or green algae with a bootstrap value ≥90%”. There are, I believe, a few problems with this sentence, if my interpretation of Figure 1B (reprinted below) is correct. In the first column of the figure, a total of 417 is given (subsequent columns are subsets of this first column, so we can ignore those for now); this is the number of maximum likelihood (ML) trees which contain 3 or more phyla, and in which the branch that places the glaucophytes in a monophyletic group has a bootstrap value ≥90%. So it is not >60% of the 4,445 trees that support Plantae monophyly, but >60% of these 417 trees. And, in fact, only 44 of the 417 contain all three divisions of Plantae; 118 pair glaucophytes with either red or green algae, and a further 112 show evidence of endosymbiotic gene transfer (EGT) and are assumed to support monophyly. The evidence of these last two sets is certainly consistent with monophyly, but is weaker than cases where all three divisions are present.
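The proportions are easy to check by hand, but a few lines of Python make the different denominators explicit. This is only a sketch: the counts are the ones quoted above from Figure 1B, the variable names are my own, and I am assuming that the three categories (44, 118 and 112) together make up the "supporting" set.

```python
# Counts as quoted above from Figure 1B of the paper.
total_trees = 4445          # ML trees generated by the authors
well_supported = 417        # trees with >= 3 phyla and bootstrap >= 90%
all_three_divisions = 44    # glaucophytes with red AND green algae
one_partner = 118           # glaucophytes with red OR green algae
egt = 112                   # trees showing endosymbiotic gene transfer

# Assumption: these three categories form the set counted as supporting monophyly.
supporting = all_three_divisions + one_partner + egt


def percent(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


print("supporting trees:", supporting)
print("% of the 417 well-supported trees:", percent(supporting, well_supported))
print("% of all 4,445 trees:", percent(supporting, total_trees))
print("% of the 417 with all three divisions:", percent(all_three_divisions, well_supported))
```

The same 274 trees are roughly two thirds of one denominator and a small fraction of the other, which is the ambiguity in the quoted sentence.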

The approach taken here, to discard a large amount of data that does not meet a relatively arbitrary bootstrap criterion, seems wasteful. There are established methods for combining information in multiple gene trees to generate a species tree (‘supertree’ methods), and although these are not always straightforward to use, I would have expected at least some discussion on why they were not applied. Another option for phylogenomic analyses is the supermatrix approach, in which protein sequences are concatenated before tree inference. Supermatrices may not be effective if the proteins do not share approximately the same evolutionary history, i.e. if EGT or HGT has occurred; but since the authors are able to fairly confidently detect these events (e.g. Figure 1C and 1D in the paper), these proteins could have been excluded from the analysis. Even if the authors’ approach is taken to be valid, >60% support for Plantae monophyly is not terribly convincing (incidentally, it is curious that the authors understate their case here, since the support value is actually 66%; why round it down?).
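To make the supermatrix idea concrete, here is a toy sketch of the concatenation step, with genes flagged as EGT/HGT left out. It is not what the authors did, and everything in it (the function name, the gene and taxon labels, the three-residue "alignments") is hypothetical; real pipelines would also handle alignment trimming and partitioning.

```python
def build_supermatrix(alignments, exclude=()):
    """Concatenate per-gene alignments into one sequence per taxon.

    alignments: dict mapping gene name -> {taxon: aligned sequence}
    exclude:    gene names to skip (e.g. genes flagged as EGT/HGT)
    Taxa missing from a gene are padded with gap characters.
    """
    genes = [g for g in sorted(alignments) if g not in exclude]
    taxa = sorted({t for g in genes for t in alignments[g]})
    supermatrix = {t: "" for t in taxa}
    for gene in genes:
        aln = alignments[gene]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences differ in length")
        width = lengths.pop()
        for taxon in taxa:
            # Pad with gaps when this taxon has no sequence for the gene.
            supermatrix[taxon] += aln.get(taxon, "-" * width)
    return supermatrix


# Hypothetical toy data: three genes, one of which is flagged as transferred.
toy = {
    "geneA": {"glaucophyte": "MKV", "red_alga": "MKI", "green_alga": "MRV"},
    "geneB": {"glaucophyte": "LLT", "green_alga": "LMT"},
    "geneC": {"glaucophyte": "GGA", "red_alga": "GGA", "green_alga": "GCA"},
}
matrix = build_supermatrix(toy, exclude={"geneC"})
for taxon, seq in matrix.items():
    print(taxon, seq)
```

The resulting single alignment would then be passed to a tree inference program; the point is simply that genes with a suspect history can be dropped before concatenation rather than discarding most of the data.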

A final issue with the analysis of ML trees (before we move into more positive territory) is the lack of detail in the description of ML tree inference. The authors state that they are ‘using phylogenomics’ and reference a previous paper, in which a similar analysis is done; but the materials and methods section in that paper lacks detail, and some of it clearly does not apply. A concrete example of why this matters: RAxML can generate bootstrap replicates in different ways, which often does not affect any conclusions, but might be important here, where the analysis relies heavily on a particular cut-off value; it probably doesn’t make a difference, but the reader lacks the information needed to appropriately interpret the results.

I’ve gone into some detail about one aspect of the paper (the bit most pertinent to my area of research), and I am aware that I have been rather critical; often, in posts such as this, there is an understandable tendency to hem and haw, and obliquely imply that ‘perhaps the authors might have considered this or that’ and so on. But, I think my criticisms are fair, and are probably more interesting to read than the impending paragraph about the remainder of the paper, in which more convincing evidence of monophyly is presented…

The authors describe a number of proteins that are essential to eukaryotic photosynthesis, and demonstrate that these strongly suggest a common origin for red algae, green algae, and glaucophytes. The biological underpinning of these arguments, based on relatively few gene trees, is far more persuasive than the preceding phylogenomic analysis. There are also some interesting details on the gain and loss of fermentative enzymes, which would be clarified further with a greater number of species in the analysis.

Journal club conclusion: In places, the paper lacked sufficient, unequivocal detail about the methods for us to wholly trust its conclusions. Some lines of evidence, however, were quite convincing, and we tend to believe that the Plantae are indeed monophyletic. Wider discussion of phylogenomics revealed a growing distrust of the results of such analyses; different researchers can arrive at contradictory answers to the same question, depending on how a dataset is selected and the exact nature of the analysis. Phylogenomics is necessarily complex, which makes it crucial for researchers to be meticulous when describing their data and methodology, so that we are able to decide whether to accept or reject their conclusions.


The “department-sized grouping of researchers” among whom I work, the Computational and Evolutionary Biology (CEB) group at the University of Manchester, have started a blog about our journal club (at which I have presented in the past).

Whelan Lab Journal Club: Seemann et al. (2011)

Members of Simon Whelan’s lab at the University of Manchester participate in a regular journal club, where a paper with an evolutionary/phylogenetic slant is discussed. Here, I briefly describe a paper that I recently presented, and summarize our conclusions about the methodology and results. (I have attempted to represent the discussion and consensus of the group, but any inaccuracies are my own.) For your reading convenience, this post is available as a pdf.

PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Stefan E. Seemann, Andreas S. Richter, Tanja Gesell, Rolf Backofen and Jan Gorodkin (2011) Bioinformatics 27: 2, 211-219. PubMed: 21088024
(Presented by James Allen, 31st March 2011)

The paper in a sentence: Multiple sequence alignments and methods of RNA structure prediction can be combined to detect interactions between non-coding RNA molecules, in order to better understand their function.

Background: The biological importance of ribosomal and transfer RNA (rRNA and tRNA) has long been recognised, but in the last 10 years other non-coding RNA molecules (ncRNA) have been shown to have a variety of biological roles. As with proteins, the structure of, and interactions between, ncRNA define their functions, which are generally poorly understood.

The paper in detail: There are many methods available for predicting RNA secondary structure, one of which (PETfold: Seemann et al. 2008) forms the basis for this paper. Briefly, the idea of PETfold is to combine information about the thermodynamic folding potential with explicit evolutionary models that rely on the covariance of base pairs in multiple sequence alignments. Interactions between RNA molecules can be viewed as a relatively simple extension of the structure prediction of a single RNA molecule, if it is assumed that the interactions are primarily defined by the same process of canonical base pairing that creates the stems in RNA structures.

The PETcofold program concatenates two alignments of RNA that are assumed to interact, then attempts to predict both the structure of each RNA and their interactions in a two-stage process. (The authors term this hierarchical folding, which perhaps implies the potential for more than two stages, but this does not seem to be possible in the current implementation.) In the first stage, the structure of each RNA is predicted independently, and base pairs that are reliably predicted (i.e. above a particular threshold) are marked as constrained. In the second stage, further pairs are predicted for the unconstrained bases, both within and between the two RNA molecules. Such a process allows for the modelling of pseudoknots, which may have an important role in the function of RNA molecules. Some members of our group questioned whether this process was a biologically plausible scenario of RNA folding and interacting, but until we know more about the mechanisms involved it is probably reasonable. In any case, there is no theoretical reason why the same software cannot constrain and unconstrain base pairing in a more complex manner, to model biology more realistically.

The evaluation of PETcofold is somewhat problematic, as there is no well-defined dataset of RNA-RNA interactions against which to test the program. The authors gather a set of 32 interactions, of 13 different bacterial ncRNA molecules, based on experimental evidence, and use this to test different parameters of the model. It appears that by allowing only the most reliable base pairs in the first stage, the interacting sites are predicted with optimal accuracy, with a mean Matthews correlation coefficient (MCC) of around 0.5. The MCC combines sensitivity and positive predictive value (PPV) into a single number, but these two measures are not provided separately, so it is not clear in which area the program performs well (e.g. an MCC of 0.5 could arise from sensitivity=1 and PPV=0.25; from sensitivity=0.25 and PPV=1; or somewhere in between). Evaluating structure and interaction prediction was even trickier, with only four examples of interactions in the literature where the structure of two interacting molecules was also known. And one of these had to be discarded as being probably incorrect, so although PETcofold performs better than other programs, this is a rather small set on which to base general conclusions.
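The numbers in that parenthetical follow from the approximation MCC ≈ sqrt(sensitivity × PPV), which is commonly used for base-pair prediction because true negatives vastly outnumber everything else. I am assuming that this is the form the authors used; the sketch below compares it with the full formula on made-up counts (the function names and the confusion-matrix values are mine).

```python
import math


def approx_mcc(sensitivity, ppv):
    """Geometric-mean approximation to the MCC (assumes many true negatives)."""
    return math.sqrt(sensitivity * ppv)


def full_mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


# Very different sensitivity/PPV combinations that give the same score.
for sens, ppv in [(1.0, 0.25), (0.25, 1.0), (0.5, 0.5)]:
    print(f"sensitivity={sens}, PPV={ppv}, approx MCC={approx_mcc(sens, ppv)}")

# Hypothetical counts: every true pair found (sensitivity = 1), but only a
# quarter of the predictions are correct (PPV = 0.25), with many true negatives.
print(full_mcc(tp=25, fp=75, fn=0, tn=1_000_000))
```

A program that over-predicts wildly and one that predicts very little can therefore share the same MCC, which is why the separate measures would have been informative.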

The authors find that very few interacting sites in their dataset show evidence of the covariance that their model is designed to detect, and thus use simulations to demonstrate that the model can take advantage of covariance if it does exist. However, the manner in which they introduce covariance, by multiplying an underlying tree by a factor of up to 200, is perhaps not ideal; this simulates a large amount of evolution across all sites in the RNA, which may affect the results.

It is rather optimistically stated in the discussion that PETcofold could be used to predict interactions between the results of genomic screens for ncRNA; however, given that such scans return thousands of results, the requisite pairwise combinations of predicted RNAs will be prohibitively computationally expensive.
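The scaling problem is just the number of unordered pairs, n(n-1)/2. A quick sketch; the per-pair running time is a purely hypothetical placeholder, not a measured figure for PETcofold.

```python
import math


def pairwise_comparisons(n_candidates):
    """Number of unordered pairs among n predicted ncRNAs."""
    return math.comb(n_candidates, 2)


# Hypothetical cost of one run on one pair of alignments; not a measurement.
ASSUMED_SECONDS_PER_PAIR = 1.0

for n in (100, 1_000, 10_000):
    pairs = pairwise_comparisons(n)
    days = pairs * ASSUMED_SECONDS_PER_PAIR / 86_400
    print(f"{n:>6} candidates -> {pairs:>10,} pairs "
          f"(~{days:,.1f} days at {ASSUMED_SECONDS_PER_PAIR} s per pair)")
```

Going from a thousand to ten thousand candidates multiplies the number of pairs by about a hundred, so some form of pre-filtering would be needed before a method like this could be applied to the output of a genomic screen.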

Journal club conclusion: The program PETcofold is a potentially useful way to predict canonical base-pairing between RNA molecules, using covariance information. It is not yet clear whether there is, in practice, sufficient detectable covariance in such interactions, but if not, the thermodynamic aspect of the program, and the application of hierarchical folding, may result in predictions that are as good as, or better than, more complicated programs, and in less time.

Seemann, S.E., Gorodkin, J. and Backofen, R. (2008) Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Research 36, 6355-6362.

CEB Journal Club: Andam et al. (2010)

Members of the Computational and Evolutionary Biology (CEB) group at the University of Manchester participate in a monthly journal club, where a paper of broad interest is discussed. Here, I briefly describe the paper and its context, and summarize our conclusions about the methodology and results presented. (I have attempted to represent the discussion and consensus of the group, but any inaccuracies are my own.) For your reading convenience, this post is available as a pdf.

Biased gene transfer mimics patterns created through shared ancestry. Cheryl P. Andam, David Williams, and J. Peter Gogarten (2010) PNAS 107: 23, 10679-10684. PubMed: 20495090
(Presented by James Allen at Jabez Clegg, 28th July 2010)

The paper in a sentence: The authors describe a specific case of a gene that makes a bacterial enzyme, which has been horizontally transferred between species in a biased manner, such that the molecular evidence resembles that of a gene transferred by descent from parent to offspring.

Background: Until relatively recently, genetic information was thought largely to have been transferred from parent to offspring, analogous to a branching tree structure. The applicability of this analogy for all forms of life is under debate, however, given the discovery of the extent of other mechanisms for gene transfer in bacteria and other single-celled organisms. Horizontal gene transfer (HGT) refers to the process where genetic data from one organism is transferred to another which is not necessarily related, nor even necessarily the same species; the prevalence of HGT calls into question not only the ‘tree of life’ metaphor (suggesting, perhaps, that a network analogy is more appropriate), but also the (already rather labile) concept of species.

The paper in detail: The authors present one key result, which is supplemented by evidence from three other sources which would not be convincing in isolation, but here provide valuable circumstantial support. The results are based on a particular enzyme, which has the important property (for this analysis) that it has two distinct types. The main result is that the tree in figure 1 in the paper, generated by looking solely at this enzyme, has two distinct sub-trees, representing each of the two types. Each one of these sub-trees closely resembles the tree that most likely characterizes the vertical inheritance of genetic data, i.e. the ‘species tree’ in figure 2. It is not easy to quantify whether one tree structure resembles another, particularly with the number of species used here; the authors look at the distances along the tree branches that separate all pairs of species, which discards information about some of the tree structure, but does not prevent them from convincingly demonstrating that the sub-trees for each type resemble the species tree. Moreover, in the species tree, the species with the same type of enzyme are grouped together within broader groupings at the phylum or class level; i.e. there are patches of red and green branches (representing the two types) in figure 2. This is evidence for biased HGT because it shows that HGT occurs not in a random fashion, but more often between more closely related species.
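As a rough illustration of comparing trees via pairwise distances, the sketch below computes the path length between every pair of leaves in two small trees and correlates the two sets of distances. This is my own toy version, not the authors' procedure: the edge-list representation, the four-taxon trees, the branch lengths and the use of a Pearson correlation are all assumptions made for the example.

```python
import math
from itertools import combinations


def leaf_distances(edges, leaves):
    """Path-length distance between every pair of leaves.

    edges: list of (node_a, node_b, branch_length) describing a tree.
    """
    graph = {}
    for a, b, length in edges:
        graph.setdefault(a, []).append((b, length))
        graph.setdefault(b, []).append((a, length))

    def distances_from(start):
        # Depth-first walk; in a tree each node is reached by exactly one path.
        dist = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour, length in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + length
                    stack.append(neighbour)
        return dist

    all_dist = {leaf: distances_from(leaf) for leaf in leaves}
    return {(a, b): all_dist[a][b] for a, b in combinations(sorted(leaves), 2)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


leaves = ["A", "B", "C", "D"]
# Hypothetical species tree ((A,B),(C,D)).
species = [("A", "u", 1), ("B", "u", 1), ("u", "v", 2), ("C", "v", 1), ("D", "v", 1)]
# Hypothetical gene tree with the same topology but different branch lengths.
congruent = [("A", "x", 0.5), ("B", "x", 0.7), ("x", "y", 1.5), ("C", "y", 0.6), ("D", "y", 0.4)]
# Hypothetical gene tree with a conflicting topology ((A,C),(B,D)).
conflicting = [("A", "p", 1), ("C", "p", 1), ("p", "q", 2), ("B", "q", 1), ("D", "q", 1)]

ref = leaf_distances(species, leaves)
pairs = sorted(ref)
for name, tree in [("congruent", congruent), ("conflicting", conflicting)]:
    other = leaf_distances(tree, leaves)
    r = pearson([ref[p] for p in pairs], [other[p] for p in pairs])
    print(name, round(r, 3))
```

The congruent gene tree gives a correlation close to 1 and the conflicting one a negative value, but the distances alone do not say which branches differ, which is the sense in which this kind of comparison discards some of the tree structure.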

Another line of evidence presented is that a scenario of gene gain and loss that would explain the trees is far less likely than one where some degree of HGT occurs; the authors gloss over the fact that this demonstrates that HGT, rather than biased HGT, has most likely occurred. Additionally, the genes that surround the enzyme’s gene are found to be similar for both types, which would not be the case if the genes were being repeatedly gained and lost; again, this is evidence for HGT, not necessarily biased HGT.

The final piece of supporting evidence comes via simulations of biased and unbiased HGT, which result in data that resembles the real data. Some of the choices for the simulations are questionable, in particular the modelling of reciprocal transfer events, meaning that genes from two species are swapped. This does not reflect the biological reality, where the transfer generally happens in one direction only. Also, an extreme bias is modelled, using an exponential function, so that transfers are likely to occur between only the most closely related species – this may well be realistic, but the use of this particular model is not justified by the authors. Finally, the unbiased and biased transfers are simulated sequentially, which was perhaps done as it is often easier to show that something is changing, rather than staying the same, but is an uncommon approach that makes it difficult to interpret the results.
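To give a feel for how strong an exponential bias can be, here is a minimal sketch in which the chance of a given species being the recipient falls off as exp(-rate × distance) from the donor. The rate values, the distances and the species labels are all invented for illustration; the paper's actual function and parameters are not reproduced here.

```python
import math
import random


def recipient_probabilities(distances, rate):
    """Probability of each species receiving the gene, given its distance
    from the donor; rate = 0 corresponds to unbiased transfer."""
    weights = {sp: math.exp(-rate * d) for sp, d in distances.items()}
    total = sum(weights.values())
    return {sp: w / total for sp, w in weights.items()}


# Hypothetical evolutionary distances from a single donor species.
distances = {"sister": 0.1, "same_class": 0.5, "same_phylum": 1.0, "other_phylum": 2.0}

unbiased = recipient_probabilities(distances, rate=0.0)
biased = recipient_probabilities(distances, rate=5.0)  # assumed, fairly extreme rate

for species in distances:
    print(f"{species:>12}: unbiased={unbiased[species]:.3f}  biased={biased[species]:.3f}")

# Simulate one-directional transfers from the donor under the biased model.
rng = random.Random(1)
recipients = rng.choices(list(biased), weights=list(biased.values()), k=1000)
print({sp: recipients.count(sp) for sp in distances})
```

With a rate this high nearly all transfers go to the closest relative, which shows why the choice of function and its parameter deserve some justification.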

Journal club conclusion: While not wholly convinced by some of the evidence presented, particularly the approach to simulation, we believe that the main conclusions of the paper are valid: in the case of this particular enzyme, the horizontal gene transfer is biased, such that transfer is more likely between more similar species, and thus the molecular data provides the same signal as transmission through vertical inheritance. It remains to be shown how widespread this phenomenon is; if HGT generally reinforces, rather than contradicts, vertical inheritance of genetic material, then the tree of life analogy may well be useful for practical purposes, even if it does not reflect the true evolutionary history.