Monkeyshines

Archive for the ‘Coding’ Category

LyX: LaTeX, the Easy(ish) Way

Introduction

I’ve resisted learning TeX/LaTeX for years; I appreciate the principle, but I baulk at the idea of learning another language, particularly one which requires a bunch of opaque commands at the top of each file. I like a nice GUI, and Word, for all its faults, does give you lots of control if you use the styles correctly. And the EndNote Web plugin allows you to sort out referencing without too much pain.

And yet, and yet… LaTeX is the right way to do things, from a typesetting viewpoint, and also because it correctly places the emphasis on content. What’s a graphically minded chap to do? I recently discovered LyX, which claims to fill this gap in the market. I thought I’d see how easy it was to turn out a nice looking document with LyX by replicating a Word doc that I was intending to post on this blog (on tree comparison).

Citing References

To make life a bit easier for myself, I shelved sorting out a BibTeX library, and just copied the references from the Word doc. Getting the citations to appear correctly was trickier than I thought it should be; it was easy enough to change the document settings so that the bibliography was ‘natbib’ format. But then I had no control over the precise display of the citation, e.g. (Mackenzie, 2011) versus Mackenzie (2011). I resorted to writing the TeX tags directly, aided by an excellent Natbib reference sheet. But then I got some errors when compiling the LyX document in PDF format. After a bit of digging, I determined that the \usepackage{natbib} had disappeared from the LaTeX preamble – other than entering it manually, I couldn’t figure out how to get it back.

Inserting a Header

Getting a header was also somewhat convoluted, and again involved dipping a reluctant toe into grimy TeX-infested waters. I had an intermediate problem in that the help within LyX is not searchable, which seems odd; but the interwebs told me that I needed to switch on fancy headers in the page layout settings, and then amend the preamble again (grrr) with \lhead{}, \chead{}, and \rhead{Tree Comparison - James E. Allen}. You need the empty definitions to suppress default headings, which seems more annoying than useful.
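For reference, the preamble additions ended up looking something like this (a sketch only; the exact lines depend on the document class that LyX generates, and LyX's 'fancy' page-layout setting loads the fancyhdr package itself):

```latex
% Re-add natbib, which had disappeared from the preamble
\usepackage{natbib}

% Headers: empty \lhead and \chead suppress the default headings;
% \rhead puts the title and author at the top right of each page
\lhead{}
\chead{}
\rhead{Tree Comparison - James E. Allen}
```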

Results

I used Word’s built-in PDF-saving (it’s version 2007, btw), and created two versions, with and without automatic hyphenation. (I also switched on kerning in the hyphenated doc, but I couldn’t see that it had any effect…) For the LyX document, I used the XeTeX PDF exporter, since I wanted to use my favourite Windows font, Georgia.

Tree Comparison: Word

Tree Comparison: Word (Hyphenated)

Tree Comparison: LyX

Verdict

LyX wasn’t as straightforward as I’d hoped, but now that I’ve sorted a few teething issues, I think I could use it in future with a minimum of fuss. As to the results, I think the LyX-based document looks the best; the Word version without hyphenation looks quite gappy, and the automatic hyphenation isn’t great. The LyX file has too much whitespace above the title for my liking, and I’d prefer gaps between the paragraphs rather than indentation, but I daresay I can find a different layout that I like better. The wider margins make the text more readable, but it is odd that “weighted” juts out into the right margin; I couldn’t figure out why, but I could fix it by wrapping that paragraph in a sloppypar environment. (The LaTeX wikibook is excellent for figuring out stuff like that.)
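For the record, the sloppypar fix is just a matter of wrapping the paragraph:

```latex
\begin{sloppypar}
Text of the offending paragraph goes here; the sloppypar environment
relaxes LaTeX's spacing rules, so it prefers loose lines to overfull ones.
\end{sloppypar}
```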

The War-and-Peace-O-Meter

Some years ago I asked for War and Peace by Leo Tolstoy for Christmas; a new translation had just been published, and that seemed as good a reason as any to tackle this metaphorically and literally immense book. I do a lot of my reading on the bus, which has tended to put me off starting it, but I decided that I should stop procrastinating and get on with it. I’ve read the first few chapters, and I’m enjoying it – I was a bit worried that my historical ignorance might make the setting a bit confusing, but Tolstoy does a fine job of filling in the background, and there are a few judicious notes by the translator, Anthony Briggs (it’s the 2005 Penguin edition, by the way).

I thought it might be nice to visually track my progress, so, naturally, I built a War-and-Peace-O-Meter. I swiped the image from elsewhere, modified it a bit, then wrote some very simple CSS rules to do the “filling up”. For the War-and-Peace-O-Meter itself I only need a static level (I’ll be posting weekly updates), but the code below also shows how to fill the meter dynamically from a user-entered percentage.

Geek Details

Method:
The image of the -o-meter is a GIF with transparency in the middle. Behind that is a div with a white background, and behind that is a div with a red background. The height of the middle div is dynamically adjusted, revealing more or less of the red behind. This is a bit more complicated than absolutely necessary, but it’s easy to implement, and it’s flexible in that you don’t have to use a solid block of colour as the bottom layer; it could be an image that is gradually revealed. Like in pubs which have packets of nuts mounted on a bit of card, and which reveal a saucy lady as the packets are removed. No saucy ladies here though, this is a family-friendly website.

Image:
ometer.gif

CSS:
.ometer_coloured { background: crimson; width: 90px; height: 370px; }
.ometer_mask { background: white; position: absolute; z-index: 1; height: 361px; width: 90px; }
#ometer { background: url('/img/ometer.gif') no-repeat; position: absolute; z-index: 10; width: 90px; height: 370px; }

HTML:
<div class='ometer_coloured'>
  <div id='mask' class='ometer_mask'>
    <div id='ometer'></div>
  </div>
</div>

JS:
function updateMeter(ometer_mask_id, ometer_pc_id) {
  var pc = document.getElementById(ometer_pc_id).value;
  var numericExpression = /^[0-9\.]+$/;
  if (pc.match(numericExpression)) {
    var total_rows = 353;
    var rows = Math.round(total_rows * (pc / 100));
    if (rows > total_rows) {
      rows = total_rows;
    }
    rows = 361 - rows; // mask height: 361px minus the filled portion
    var mask = document.getElementById(ometer_mask_id);
    mask.style.height = rows + 'px';
  } else {
    alert('Numbers only, please.');
  }
}

T1DBase: gene models

I recently wrote an overview of T1DBase, an online resource for the type 1 diabetes (T1D) research community (Hulbert et al. 2007; Burren et al. 2011). I shall now describe one of the more interesting of my contributions to the project: calculating and graphically displaying gene models. In the month since I first started writing this document, the ‘Gene Models’ section of T1DBase has changed to become a ‘Gene Overview’, which looks slightly different (that’ll teach me to take so long in writing this up). In particular, summary and consensus gene models are no longer displayed, and nor are gene models for older builds; these are still viewable via the T1DBase archive site. Here, I’ll describe the work that I did, rather than what appears on the current site. For your reading convenience, this post is available as a PDF.

Genes and Gene Models
The first, and probably most difficult, step is to decide what a gene is in the first place. I started this work about 5 years ago, before the importance of non-protein-coding RNA was widely recognised, so the discussion here relates only to protein-coding genes, rather than RNA genes. I’m currently working on topics related to the evolution of RNA genes, so I think it’d be sensible to add them, which shouldn’t actually be too difficult. But anyway, how do we define a protein-coding gene? One answer might be “a section of DNA that is responsible for the creation of a functional protein”. But what if there are splice variants; which one do you choose, or do you merge them? And where do you get your information from? If from multiple sources, how do you deal with any conflicts in the delimitation of gene boundaries, or intron or UTR structure? And what if one source defines the same gene in multiple locations, perhaps on different chromosomes? You can deal with many of these issues by working with “gene models” rather than genes, where a gene model is a collection of structures for a single gene, from a single source. Figure 1 shows two gene models for the gene CTLA4, based on data from Ensembl and UCSC.

It is useful to have a common point of reference for gene models from different sources, and in T1DBase this is the RefSeq data from the NCBI. This ties in with what T1DBase considers to be a gene: in practical terms, something with an Entrez Gene ID. In previous versions of T1DBase, the NCBI gene is represented alongside each gene model in T1DBase, as a green box (Figure 1), but this is not shown in the current version of the site.

Figure 1. Gene models for CTLA4, from Ensembl and UCSC. Exons and introns are shown by boxes and connecting lines, respectively, and UTRs are highlighted in red. The green box displays the gene according to NCBI RefSeq data.

Gene Models in T1DBase
When I started on the T1DBase project, the site already displayed gene models based on data from a number of sources. My task was two-fold: to make the graphical display clearer and prettier, and to find a way to usefully summarise gene models, both within and between different sources. At this point it would be remiss not to acknowledge the importance of discussions with, and advice from, my colleagues on the project, i.e. my fellow authors on the Hulbert et al. (2007) paper. Also, I used libraries from the BioPerl project, chiefly Bio::DB::GFF, Bio::SeqFeature and Bio::Graphics::Panel.

Gene models in T1DBase are shown on each gene page (e.g. the CTLA4 gene page). I won’t talk much about the graphical aspect of the work I did – I link to the code later, but it’s rather too tied into the T1DBase code base to be of general use (although by all means contact me if you would like some more information or assistance).

It’s probably worth pointing out that T1DBase is not restricted to genes that are linked to type 1 diabetes, so if you are just after a nice display of gene models for a gene of interest, you can still use the T1DBase website. All of the NCBI genes are available, since you never know when a gene will be linked to diabetes (genes which have been linked will tend to have additional gene models). So if you want to look at gene models for any gene, based on data from 4 useful sources (CCDS, Ensembl, UCSC, and Vega), go to the T1DBase home page and type the gene name or ID into the search box in the top-right corner.

The most interesting aspect of the gene models in T1DBase relates to how multiple sources of data can eliminate spurious transcripts, and how data can be effectively summarised across all gene models.

Eliminating Incorrect Transcript Predictions
There are a few preparatory steps required before you can start weighing up transcripts from different sources, all of which are automated by a set of scripts that I wrote (and which continue to be maintained by the current T1DBase staff). These download the raw data for a given range of sources, species, and builds, convert them to GFF format, and load them into a Bio::DB::GFF database. One issue that arises with Ensembl and UCSC transcripts is that these sources assign Entrez Gene IDs to transcripts based on sequence similarity, irrespective of whether the gene and the transcript are at (roughly) the same location. This results in assignments which are wrong, so in order to sort out which Entrez Gene IDs go with which Ensembl/UCSC IDs, we cross-reference with RefSeq. A transcript is disregarded if it is on a different chromosome to the RefSeq-defined gene (chromosomes X and Y are considered to be the same chromosome); if it is on the same chromosome but a different strand; or if it is on the same chromosome and strand, but not within 100kb of the gene. This value of 100kb is chosen to be big enough to allow some variation, but to ignore infeasibly large discrepancies; it is rather arbitrary, but works well in practice. (See the resolve_id_ambiguities.pl script for details.)
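The filtering rule can be sketched as follows (a hypothetical JavaScript rendition, just for illustration; the real work is done in Perl by resolve_id_ambiguities.pl):

```javascript
// Decide whether a transcript plausibly belongs to its assigned gene.
// Positions are plain objects like { chr: "2", strand: "+", start: 204732511 }.

// Chromosomes X and Y are treated as the same chromosome
function sameChromosome(a, b) {
  var sex = { X: true, Y: true };
  return a === b || Boolean(sex[a] && sex[b]);
}

function keepTranscript(transcript, refseqGene, maxDistance) {
  maxDistance = maxDistance || 100000; // 100kb, a somewhat arbitrary cutoff
  if (!sameChromosome(transcript.chr, refseqGene.chr)) return false;
  if (transcript.strand !== refseqGene.strand) return false;
  return Math.abs(transcript.start - refseqGene.start) <= maxDistance;
}
```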

Having dealt with conflicts between the positions of transcripts within a data source, we turn our attention to comparisons across all of the data sources. We want to ascertain when a gene model from one or more sources is on a different chromosome or strand, or more than 1Mb distant (again, rather an arbitrary value for the distance, but sufficiently high that we are very unlikely to dismiss an accurate transcript). To start with, we need to decide on which sources we trust more than others (the discussion here relates to 4 sources, but the arguments are applicable to any sources). Ensembl and UCSC are comprehensive, but are largely automated; CCDS provides reliable positions for CDS regions but, by definition, no UTRs; and Vega is high-quality, hand-curated data. Using this information, for each gene with transcripts at conflicting positions, we apply the following logic to all of that gene’s transcripts:

  • If there is a single CCDS or Vega transcript, consider that to be the most likely position (termed the ‘tentative’ position), unless there are both CCDS and Vega transcripts, and they have conflicting positions.
  • For all sources, across all transcripts, count how many support each conflicting position.
  • If there are three or more conflicting positions, add the counts for all but the most supported position (MSP) together, so that we have two counts for comparison.
  • If the MSP is only supported by one transcript, mark all of the transcripts as ‘undecided’, as there is insufficient evidence to automatically resolve the conflict.
  • If the MSP has multiple transcripts, then compare it to the number of transcripts supporting other positions; if the ratio of the numbers is above a certain value, mark the MSP as ‘accepted’ and the unsupported transcripts as ‘rejected’; otherwise, mark all transcripts as ‘undecided’.
  • The ratio in the previous step depends on whether a tentative position from CCDS or Vega exists. If not, the ratio is 2:1, i.e. the number of transcripts for the MSP must be at least twice the number of all other transcripts. If the tentative position exists, and agrees with the MSP, the ratio is 1:1 (a simple majority); in the case of disagreement, the ratio is 3:1 (a lot of support is required to outweigh a hand-curated position).
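The decision logic above can be sketched roughly like this (a hypothetical JavaScript rendition; the production code is Perl, and names here are invented for illustration):

```javascript
// counts maps candidate position labels to the number of supporting
// transcripts, e.g. { "chr2:+": 5, "chr7:-": 1 }; tentative is an optional
// position label from a single CCDS or Vega transcript. Returns the status
// of the most supported position (MSP): if 'accepted', the caller marks the
// MSP's transcripts as accepted and the others as rejected.
function resolveConflict(counts, tentative) {
  var positions = Object.keys(counts);

  // Find the most supported position (MSP)
  var msp = positions[0];
  positions.forEach(function (p) {
    if (counts[p] > counts[msp]) msp = p;
  });

  // Pool the counts for every position other than the MSP
  var others = 0;
  positions.forEach(function (p) {
    if (p !== msp) others += counts[p];
  });

  // A single supporting transcript is insufficient evidence
  if (counts[msp] === 1) return "undecided";

  // Required ratio of MSP support to other support: 2:1 by default,
  // 1:1 (simple majority) if the tentative position agrees with the MSP,
  // 3:1 if it disagrees (hand-curated positions take a lot to outweigh)
  var ratio = 2;
  if (tentative) ratio = (tentative === msp) ? 1 : 3;
  return counts[msp] >= ratio * others ? "accepted" : "undecided";
}
```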

Once this process is complete, there are then three groups of transcripts, those that can be confidently used or ignored (i.e. the ones marked as ‘accepted’ or ‘rejected’, respectively), and those that require caution (the ‘undecided’ set). This information is loaded into the Bio::DB::GFF database, to enable useful queries and further data modification. The generate_gene_models.pl script automates the entire process of building gene models, from downloading the data from different sources, to evaluating when they do and do not agree.

Summary and Consensus Gene Models
As mentioned above, summary and consensus gene models are not shown on the current version of T1DBase, but they can be viewed on the archive version of T1DBase, e.g. for CTLA4. So, having collated data on transcripts from multiple sources that are (approximately) in agreement, it is then useful to examine how much variation there is in the detail of those transcripts; for example, which exons are best supported. In ‘summary gene models’, the number of transcripts that support each base is plotted as a bar chart (Figure 2, top panel). Transcripts from different sources that have almost identical CDS regions (i.e. within 1 base in either direction) are shown as ‘consensus gene models’ (Figure 2, bottom panel); the bounds of the UTRs are allowed to vary somewhat, proportional to the length of the gene. Hand-curated sources are considered sufficiently reliable to warrant a consensus gene model of their own. The summary and consensus models are calculated with Perl modules (see next section) that require some other T1DBase modules and configuration files. But the code is well commented and fairly generic, so I think only minor tweaks would be necessary to use it elsewhere; please let me know if you would like help in doing so.
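The near-identity test for CDS regions might be sketched like this (again a hypothetical JavaScript illustration, not the Perl module itself):

```javascript
// Two transcripts join the same consensus model if they have the same
// number of CDS exons and every CDS boundary matches to within one base.
// cdsA and cdsB are arrays of [start, end] pairs, one per CDS exon.
function cdsNearIdentical(cdsA, cdsB) {
  if (cdsA.length !== cdsB.length) return false;
  return cdsA.every(function (exon, i) {
    return Math.abs(exon[0] - cdsB[i][0]) <= 1 &&
           Math.abs(exon[1] - cdsB[i][1]) <= 1;
  });
}
```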

Figure 2. Summary and consensus gene models for CTLA4. Exons and introns are shown by boxes and connecting lines, respectively. Darker shades of blue indicate greater support than lighter shades. The green box displays the gene span, which shows the bounds of the gene according to NCBI RefSeq data.

Gene Models Code
All of the source code behind T1DBase is available under the GNU GPL (see the website for details), from a SourceForge Subversion repository. The gene models Perl scripts are still under active development, but the associated GeneModel Perl modules that I wrote are no longer being used. I have included a copy of the GeneModel modules on my website, plus versions of the scripts that I have modified to work with a Windows installation of MySQL and a slightly different set of data sources than the current T1DBase site. As mentioned above, the GeneModel modules are integrated into the site, and won’t work out of the box; please let me know if you want to use any of the functionality, and I would be delighted to help out.

References

  • Burren OS, Adlem EC, Achuthan P, Christensen M, Coulson RMR, Todd JA. 2011. T1DBase: update 2011, organization and presentation of large-scale data sets for type 1 diabetes research. Nucleic Acids Research 39(Database issue): D997-D1001. PubMed: 20937630
  • Hulbert EM, Smink LJ, Adlem EC, Allen JE, Burdick DB, Burren OS, Cassen VM, Cavnor CC, Dolman GE, Flamez D et al. 2007. T1DBase: integration and presentation of complex data for type 1 diabetes research. Nucleic Acids Research 35(Database issue): D742-746. PubMed: 17169983

Book Battle: Fathers and Sons vs Eve Green

I recently read Eve Green, by Susan Fletcher, on the strength of it garnering glowing reviews and having won a major award. I was utterly underwhelmed, and looked to Amazon to see what real people thought of it. Opinion is divided: it’s either a beautiful, mysterious evocation of Wales or a dull trudge through an unlikeable, self-involved character’s tedious past. I fall squarely in the latter camp, but I wondered if that was partly because I was so smitten with Fathers and Sons, by Ivan Turgenev, which I read immediately before.

So, I’m pitting the two books against each other, but I decided that it wasn’t fair to choose the battleground myself; I want a good clean fight, here. Being the geek that I am, I had no trouble knocking up a simple random word generator, to decide on the categories on which each book shall be judged. I used the ‘All Adjectives’ word list to generate 5 random categories.
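A generator of the kind used here needs only a few lines; a minimal sketch (assuming the word list has already been loaded into an array):

```javascript
// Pick n distinct random words from a list, e.g. to use as judging categories.
function randomWords(words, n) {
  var pool = words.slice(); // copy, so the original list is untouched
  var picked = [];
  while (picked.length < n && pool.length > 0) {
    var i = Math.floor(Math.random() * pool.length);
    picked.push(pool.splice(i, 1)[0]); // remove from pool to avoid repeats
  }
  return picked;
}
```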

Round 1: Clear
A tricky start for Eve Green, as it comes out flailing with some oblique references and flowery language; for some people these are its strengths, but Fathers and Sons lands a stinging blow with a title that tells you exactly what to expect, followed up by a flurry of descriptive passages that are the epitome of clarity. Verdict: Fathers and Sons.

Round 2: Immense
Interpreting ‘immense’ literally, both books move on the defensive, as neither will break a toe if dropped on a foot. A few cagey jabs later, and this damp squib of a round is over. Verdict: Draw.

Round 3: Fluffy
Fathers and Sons is on the ropes, reeling from an unprecedented attack by a cuddly toy dog from Eve Green, but it rallies towards the end of the round as Eve Green‘s darker heart asserts itself. The spectre of death haunts both of these distinctly un-fluffy novels, and it’s another tied round. Verdict: Draw.

Round 4: Curious
After a quiet couple of rounds Fathers and Sons gradually builds up a strong sequence of curious punches: smack – inter-generational dynamics; smack – our place in the universe; smack – frustrated desire. Eve Green is curious about human nature on a smaller scale, and counters with a few hits of loneliness and the nature of evil, but now looks like a broken book. Verdict: Fathers and Sons.

Round 5: Wandering
The episodic nature of Fathers and Sons comes out swinging in this round, but its attack weakens as it becomes clear that the trajectory of Bazarov’s fate has been far from aimless. Eve Green takes advantage with a few time-travelling blows, finishing with a pointless and devastating granny’s-dead-Cornish-sailor of an uppercut. Verdict: Eve Green.

The winner: Fathers and Sons. A victory for both literature and websites with random word generators.

T1DBase: type 1 diabetes, and my part in its downfall

Apropos of a new T1DBase publication (Burren et al. 2011) (in which I am kindly acknowledged), I thought I’d write a bit about some of the work I did there (Hulbert et al. 2007). I envisage this being the first of maybe three instalments, so before going into detail about the specific projects that I worked on, I’ll explain what T1DBase actually is, and why I’m proud to have worked on the project. For your reading convenience, this post is available as a PDF.

T1DBase is a resource for the type 1 diabetes (T1D) research community, and it has strong ties to the JDRF/WT Diabetes and Inflammation Laboratory (DIL) in Cambridge, which is headed up by John Todd. (When I worked at the DIL we collaborated with the ISB and a group at UPenn, but this is no longer the case.) Type 1 diabetes is an auto-immune disease that primarily manifests in childhood, and so was formerly known as juvenile diabetes. The symptoms are similar to those of type 2 diabetes, but the aetiology is quite different (Todd 2010), and type 1 diabetes is genetically more similar to diseases like rheumatoid arthritis and coeliac disease (Smyth et al. 2008).

I worked on T1DBase for three years, from Jan 2006 to Dec 2008, which was a period of massive change in our understanding of the genetics of type 1 diabetes, primarily due to the emergence of genome-wide association studies (GWAS). The DIL was heavily involved in one of the first landmark studies (Todd et al. 2007; Wellcome Trust Case Control Consortium 2007), as part of the WTCCC (Wellcome Trust Case Control Consortium; don’t worry, I think that’s the last of the acronyms). Results from that and subsequent GWAS (e.g. Cooper et al. 2008; Barrett et al. 2009) generated a host of new T1D susceptibility regions, and a better (although still far-from-complete) appreciation of the genetics of this complex disease. (I’ve cited GWAS publications that I was involved in, or that were written by colleagues at the DIL, but T1DBase also gets data from a range of other sources; see the website for more information.)

The people behind T1DBase curate the GWAS results, and make them available as raw data and, more usefully, as region summaries that tie to analyses of genes and variants (i.e. SNPs), as well as cross-referencing with mouse and rat data. It sounds so simple when you write a sentence like that, but there are, of course, very many challenges involved, both in terms of making sense of a huge amount of biological data, and in working out how to effectively present the results. And that’s not to mention the day-to-day work of maintaining a website, and programming collaboratively and efficiently. I very much enjoyed working on the T1DBase project; I learnt loads, both about disease genetics and programming, and it was always a fun environment to work in (with regular tea breaks, too…). And it was nice to be in a job where, in some small way, I was able to constructively contribute to important and useful research into type 1 diabetes.

References

  • Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, Erlich HA, Julier C, Morahan G, Nerup J, Nierras C et al. 2009. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nature Genetics 41(6): 703-707. PubMed: 19430480
  • Burren OS, Adlem EC, Achuthan P, Christensen M, Coulson RMR, Todd JA. 2011. T1DBase: update 2011, organization and presentation of large-scale data sets for type 1 diabetes research. Nucleic Acids Research 39(Database issue): D997-D1001. PubMed: 20937630
  • Cooper JD, Smyth DJ, Smiles AM, Plagnol V, Walker NM, Allen JE, Downes K, Barrett JC, Healy BC, Mychaleckyj JC et al. 2008. Meta-analysis of genome-wide association study data identifies additional type 1 diabetes risk loci. Nature Genetics 40(12): 1399-1401. PubMed: 18978792
  • Hulbert EM, Smink LJ, Adlem EC, Allen JE, Burdick DB, Burren OS, Cassen VM, Cavnor CC, Dolman GE, Flamez D et al. 2007. T1DBase: integration and presentation of complex data for type 1 diabetes research. Nucleic Acids Research 35(Database issue): D742-746. PubMed: 17169983
  • Smyth DJ, Plagnol V, Walker NM, Cooper JD, Downes K, Yang JHM, Howson JMM, Stevens H, McManus R, Wijmenga C et al. 2008. Shared and distinct genetic variants in type 1 diabetes and celiac disease. The New England Journal of Medicine 359(26): 2767-2777. PubMed: 19073967
  • Todd JA. 2010. Etiology of type 1 diabetes. Immunity 32(4): 457-467. PubMed: 20412756
  • Todd JA, Walker NM, Cooper JD, Smyth DJ, Downes K, Plagnol V, Bailey R, Nejentsev S, Field SF, Payne F et al. 2007. Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nature Genetics 39(7): 857-864. PubMed: 17554260
  • Wellcome Trust Case Control Consortium. 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145): 661-678. PubMed: 17554300

Countdown Timer – JavaScript

The JavaScript demonstrated here (countdown.js) can count down to a particular date. If the date is annually recurring (e.g. a birthday), then it’ll count down to the next occurrence; if it’s a specific year (e.g. a holiday, in the vacation sense of the word), when the date has passed it’ll show the time elapsed since that date. It counts down to a fraction of a second after midnight on the given day (JavaScript gets its time information from the client’s clock).

This isn’t particularly novel, and much of the code is copied, merged, and adapted from other similar scripts on the web (none of which did exactly what I wanted). I haven’t seen another script that takes a date and automatically works out whether it’s annually recurring, and if not, whether to count down or up, but I daresay it’s been done many times over. The script is heavily dependent on giving it a date in the right format, “Month Day_of_Month[, Year]”, where Month should be specified as text to avoid any confusion about the order of date components. Not an issue with a list box, but if you let users enter data in text boxes you’ll have to do a bunch of checking and formatting, which is altogether too tedious for me to have bothered with.

I don’t think the script needs much commentary – it works out when the day is in relation to today and does some simple maths to display that information in a human-readable format. JavaScript works in milliseconds, which is why we divide by 1000 in various places. The script is actually pretty wordy because I find JavaScript counter-intuitive, and tend towards clarity rather than brevity; but if you prefer the latter it’d be easy to condense it.
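The core of the calculation can be sketched as follows (a simplified, hypothetical version; the real countdown.js also works out whether the date is annually recurring):

```javascript
// Break down the difference between now and a target date into days, hours,
// minutes, and seconds. If the target is in the past, the same parts are
// reported as time elapsed rather than time remaining.
function timeParts(targetMs, nowMs) {
  var diff = Math.abs(targetMs - nowMs);
  var secs = Math.floor(diff / 1000); // JavaScript works in milliseconds
  return {
    elapsed: targetMs < nowMs,
    days: Math.floor(secs / 86400),
    hours: Math.floor(secs / 3600) % 24,
    minutes: Math.floor(secs / 60) % 60,
    seconds: secs % 60
  };
}
```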

It’s easy to add a little pizazz to the countdown by displaying a picture relevant to the date selected: just change the src of the image when the user selects from the list box.