Tidying up edgeR and differential expression

For anyone who has done differential expression analysis on RNA array or sequencing datasets in R, edgeR and DESeq2 are the go-to packages! They contain powerful functions and models which cater for most uses.

As with most bioinformatics packages, using them tends to produce scripts with endless lines of code and square brackets aplenty. Not only does this make scripts difficult to read, it can also be awkward for R to handle in some respects. Quite often these packages use custom objects, which makes changing formats and passing data into other packages a pain.

Well, if you do lots of differential expression, here are some wrappers for you! Taking inspiration from the FinalFit and tidyverse approach, I have written a few wrappers for differential expression analyses to make your code and life happier.

Please note: throughout, I use count data, as this is what edgeR likes, rather than FPKM/RPKM or RSEM values.

Step 1 – Taking the first steps to expression happiness

So, first up: preparing and filtering your data. This function takes your data and any clinical/sample data, wraps it all up into a DGEList object, then filters it. I like to filter based on proportions of lowly expressed transcripts, as purely filtering on arbitrary CPM values has its own issues, particularly if your read depth is low.

tidy_dge() is a function which does all of this! It combines two functions: dge_bind(), which sets up your DGEList based on two data frames, and tidy_gene_filter(), which then filters this object and calculates normalisation values.

Combining these into the single function tidy_dge() shrinks about 10-20 lines of code into one.
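To make the proportion-based filtering idea concrete, here is a minimal sketch in base R. It is not the code inside tidy_gene_filter(): the helper name filter_by_proportion() and its arguments min_cpm and min_prop are made up for illustration, and the counts are toy data. It keeps a gene only if its CPM is above a threshold in at least a given proportion of samples.

```r
# Hypothetical helper (NOT the post's tidy_gene_filter()): keep genes whose
# CPM exceeds `min_cpm` in at least `min_prop` of the samples.
filter_by_proportion <- function(counts, min_cpm = 1, min_prop = 0.5) {
  counts <- as.matrix(counts)
  lib_sizes <- colSums(counts)
  # Counts per million: divide each column by its library size, scale by 1e6
  cpm <- sweep(counts, 2, lib_sizes, "/") * 1e6
  # Proportion of samples in which each gene passes the CPM threshold
  prop_expressed <- rowMeans(cpm > min_cpm)
  counts[prop_expressed >= min_prop, , drop = FALSE]
}

# Toy data: 4 genes x 4 samples (library sizes are tiny, so any non-zero
# count clears a CPM of 1; real data would behave differently)
counts <- matrix(
  c(500, 480, 510, 495,
      0,   1,   0,   0,
    250, 260,   0,   0,
      0,   0,   0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4))
)

filtered <- filter_by_proportion(counts, min_cpm = 1, min_prop = 0.5)
rownames(filtered)
```

In this toy example gene2 and gene4 are detected in only one of four samples, so they are dropped, while gene3 (detected in half the samples) is kept.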

Step 2 – Translating annotations

Next up: convert_gene(). Switching between gene annotation systems (e.g. ENSEMBL, HUGO, ENTREZ) is a pain. We all know that. The package biomaRt tries to solve this a bit, but we’re still left with somewhat unwieldy functions and lines of code trying to pick out where genes sit and what to map them back to. For this, I have devised convert_gene().

convert_gene() passes a request to biomaRt (the ENSEMBL API) and will convert any (common) gene name to any other gene annotation. It takes a little while to contact the servers, but is well worth it! Alternatively, you will be able to pass a saved biomaRt output to it, so you only need to make one pull, which speeds things up! If we are to do this, we should create three separate marts: one containing genes, one containing genes and GO codes, and one containing the GO codes and processes. I have written some functions to do this.

The reason for this: in testing, some of the marts had errors when pulled with genes and GO codes all at once! The ENSEMBL biomaRt API is also somewhat unstable and frequently produces 404 errors and lots of downtime. Plus, we can then use the mart we created for genes in convert_gene().
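Given how flaky the servers can be, one way to make a single pull and reuse it is to retry on failure and cache the result on disk. The sketch below is base R only and rests on assumptions: cached_pull() is a hypothetical helper, not one of my wrappers, and flaky_fetch() is a stand-in that simulates a server failing twice before returning a small toy lookup table. In practice you would pass a function that wraps your own biomaRt query.

```r
# Hypothetical helper: run `fetch_fun` with retries and cache the result
# as an .rds file, so later calls read from disk instead of the server.
cached_pull <- function(fetch_fun, cache_file, max_tries = 3, wait = 0) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  for (attempt in seq_len(max_tries)) {
    # Capture the error object instead of letting it stop the loop
    result <- tryCatch(fetch_fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      saveRDS(result, cache_file)
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    Sys.sleep(wait)
  }
  stop("All ", max_tries, " attempts failed; no cache written.")
}

# Stand-in for an unstable server: fails twice, then returns toy data
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("HTTP 404")
  data.frame(
    ensembl_gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
    mgi_symbol = c("Gnai3", "Cdc45"),
    stringsAsFactors = FALSE
  )
}

cache_file <- tempfile(fileext = ".rds")
gene_mart <- cached_pull(flaky_fetch, cache_file)        # retries, then caches
gene_mart_again <- cached_pull(flaky_fetch, cache_file)  # read from disk
```

The second call never touches the fetch function, which is the point: one successful pull per mart, saved and reused.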

Back to convert_gene()! It is designed to work with a glm object straight out of edgeR or with a data frame of counts, but will detect if it’s another type of input. Just be careful, as the default species is currently mouse.
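Once a gene mart has been saved, the conversion itself boils down to a table lookup. Here is a minimal sketch of that step in base R, assuming a lookup data frame with one column per annotation system. map_ids() is a hypothetical helper and is not how convert_gene() is written; the results table is toy data, with one deliberately unmappable ID.

```r
# Hypothetical helper (NOT the post's convert_gene()): translate IDs using
# a saved lookup table with one column per annotation system.
map_ids <- function(ids, lookup, from, to) {
  # match() returns NA for IDs missing from the lookup, so they map to NA
  lookup[[to]][match(ids, lookup[[from]])]
}

# Toy lookup table, standing in for a saved mouse gene mart
lookup <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
  mgi_symbol = c("Gnai3", "Cdc45"),
  stringsAsFactors = FALSE
)

# Toy results table; the last ID is deliberately not in the lookup
results <- data.frame(
  gene = c("ENSMUSG00000000028", "ENSMUSG00000000001", "ENSMUSG99999999999"),
  logFC = c(1.2, -0.8, 0.1),
  stringsAsFactors = FALSE
)

results$symbol <- map_ids(results$gene, lookup, "ensembl_gene_id", "mgi_symbol")
results
```

Unmapped genes come back as NA rather than being silently dropped, which makes it easy to check how many IDs failed to convert (for instance because the wrong species was used).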

GO – The final step to expression fulfillment

Finally, those GO analyses. How on earth do you then trace back and find which genes in which pathways are up-regulated? I’ve never been able to find a nice function to do this.

Fear not, we have a new one: gene_GO_explorer(). This is very handy indeed. It takes your glm object from edgeR and maps the genes to GO processes and codes.

There are some options wrapped up in this function for dissecting pathways out too: go_grep, go_code, go_pathway and go_components. These allow you to supply a list of GO processes or GO codes, or to search the GO annotations for pathways of interest. For example, if your GO enrichment shows activity for a specific GO pathway, you can pass its GO code to go_code to get the genes involved in that pathway. For searching GO pathway annotations, we can use go_grep. Here, if we set go_grep to ‘stress’, gene_GO_explorer() will search for and return any pathways with the term ‘stress’ in their description.

If genes are part of multiple pathways that are included in our search, the GO terms will be collapsed together, separated by a semicolon. This data can be displayed in a ‘long’ format if required by specifying long = TRUE.
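To illustrate the difference between the collapsed and ‘long’ outputs, here is a rough base R sketch on a toy gene-to-GO table. It only mimics the behaviour described above and is not the internals of gene_GO_explorer(): the gene names and gene-to-term assignments are made up, and the "; " separator is my assumption about the exact formatting.

```r
# Toy gene-to-GO table in 'long' format: one row per gene/term pair.
# The gene-to-term assignments are made up for illustration.
go_long <- data.frame(
  gene = c("GeneA", "GeneA", "GeneB", "GeneC"),
  go_id = c("GO:0006950", "GO:0006979", "GO:0006950", "GO:0007049"),
  go_term = c("response to stress", "response to oxidative stress",
              "response to stress", "cell cycle"),
  stringsAsFactors = FALSE
)

# Search the GO descriptions for a term of interest (like go_grep = 'stress')
hits <- go_long[grepl("stress", go_long$go_term, ignore.case = TRUE), ]

# Collapse to one row per gene, joining multiple GO codes and terms
collapsed <- aggregate(
  hits[c("go_id", "go_term")],
  by = list(gene = hits$gene),
  FUN = paste, collapse = "; "
)
collapsed
```

Here `hits` is the long version (three rows, with GeneA appearing twice) and `collapsed` is the one-row-per-gene version, where GeneA carries both stress-related GO codes in a single cell.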

Note: unlike goana, gene_GO_explorer() does not mind what type of gene annotation is supplied! I use convert_gene() to prepare my glms for goana, then pop them through gene_GO_explorer().

If you’re not using a glm from edgeR, either switch off the p-value adjustment in gene_GO_explorer() using adjust_p = FALSE, or name the column containing your p-values ‘PValues’.
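If your results come from elsewhere, preparing the table might look something like the sketch below. This is an assumption-laden example on toy data: I am assuming a Benjamini-Hochberg adjustment here (the default in edgeR’s topTags()), which may or may not be what you want for your own data, and the column names other than ‘PValues’ are placeholders.

```r
# Toy results table from some non-edgeR analysis (made-up values)
results <- data.frame(
  gene = c("GeneA", "GeneB", "GeneC", "GeneD"),
  p = c(0.001, 0.02, 0.03, 0.8),
  stringsAsFactors = FALSE
)

# Rename the p-value column to 'PValues'
names(results)[names(results) == "p"] <- "PValues"

# Benjamini-Hochberg adjustment (an assumption; pick the method you need)
results$FDR <- p.adjust(results$PValues, method = "BH")
results
```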

I’ll be developing more of this sort of thing, in addition to some nicer wrappers for preparing your sequencing data straight from FASTQ formats, making user-readable edgeR models and then querying protein interaction databases too! Comments/thoughts appreciated if people find this kind of stuff useful. I’m considering making a package. FinalOmics anyone?
