Reordering factor levels in R and what could go wrong

This post was originally published here

I’ve recently started using ggplot2 in addition to lattice (see this post that I made a while ago, explaining how I got into using lattice in the first place). Hint: when using ggplot2, you’ll need to use the reshape2 package (also written by the amazing Hadley Wickham) to get your data into a form that ggplot2 works best with. Another thing you’ll want to think about when using ggplot2 is factor levels. This post will show how to (and also how not to) rearrange factor levels in R.

Let’s create a quick barplot with strings as x labels.


# create dummy data
library(ggplot2)
a = paste('my', 1:11)
b = 1:11
df = data.frame(a, b)
qplot(a, b, data=df, geom='bar', stat='identity') +
  theme(axis.text=element_text(size=16, angle=45))


As df$a is an array of strings, R sets the factor levels alphabetically: my 1, my 10, my 11, my 2…which is not what we want, so let’s rearrange factor levels:

df$a = factor(df$a, levels = paste('my', 1:11))
qplot(a, b, data=df, geom='bar', stat='identity') +
  theme(axis.text=element_text(size=16, angle=45))


And finally, the wrong way to rearrange factor levels would be by using the levels() function:

df = data.frame(a, b)
levels(df$a) = paste('my', 1:11)
qplot(a, b, data=df, geom='bar', stat='identity') +
  theme(axis.text=element_text(size=16, angle=45))


So be careful – levels() simply relabels the existing (alphabetically ordered) levels in place, without reordering the underlying data. If your data is not as obvious as this example and you are a bit new to factors and levels, you might end up plotting wrong results (as in the last example, where “my 2” and “my 3” were plotted with the values 10 and 11).

LaTeX tables: column widths and alignments

This post was originally published here

Firstly, start off your table in Tables Generator. It will do a lot for you: its most useful features are importing from .csv and merging cells. The Booktabs table style (an alternative to the default table style in the menu) looks a bit nicer and is “publication quality”. Note that publication-quality tables should not contain vertical lines.

Screen shot of Tables Generator


Code #1 is the code from Tables Generator with the addition of a caption, a label and the LaTeX document begin/end (so it’s compilable). Continuing from that table, let’s centre the contents of columns 1–3 and the whole table in your document, by adding \centering and changing the column specs from l’s to c’s: Code #2.
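The code listings themselves didn’t survive the reposting; a minimal sketch of roughly what Code #2 looks like (the caption, label and cell contents are my own placeholders):

```latex
\documentclass{article}
\usepackage{booktabs}

\begin{document}

\begin{table}[h]
\centering % centre the table itself on the page
\caption{My example table}
\label{tab:example}
% c's instead of l's: centre the contents of each column
\begin{tabular}{ccc}
\toprule
Column 1 & Column 2 & Column 3 \\
\midrule
a & b & c \\
d & e & f \\
\bottomrule
\end{tabular}
\end{table}

\end{document}
```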


Finally, if your cell contents are long and need wrapping:

Code #3

Note that if your table is too wide for your document margins, LaTeX issues a warning, not an error, so you need to check for warnings like “Overfull \hbox (9.75735pt too wide) in paragraph at lines 55–63” in your compilation log. A quick solution to wide cells is like this (Code #4):



But this solution does not give decent central alignment. Using m (so m{2cm} instead of p{2cm}) would do the vertical centring (e.g. look how the first row is aligned), but still not horizontal. So, following this StackOverflow post, I started defining column types and widths using the array package. See Code #5.
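Again, the listing is missing; a sketch of the array-package approach (the column-type name P and the 2cm width are my assumptions):

```latex
\documentclass{article}
\usepackage{array}

% P{<width>}: fixed-width column whose contents are both
% horizontally centred (\centering) and vertically centred (m{...})
\newcolumntype{P}[1]{>{\centering\arraybackslash}m{#1}}

\begin{document}

\begin{tabular}{P{2cm}P{2cm}}
some long cell contents that will wrap & short \\
\end{tabular}

\end{document}
```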



Next time I might write a post on how to add extra space between lines.

Why does a linear model without an intercept (forced through the origin) have a higher R-squared value? [calculated by R]

This post was originally published here

This is a short note based on this.

Answer in short: Because different formulas are used to calculate the R-squared of a linear regression, depending on whether it has an intercept or not.

R² for a linear model that has an intercept:

R² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)²,

where y is the variable that the linear model is trying to predict (the response variable), ŷᵢ is the predicted value and ȳ is the mean value of the response variable.

And the R² for a linear model that is forced through the origin:

R²₀ = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ yᵢ²,

basically, the mean value of the response variable is removed from the equation, making the denominator bigger (and the result of the division smaller, hence the reported R² larger). The reason the mean cannot be used in this calculation is that it no longer makes sense – forcing the fit through zero is a bit like adding an infinite number of (0,0) points to the dataset.

This means that the R-squared values of two different linear models (one with an intercept, one without) cannot really be compared: when the intercept is quite small compared to the residuals (basically the numerator), the R² “advantage” that the through-origin regression gets from its bigger denominator outweighs the decrease in residuals gained by including the intercept.
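A quick numeric sketch of the two formulas (plain Python rather than R, with made-up numbers, not a real fitted model):

```python
# Compare R-squared computed with and without the mean in the denominator,
# as R does for models with and without an intercept.
y    = [1.0, 2.0, 3.0, 4.0]      # response variable
yhat = [1.1, 1.9, 3.2, 3.8]      # predictions from some fitted model

sse  = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # residual sum of squares
ybar = sum(y) / len(y)

r2_intercept = 1 - sse / sum((yi - ybar) ** 2 for yi in y)  # denominator uses the mean
r2_origin    = 1 - sse / sum(yi ** 2 for yi in y)           # mean removed: bigger denominator

print(r2_intercept)  # ~0.98
print(r2_origin)     # ~0.9967 -- larger, despite identical residuals
```

Same residuals, yet the through-origin formula reports a higher R², purely because its denominator is larger.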

Symbolic links and 2 common errors with them

This post was originally published here

I don’t know if it’s good or bad, but I like it when the files I’m working with are in the working directory (so instead of using pathnames to my files, I can just type filename or ./filename). But to avoid copying data and wasting space, symbolic links are the way to go. The command for that is:

ln -s target_file sym_link,

where -s stands for “symbolic” (plain ln would create a hard link).

However, if you are not a complete UNIX guru, then trying to access your linked files is likely to produce one of these errors:

No such file or directory OR Too many levels of symbolic links

The solution to both of these is to always use full paths to the files and their symbolic links (ln -s /home/folder/file.txt /home/folder2/file.txt). For further information, see this and this. Apparently you can have 32 levels of symbolic links, so getting “Too many levels of symbolic links” right after creating just one means that there is some serious recursion going on (typically a link that ends up pointing at itself).
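A minimal sketch of the first error (the /tmp paths are my own example):

```shell
# A relative target is stored verbatim in the link and resolved
# relative to the link's own directory -- not where you ran ln.
mkdir -p /tmp/symdemo/a /tmp/symdemo/b
echo "hello" > /tmp/symdemo/a/file.txt
cd /tmp/symdemo/a

# Relative target: the link in b/ looks for b/file.txt -> broken
ln -sf file.txt /tmp/symdemo/b/broken.txt
cat /tmp/symdemo/b/broken.txt   # cat: ...: No such file or directory

# Absolute target: always resolves
ln -sf /tmp/symdemo/a/file.txt /tmp/symdemo/b/ok.txt
cat /tmp/symdemo/b/ok.txt       # hello
```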

Remove symbolic links just as you remove files:

rm sym_link

Saving some variables from a netCDF to a new file

This post was originally published here

The NCO (netCDF Operators) command for this is ncks (netCDF Kitchen Sink).

From the documentation:

The nickname “kitchen sink” is a catch-all because ncks combines most features of ncdump and nccopy with extra features to extract, hyperslab, multi-slab, sub-set, and translate into one versatile utility. ncks extracts (a subset of the) data from input-file and writes (or pastes) it in netCDF format to output-file, and optionally writes it in flat binary format to binary-file, and optionally prints it to screen.


ncks extracts (and optionally creates a new netCDF file comprised of) only selected variables from the input file (similar to the old ncextr specification). Only variables and coordinates may be specifically included or excluded—all global attributes and any attribute associated with an extracted variable are copied to the screen and/or output netCDF file.

The flag for extracting variables is -v (followed by variable name(s) separated by commas):

ncks -v var1,var2

(no space after the comma!)

In case you’ve forgotten what the names of your variables are, do:

ncdump -h

-h prints headers only (and not the values). I usually direct the output of ncdump to a text file:

ncdump -h > ncdump.txt

Also, if you forgot some of the variables that you wanted then you don’t have to do the whole list again – NCO is always willing to append variables. So if you run:

ncks -v var3

but the output file already exists, then NCO will prompt you with this:

ncks: exists—`e'xit, `o'verwrite (i.e., delete existing file), or `a'ppend (i.e., replace duplicate variables in and add new variables to existing file) (e/o/a)?

So you can enter a and hit ‘return’.

My bash aliases

This post was originally published here

If you find yourself always using some commands with the same flags, then it would make sense to define them as aliases, by putting them into your .bashrc file like this (log out and back in for it to take effect):

# .bashrc

# Put user specific aliases and functions here
alias ls='ls -al --color=auto'
alias qstat='qstat -a'
alias qsub='qsub -m abe -M'

alias disk="du * -sh | sort -h"

-a for ls shows hidden files (files that start with a dot, like .bashrc) and -l displays more information than just the file/folder names (permissions for example).

--color=auto colours folders, executables and symbolic links.

-a for qstat displays more information.

Both -m and -M for qsub mean messages. For -m:

b – Mail is sent at the beginning of the job.

e – Mail is sent at the end of the job.

a – Mail is sent when the job is aborted or rescheduled.

And -M is the flag before the email address(es).

The last one (I call it disk) displays the sizes of one level of subfolders and orders them too (correct ordering is done by the really cool -h option, as opposed to the numeric sort -n, which only looks at the leading number and so considers 1.4G and 1.4T equal).
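A quick illustration of -h versus -n (the sizes are made up):

```shell
# sort -h understands human-readable size suffixes; sort -n only
# parses the leading number and ignores the unit entirely.
printf '1.4G\n2.0M\n1.4T\n' | sort -h   # 2.0M, 1.4G, 1.4T
printf '1.4G\n2.0M\n1.4T\n' | sort -n   # 1.4G, 1.4T, 2.0M (wrong)
```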

Add up two variables of a netCDF file

This post was originally published here

NCO:ncap2 is the function to do it:

ncap2 -s 'new_var=var1+var2'

The output file will have all of the variables that exist in the input file as well as the new_var. Add -O if your input and output files are the same (overwrite).

The -s stands for “script” (an inline ncap2 script statement; the long forms are --spt and --script).

BUT the new_var will have the same long_name as the first variable used for summing (i.e. it could make some things a bit confusing). To change it, use a very complicated (but allegedly also very powerful) NCO:ncatted. Fortunately, its documentation has just the right example:

Change the value of the long_name attribute for variable T from whatever it currently is to “temperature”:

ncatted -a long_name,T,o,c,temperature