TLDR: there are two new and very intuitive R functions for reshaping data: see Examples of
pivot_wider() below. At the time of writing, these new functions are extremely fresh and only exist in the development version on GitHub (see Installation), we should probably wait for the tidyverse team to officially release them (in CRAN) before putting them into day-to-day use.
The juxtapose of data collection vs data analysis: data that was very easy to collect, is probably very hard to analyse, and vice versa. For example, if data is collected/written down whichever format was most convenient at the time of data collection, it is probably not recorded in a regularly shaped table, with various bits of information in different parts of the document. And even if data is collected into a table, it is often intuitive (for data entry) to include information about the same variable in different columns. For example, look at this example data I just made up:
library(tidyverse) candydata_raw = read_csv("2019-04-07_candy_preference_data.csv")
|candy_type||likes age: 5||likes age: 10||likes age: 15||gets age: 5||gets age: 10||gets age: 15|
For each candy type, there are 8 columns with values. But actually, these 8 columns capture a combination of 3 variables:
eats. This is known as the wide format, and it is a convenient way to either note down or even present values. It is human-readable. For effective data analysis, however, we need data to be in the tidy data format, where each column is a single variable, and each row a single observation (https://www.jstatsoft.org/article/view/v059i10). It needs to be less human-readable and more computer-friendly.
Some of you may remember now retired
reshape2::dcast(), and many of you (inclduing myself!) have struggled remebering the arguments for
tidyr::spread(). Based on extensive community feedback, the tidyverse team have reinveted these functions using both more intuitive names, as well as clearer syntax (arguments):
Thanks to all 2649 (!!!) people who completed my survey about table shapes! I’ve done analysed the data at https://t.co/hyu1o91xRm and the new functions will be called pivot_longer() and pivot_wider() #rstats
— Hadley Wickham (@hadleywickham) March 24, 2019
These functions were added just a month ago, so these functions are not yet included in the standard version of
tidyr that comes with
install.packages("tidyverse") or even
update.packages() (the current version of
tidyr on CRAN is 0.8.3). To play with the bleeding edge versions of R packages, run
install.packages("devtools") and then
devtools::install_github("tidyverse/tidyr"). If you are a Mac user and it asks you “Do you want to install from sources the package which needs compilation?”, say Yes.
You might need to Restart R (Session menu at the top) and load
library(tidyverse) again. You can check whether you now have these functions installed by typing in
pivot_longer and pressing F1 – if a relevant Help tab pops open you got it.
candydata_longer = candydata_raw %>% pivot_longer(contains("age"))
|Chocolate||likes age: 5||4|
|Chocolate||likes age: 10||6|
|Chocolate||likes age: 15||8|
|Chocolate||gets age: 5||2|
|Chocolate||gets age: 10||4|
|Chocolate||gets age: 15||6|
|Lollipop||likes age: 5||10|
|Lollipop||likes age: 10||8|
|Lollipop||likes age: 15||6|
|Lollipop||gets age: 5||8|
|Lollipop||gets age: 10||6|
|Lollipop||gets age: 15||4|
Now, that’s already a lot better, but we still need to split the
name column into the two different variables it really includes. “name” is what
pivot_longer() calls this new column by default. Remember, each column is a single variable.
candydata_longer = candydata_raw %>% pivot_longer(contains("age")) %>% separate(name, into = c("questions", NA, "age"), convert = TRUE)
pivot_wider() can be used to do the reverse:
candydata = candydata_longer %>% pivot_wider(names_from = questions, values_from = value)
It is important to spell out the arguments here (
values_frame =) since they are not the second and third arguments of
pivot_wider() (like they were in
spread()). Investigate the
pivot_wider+F1 Help tab for more information.
Wrap-up and notes
Now these are datasets we can work with: each column is a variable, each row is an observation.
Do not start replacing working and tested instances of
spread() in your existing R code with these new functions. That is neither efficient nor necessary –
spread() will remain in
tidyr to make sure people’s scripts don’t suddenly stop working. Meaning:
tidyr is backward compatible. But after these functions are officially released, I will start using them in all new scripts I write.
I made the original messy columns still relatively nice to work with – no typos and reasonable delimiters. Usually, the labels are much worse and need the help of
stringr::str_replace(), and multiple iterations of
tidyr::separate() to arrive at a nice tidy tibble/data frame.
into = c("var1", NA, "var2") – now this is an amazing trick I only came across this week! This is a convenient way to drop useless (new) columns. Previously, I would have achieved the same result with:
... %>% separate(..., into = c("var1", "drop", "var2")) %>% select(-drop) %>% ...
convert = TRUE: by default,
separate() creates new variables that are also just “characters”. This means our age would have been a chacter vector of, e.g., “5”, “10”, rather than 5, 10, and R wouldn’t have known how to do arithmetic on it. In this example,
convert = TRUE is equivalent to
mutate(age = as.numeric(age)).
P.S. This is one of the coolest Tweets I’ve ever seen:
— Ian Lyttle (@ijlyttle) April 1, 2019