This post was originally published here

TLDR: there are two new and very intuitive R functions for reshaping data: see Examples of pivot_longer() and pivot_wider() below. At the time of writing, these new functions are extremely fresh and only exist in the development version on GitHub (see Installation), we should probably wait for the tidyverse team to officially release them (in CRAN) before putting them into day-to-day use.

Exciting!

Introduction

The juxtapose of data collection vs data analysis: data that was very easy to collect, is probably very hard to analyse, and vice versa. For example, if data is collected/written down whichever format was most convenient at the time of data collection, it is probably not recorded in a regularly shaped table, with various bits of information in different parts of the document. And even if data is collected into a table, it is often intuitive (for data entry) to include information about the same variable in different columns. For example, look at this example data I just made up:

library(tidyverse)

candydata_raw = read_csv("2019-04-07_candy_preference_data.csv")

candy_type	likes age: 5	likes age: 10	likes age: 15	gets age: 5	gets age: 10	gets age: 15
Chocolate	4	6	8	2	4	6
Lollipop	10	8	6	8	6	4

For each candy type, there are 8 columns with values. But actually, these 8 columns capture a combination of 3 variables: age, likes and eats. This is known as the wide format, and it is a convenient way to either note down or even present values. It is human-readable. For effective data analysis, however, we need data to be in the tidy data format, where each column is a single variable, and each row a single observation (https://www.jstatsoft.org/article/view/v059i10). It needs to be less human-readable and more computer-friendly.

Some of you may remember now retired reshape2::melt() or reshape2::dcast(), and many of you (inclduing myself!) have struggled remebering the arguments for tidyr::gather() and tidyr::spread(). Based on extensive community feedback, the tidyverse team have reinveted these functions using both more intuitive names, as well as clearer syntax (arguments):

Thanks to all 2649 (!!!) people who completed my survey about table shapes! I’ve done analysed the data at https://t.co/hyu1o91xRm and the new functions will be called pivot_longer() and pivot_wider() #rstats

— Hadley Wickham (@hadleywickham) March 24, 2019

Installation

These functions were added just a month ago, so these functions are not yet included in the standard version of tidyr that comes with install.packages("tidyverse") or even update.packages() (the current version of tidyr on CRAN is 0.8.3). To play with the bleeding edge versions of R packages, run install.packages("devtools") and then devtools::install_github("tidyverse/tidyr"). If you are a Mac user and it asks you “Do you want to install from sources the package which needs compilation?”, say Yes.

You might need to Restart R (Session menu at the top) and load library(tidyverse) again. You can check whether you now have these functions installed by typing in pivot_longer and pressing F1 – if a relevant Help tab pops open you got it.

Examples

candydata_longer = candydata_raw %>% 
  pivot_longer(contains("age"))

candy_type	name	value
Chocolate	likes age: 5	4
Chocolate	likes age: 10	6
Chocolate	likes age: 15	8
Chocolate	gets age: 5	2
Chocolate	gets age: 10	4
Chocolate	gets age: 15	6
Lollipop	likes age: 5	10
Lollipop	likes age: 10	8
Lollipop	likes age: 15	6
Lollipop	gets age: 5	8
Lollipop	gets age: 10	6
Lollipop	gets age: 15	4

Now, that’s already a lot better, but we still need to split the name column into the two different variables it really includes. “name” is what pivot_longer() calls this new column by default. Remember, each column is a single variable.

candydata_longer = candydata_raw %>% 
  pivot_longer(contains("age")) %>% 
  separate(name, into = c("questions", NA, "age"), convert = TRUE)

candy_type	questions	age	value
Chocolate	likes	5	4
Chocolate	likes	10	6
Chocolate	likes	15	8
Chocolate	gets	5	2
Chocolate	gets	10	4
Chocolate	gets	15	6
Lollipop	likes	5	10
Lollipop	likes	10	8
Lollipop	likes	15	6
Lollipop	gets	5	8
Lollipop	gets	10	6
Lollipop	gets	15	4

And pivot_wider() can be used to do the reverse:

candydata = candydata_longer %>% 
  pivot_wider(names_from = questions, values_from = value)

candy_type	age	likes	gets
Chocolate	5	4	2
Chocolate	10	6	4
Chocolate	15	8	6
Lollipop	5	10	8
Lollipop	10	8	6
Lollipop	15	6	4

It is important to spell out the arguments here (names_from =, values_frame =) since they are not the second and third arguments of pivot_wider() (like they were in spread()). Investigate the pivot_wider+F1 Help tab for more information.

Wrap-up and notes

Now these are datasets we can work with: each column is a variable, each row is an observation.

Do not start replacing working and tested instances of gather() or spread() in your existing R code with these new functions. That is neither efficient nor necessary – gather() and spread() will remain in tidyr to make sure people’s scripts don’t suddenly stop working. Meaning: tidyr is backward compatible. But after these functions are officially released, I will start using them in all new scripts I write.

I made the original messy columns still relatively nice to work with – no typos and reasonable delimiters. Usually, the labels are much worse and need the help of janitor::clean_names(), stringr::str_replace(), and multiple iterations of tidyr::separate() to arrive at a nice tidy tibble/data frame.

tidyr::separate() tips:

into = c("var1", NA, "var2") – now this is an amazing trick I only came across this week! This is a convenient way to drop useless (new) columns. Previously, I would have achieved the same result with:

... %>% 
    separate(..., into = c("var1", "drop", "var2")) %>% 
    select(-drop) %>% 
    ...

convert = TRUE: by default, separate() creates new variables that are also just “characters”. This means our age would have been a chacter vector of, e.g., “5”, “10”, rather than 5, 10, and R wouldn’t have known how to do arithmetic on it. In this example, convert = TRUE is equivalent to mutate(age = as.numeric(age)).

Good luck!

P.S. This is one of the coolest Tweets I’ve ever seen:

On this auspicious day, in honor of tidyr, I am pleased to re-introduce you to pivot_longer and pivot_wider. #rstats pic.twitter.com/hj3zqYYbVf

— Ian Lyttle (@ijlyttle) April 1, 2019

New intuitive ways for reshaping data in R: long live pivot_longer() and pivot_wider()

Introduction

Installation

Examples

Wrap-up and notes

tidyr::separate() tips:

Like this:

Related

Leave a Reply Cancel reply

Introduction

Installation

Examples

Wrap-up and notes

tidyr::separate() tips:

Share this:

Like this:

Related

Leave a Reply Cancel reply