World map using the tidyverse (ggplot2) and an equal-area projection

This post was originally published here

There are several different ways to make maps in R, and I always have to look it up and figure this out again from previous examples that I’ve used. Today I had another look at what’s currently possible and what’s an easy way of making a world map in ggplot2 that doesn’t require fetching data from various places.
TLDR: Copy this code to plot a world map using the tidyverse:

Reshaping multiple variables into tidy data (wide to long)

This post was originally published here

There’s some explanation on what reshaping data in R means, why we do it, as well as the history, e.g., melt() vs gather() vs pivot_longer() in a previous post: New intuitive ways for reshaping data in R
That post shows how to reshape a single variable that had been recorded/entered across multiple different columns. But if multiple different variables are recorded over multiple different columns, then this is what you might want to do:
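
As a minimal sketch of the idea: the data frame and column names below are made up for illustration, and it assumes tidyr 1.1.0 or later (for names_transform). The special ".value" entry in names_to tells pivot_longer() that the first part of each column name is itself a variable name, while the second part goes into a new age column:

```r
library(tidyr)

# Hypothetical wide data: two variables (likes, gets), each recorded at two ages
candy_wide = data.frame(
  candy_type = c("Chocolate", "Lollipop"),
  likes_5  = c(4, 10),
  likes_10 = c(6, 8),
  gets_5   = c(2, 8),
  gets_10  = c(4, 6)
)

# "likes_5" is split at "_" into a variable name (likes) and an age (5)
candy_long = pivot_longer(
  candy_wide,
  cols = -candy_type,
  names_to = c(".value", "age"),
  names_sep = "_",
  names_transform = list(age = as.integer)
)

candy_long
```

The result should have one row per candy type and age, with likes and gets as separate columns.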

Setting up a simple one page website using Nicepage and Netlify

This post was originally published here

I’ve just set up a single page website (= online business card) for myself and my husband: https://pius.cloud/ . This post summarises what I did. If you’re looking to get started with something super quickly, then only the first two steps are essential (Creating a website and Serving a website).
Creating a website (using Nicepage) I’ve created websites using various tools such as straight up HTML, WordPress, Hugo+blogdown (this site – riinu.

R: filtering with NA values

This post was originally published here

NA – Not Available/Not applicable is R’s way of denoting empty or missing values. When doing comparisons – such as equal to, greater than, etc. – extra care and thought needs to go into how missing values (NAs) are handled. More explanations about this can be found in Chapter 2: R basics of our book, which is freely available on the HealthyR website.
This post lists a couple of different ways of keeping or discarding rows based on how important the variables with missing values are to you.
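
To give a flavour of what this looks like, here is a small sketch using dplyr. The patients tibble is invented for this example and is not the data from the full post:

```r
library(dplyr)

# Hypothetical example data with one missing age
patients = tibble(id = 1:4, age = c(25, NA, 70, 40))

# Any comparison with NA returns NA, not TRUE or FALSE
NA > 30

# filter() only keeps rows where the condition is TRUE,
# so the row with the missing age is silently dropped
older = patients %>% 
  filter(age > 30)

# If rows with a missing age should be kept, ask for them explicitly
older_or_missing = patients %>% 
  filter(age > 30 | is.na(age))

# And to discard rows where age is missing
complete = patients %>% 
  filter(!is.na(age))
```

Which of these you want depends on how important the variable with missing values is to the question you are asking.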

RStudio Server LAN party: Laptop+Router+Docker to serve RStudio offline

This post was originally published here

TLDR: You can teach R on people’s own laptops without having them install anything or requiring an internet connection.

Members of the Surgical Informatics team in Ghana, 2019. More information: surgicalinformatics.org

Introduction

Running R programming courses on people’s own laptops is a pain, especially as we use a lot of very useful extensions that actually make learning and using R much easier and more fun. But long installation instructions can be very off-putting for complete beginners, and people can be discouraged from learning programming if installation hurdles invoke their imposter syndrome.

We almost always run our courses in places with a good internet connection (it does not have to be super fast or flawless), so we get our students all set up on RStudio Server (hosted by us) or https://rstudio.cloud (a free service provided by RStudio!).
You connect to either of these options using a web browser, and even very old computers can handle this. That’s because the actual computations happen on the server and not on the student’s computer. So the computer just serves as a window to the training instance used.

Now, these options work really well as long as you have a stable internet connection. But for teaching R offline and on people’s own laptops, you either have to:

  1. make sure everyone installs everything correctly before they attend the course
  2. download all the software and extensions, put them on USB sticks and try to install them together at the start
  3. start serving RStudio from your computer using a Local Area Network (LAN) created by a router

Now, we already discussed why the first option is problematic (gatekeeper for complete beginners). The second option – installing everything at the start together – means that you start the course with the most boring part. And since everyone’s computers are different (both by operating systems as well as different versions of the operating systems), this can take quite a while to sort. Therefore, cue option 3 – an RStudio Server LAN party.

Requirements

  1. A computer with more than 4GB of RAM. macOS alone uses around 2-3GB just to keep going, and running the RStudio Server docker container was using another 3-4 GB, so you’ll definitely need more than 4GB in total.
  2. A network router. For a small number of participants, the same one you already have at home will work. Had to specify “network” here, as apparently, even my Google search for “router” suggests the power tool before network routers.
  3. Docker – free software, dead easy to install on macOS (search the internet for “download Docker”). Looks like installation on the Windows Home operating system might be trickier. If you are a Windows Home user who is using Docker, please do post a link to your favourite instructions in the comments below.
  4. Internet connection for setting up – to download RStudio’s docker image and install your extra packages.
My MacBook Pro serving RStudio to 10 other computers in Ghana, November 2019.

Set-up

Running RStudio using Docker is so simple you won’t believe me. It honestly is just a single-liner to be entered into your Terminal (Command Prompt on Windows):

docker run -d -p 8787:8787 -e ROOT=TRUE -e USER=user -e PASSWORD=password rocker/verse

This will automatically download a Docker image put together by the Rocker project. The one called verse includes all the tidyverse packages as well as publishing-related ones (R Markdown, Shiny, etc.). You can find a list of the different ones here: https://github.com/rocker-org/rocker

Then open a browser and go to localhost:8787 and you should be greeted with an RStudio Server login! (localhost only works on a Mac or Linux; if using Windows, take a note of your IP address and use that instead of localhost.) More information and instructions can be found here: https://github.com/rocker-org/rocker/wiki/Using-the-RStudio-image

Tip: RStudio suggests port 8787, which is what I used for consistency, but if you serve it on port 80 instead (-p 80:8787) you can omit the :80 from the address, as that’s the default anyway. So you can just go to localhost (or something like 127.0.0.1 if using Windows).

For those of you who have never seen or used RStudio Server, this is what it looks like:

RStudio Server is almost identical to RStudio Desktop. The main difference is the “Upload” button in the Files pane. This one is running in a Docker container, served at port 8787, and accessed using Safari (but any web browser will work).

The Docker single-liner above will create a single user with sudo rights (since I’ve included -e ROOT=TRUE). After logging into the instance, you can then add other users and copy the course materials to everyone using these scripts: https://github.com/einarpius/create_rstudio_users Note that the instance is running Debian, so you’ll need very basic familiarity with managing file permissions on the command line. For example, you’ll need to make the scripts executable with chmod 700 create_users.sh.

Then connect to the same router you’ll be using for your LAN party, go to router settings and assign yourself a fixed IP address, e.g., 192.168.1.78. Once other people connect to the network created by this router (either by WiFi or cable), they need to type 192.168.1.78:8787 into any browser and can just start using RStudio. This will work as long as your computer is running Docker and you are all connected to the same router.

I had 10 people connected to my laptop and, most of the time, the strain on my CPU was negligible – around 10-20%. That’s because it was a course for complete beginners and they were mostly reading the instructions (included in the training Notebooks they were running R code in). So they weren’t actually hitting Run at the same time, and the tasks weren’t computationally heavy. When we did ask everyone to hit the “Knit to PDF” button all at the same time, it got a bit slower and my CPU was apparently working at 200%. But nothing crashed and everyone got their PDFs made.

Why are you calling it a LAN party?

My friends and I having a LAN party in Estonia, 2010. We would mostly play StarCraft or Civilization, or as pictured here - racing games to wind down at the end.

LAN stands for Local Area Network and in most cases means “devices connected to the same WiFi*”. You’ve probably used LANs lots in your life without even realising. One common example is printers: you know when a printer asks you to connect to the same network to be able to print your files? This usually means your computer and the printer will be in a LAN. If your printer accepted files via any internet connection, rather than just the same local network, then people around the world could submit stuff for your printer. Furthermore, if you have any smart devices in your home, they’ll be having a constant LAN party with each other.

The term “LAN party” means people coming together to play multiplayer computer games – as it will allow people to play in the same “world”, to either build things together or fight with each other. Good internet access has made LAN parties practically obsolete – people and their computers no longer have to physically be in the same location to play multiplayer games together. I use the term very loosely to refer to anything fun happening on the same network. And being able to use RStudio is definitely a party in my books anyway.

But LAN parties are still very much relevant for security reasons (e.g., the printer example), or for sharing resources in places without an excellent internet connection.

* Overall, most existing LANs operate via Ethernet cables (or “internet cables” as most people, including myself, refer to them). WiFi LAN or WLAN is a type of LAN. Have a look at your home router: it will probably have different lights for “internet” and “WLAN”/“wireless”. A LAN can also be connected to the internet – if the router itself is connected to the internet. That’s the main purpose of a router – to take the internet coming into your house via a single Ethernet cable, and share it with all your other devices. A LAN is usually just a nice side-effect of that.

Docker, containers, images

Docker image – a file bundling an operating system + programs and files
Docker container – a running image (it may be paused or stopped)

List of all your containers: docker ps -a (just docker ps will list running containers, so the ones not stopped or paused)

List your images: docker images

Run a container using an image:

docker run -d -p 8787:8787 -e ROOT=TRUE -e USER=user -e PASSWORD=password rocker/verse

When you run rocker/verse for the first time it will be downloaded into your images. The next time it will be taken directly from there, rather than downloaded. So you’ll only need internet access once.

Stop an active container: docker stop container-name

Start it up again: docker start container-name

Save a container as an image (for versioning or passing on to other people):

docker commit container-name repository:tag

For example: docker commit rstudio-server rstudio/riinu:test1

Rename container (by default it will get a random label, I’d change it to rstudio-server):

docker rename happy_hippo rstudio-server

You can then start your container with: docker start rstudio-server

New intuitive ways for reshaping data in R: long live pivot_longer() and pivot_wider()

This post was originally published here

TLDR: there are two new and very intuitive R functions for reshaping data: see Examples of pivot_longer() and pivot_wider() below. At the time of writing, these new functions are extremely fresh and only exist in the development version on GitHub (see Installation), so we should probably wait for the tidyverse team to officially release them (on CRAN) before putting them into day-to-day use.

Exciting!

Introduction

The juxtaposition of data collection vs data analysis: data that was very easy to collect is probably very hard to analyse, and vice versa. For example, if data is collected/written down in whichever format was most convenient at the time of data collection, it is probably not recorded in a regularly shaped table, and various bits of information end up in different parts of the document. And even if data is collected into a table, it is often intuitive (for data entry) to include information about the same variable in different columns. For example, look at this example data I just made up:

library(tidyverse)

candydata_raw = read_csv("2019-04-07_candy_preference_data.csv")

candy_type  likes age: 5  likes age: 10  likes age: 15  gets age: 5  gets age: 10  gets age: 15
Chocolate              4              6              8            2             4             6
Lollipop              10              8              6            8             6             4

For each candy type, there are 6 columns with values. But actually, these 6 columns capture a combination of 3 variables: age, likes and gets. This is known as the wide format, and it is a convenient way to either note down or even present values. It is human-readable. For effective data analysis, however, we need data to be in the tidy data format, where each column is a single variable, and each row a single observation (https://www.jstatsoft.org/article/view/v059i10). It needs to be less human-readable and more computer-friendly.
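
If you want to follow along without the CSV file, the same two rows can be typed in directly. This is only a sketch: tribble() is my stand-in for read_csv() here, with the values copied from the table above.

```r
library(tibble)

# Recreate the made-up candy data by hand instead of reading the CSV.
# Column names containing spaces need backticks.
candydata_raw = tribble(
  ~candy_type, ~`likes age: 5`, ~`likes age: 10`, ~`likes age: 15`, ~`gets age: 5`, ~`gets age: 10`, ~`gets age: 15`,
  "Chocolate",               4,                6,                8,              2,               4,               6,
  "Lollipop",               10,                8,                6,              8,               6,               4
)

candydata_raw
```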

Some of you may remember the now retired reshape2::melt() or reshape2::dcast(), and many of you (including myself!) have struggled remembering the arguments for tidyr::gather() and tidyr::spread(). Based on extensive community feedback, the tidyverse team have reinvented these functions using both more intuitive names, as well as clearer syntax (arguments): pivot_longer() and pivot_wider().

Installation

These functions were added just a month ago, so they are not yet included in the standard version of tidyr that comes with install.packages("tidyverse") or even update.packages() (the current version of tidyr on CRAN is 0.8.3). To play with the bleeding edge versions of R packages, run install.packages("devtools") and then devtools::install_github("tidyverse/tidyr"). If you are a Mac user and it asks you “Do you want to install from sources the package which needs compilation?”, say Yes.

You might need to Restart R (Session menu at the top) and load library(tidyverse) again. You can check whether you now have these functions installed by typing in pivot_longer and pressing F1 – if a relevant Help tab pops open you got it.

Examples

candydata_longer = candydata_raw %>% 
  pivot_longer(contains("age"))

candy_type  name           value
Chocolate   likes age: 5       4
Chocolate   likes age: 10      6
Chocolate   likes age: 15      8
Chocolate   gets age: 5        2
Chocolate   gets age: 10       4
Chocolate   gets age: 15       6
Lollipop    likes age: 5      10
Lollipop    likes age: 10      8
Lollipop    likes age: 15      6
Lollipop    gets age: 5        8
Lollipop    gets age: 10       6
Lollipop    gets age: 15       4

Now, that’s already a lot better, but we still need to split the name column into the two different variables it really includes. “name” is what pivot_longer() calls this new column by default. Remember, each column is a single variable.

candydata_longer = candydata_raw %>% 
  pivot_longer(contains("age")) %>% 
  separate(name, into = c("questions", NA, "age"), convert = TRUE)

candy_type  questions  age  value
Chocolate   likes        5      4
Chocolate   likes       10      6
Chocolate   likes       15      8
Chocolate   gets         5      2
Chocolate   gets        10      4
Chocolate   gets        15      6
Lollipop    likes        5     10
Lollipop    likes       10      8
Lollipop    likes       15      6
Lollipop    gets         5      8
Lollipop    gets        10      6
Lollipop    gets        15      4

And pivot_wider() can be used to do the reverse:

candydata = candydata_longer %>% 
  pivot_wider(names_from = questions, values_from = value)

candy_type  age  likes  gets
Chocolate     5      4     2
Chocolate    10      6     4
Chocolate    15      8     6
Lollipop      5     10     8
Lollipop     10      8     6
Lollipop     15      6     4

It is important to spell out the arguments here (names_from =, values_from =) since they are not the second and third arguments of pivot_wider() (like they were in spread()). Investigate the pivot_wider+F1 Help tab for more information.

Wrap-up and notes

Now these are datasets we can work with: each column is a variable, each row is an observation.

Do not start replacing working and tested instances of gather() or spread() in your existing R code with these new functions. That is neither efficient nor necessary – gather() and spread() will remain in tidyr to make sure people’s scripts don’t suddenly stop working. Meaning: tidyr is backward compatible. But after these functions are officially released, I will start using them in all new scripts I write.

I made the original messy columns still relatively nice to work with – no typos and reasonable delimiters. Usually, the labels are much worse and need the help of janitor::clean_names(), stringr::str_replace(), and multiple iterations of tidyr::separate() to arrive at a nice tidy tibble/data frame.

tidyr::separate() tips:

into = c("var1", NA, "var2") – now this is an amazing trick I only came across this week! This is a convenient way to drop useless (new) columns. Previously, I would have achieved the same result with:

... %>% 
    separate(..., into = c("var1", "drop", "var2")) %>% 
    select(-drop) %>% 
    ...
    

convert = TRUE: by default, separate() creates new variables that are also just “characters”. This means our age would have been a character vector of, e.g., “5”, “10”, rather than 5, 10, and R wouldn’t have known how to do arithmetic on it. In this example, convert = TRUE has much the same effect as mutate(age = as.numeric(age)), except that the whole numbers come back as integers.
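
Here is a quick sketch to see the difference for yourself. The two-row labels data frame is made up for this check and simply reuses two of the column labels from the candy example:

```r
library(tidyr)

# Two of the messy labels from the candy example
labels = data.frame(name = c("likes age: 5", "likes age: 10"))

# Without convert, the new age column stays as text
as_text = separate(labels, name, into = c("questions", NA, "age"))

# With convert = TRUE, separate() works out a sensible type for each new column
as_numbers = separate(labels, name, into = c("questions", NA, "age"), convert = TRUE)

class(as_text$age)     # "character"
class(as_numbers$age)  # "integer"
```

The NA in the middle of into drops the piece that just says “age”, so only questions and age remain.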

Good luck!

P.S. This is one of the coolest Tweets I’ve ever seen:

Global map of country names

This post was originally published here

This post demonstrates the use of two very cool R packages – ggrepel and patchwork.

ggrepel deals with overlapping text labels (Code#1 at the bottom of this post):

patchwork is a very convenient new package for combining multiple different plots together (i.e. what we used to use grid and gridExtra for).

More info:

To really demonstrate the power of them, let’s make a global map of country names using ggrepel:

Now this is very good already with hardly any overlapping labels and the world is pretty recognisable. And really, you can make this plot with just 2 lines of code:

So what these two lines make is already very amazing.

But I feel like Europe is a little bit misshapen and that the Caribbean and Africa are too close together. So I divided the world into regions (in this case same as continents except Russia is its own region – it’s just so big). Then I wrote two functions that asked ggrepel to plot each region separately and used patchwork to patch the regions together:

This gives continents a much better shape, but it does severely misplace Polynesia. See if you can find where, e.g., Tonga is and where it should be.

To see what I did with patchwork there, let’s add black borders to each region (Code#2):

Code#1:

Code#2:

My data science toolbox

This post was originally published here

I’ve been doing data science for over 10 years now, although for most of this time I didn’t realise I was doing data science. I thought I was just doing normal science but focusing on simulations and data analysis, rather than field or lab work. I’ve switched fields a few times now: physics BSc, chemistry PhD, now working in medical research. Therefore, instead of this lengthy introduction:

“I’m a physicist by background with substantial interdisciplinary expertise in simulations, data analysis, programming…”

I just go with:

I’m a data scientist.

Anyway, here’s how my toolbox and technical skills have evolved over the years:

My data science toolbox evolution.

P.S. Once a physicist, always a physicist.

Islay distilleries in 3 days

This post was originally published here

Day 0 (Sunday 18-February 2018)

Left Edinburgh at 8am for a 1pm ferry from Kennacraig to Port Askaig (Islay). Edinburgh-Kennacraig should be a 3.5h drive (and it was), but we left early to allow for any delays on the road. Arrived on Islay at 3pm and our accommodation near Port Ellen (southern Islay, close to Ardbeg, Lagavulin, Laphroaig) was a 40 min drive from the port.

Map of Islay with all its lovely distilleries.

Day 1 (Monday 19-February 2018): Ardbeg, Lagavulin, Laphroaig

We hadn’t booked anything other than the ferry and accommodation. February is very low season so we were right to think that no other advance bookings were necessary.

We had a lazy morning and drove to Laphroaig at about 11am. We asked which tours or tasting events were on that day and booked Einar onto the Layers of Laphroaig tasting at 3pm (as the driver, I was allowed to accompany him for free). We then drove to Lagavulin (just a few miles from Laphroaig) and booked ourselves onto the tour at 1pm. We then drove to Ardbeg (another few miles) and had second breakfast at their cafe. Then drove back to Lagavulin for the tour, and then back to Laphroaig for the tasting.

Ardbeg’s epic cafe.

Waiting for the tour to begin at Lagavulin’s homey tasting room.

In Laphroaig’s tasting room: The Layers of Laphroaig introduced whiskies from different casks that make up their range of malts. These include ex-bourbon, virgin oak (I did not know Scotch could be matured in virgin casks – I thought it always had to be ex-something!), ex-sherry, ex-port. We were the only ones booked on this so it ended up being a private tasting.

Day 2 (Tuesday 20-February 2018): Kilchoman, Bruichladdich

Einar drove us to Kilchoman where I had a tasting of their 3 limited edition malts in the visitor centre. Kilchoman is a “farm-distillery” and they even grow some of their own barley. We bought a bottle of their “100% Islay” which is made from barley grown at the premises. Unfortunately, we completely forgot to take any pictures there. Must go back.

Driving on Islay.

We then went to Bruichladdich and booked me on the Warehouse Experience at 2pm. Similarly to Laphroaig, the driver was allowed to accompany for free. We had lunch at Port Charlotte while waiting for the event.

Bruichladdich warehouse experience

We then went by Bowmore (it was nearly 5pm) and asked about the different tours and experiences they have on the next day. Decided to do the “Bottle Your Own in the Vaults” first thing on Wednesday morning.

Day 3 (Wednesday 21-February 2018): Bowmore, Bunnahabhain, Ardnahoe, Caol Ila

Bottling a 17-year-old sherry cask beauty at Bowmore.

We then dropped by Bunnahabhain – no tours were running that day but we were offered a few free tasters at the shop. On our way back from Bunnahabhain we took a picture at Ardnahoe (a new distillery that opens any day now).

Visiting Bunnahabhain and stopping at soon to be opened Ardnahoe.

The final distillery was Caol Ila where we went on the standard tour. The view in the stills room was just out of this world. They didn’t allow us to take pictures inside, so I took this from their website:

Caol Ila stills with a view of the Isle of Jura. Picture from: https://www.malts.com/en-row/distilleries/caol-ila/

Me outside Caol Ila with Jura in the background

What we brought back with us

In addition to whisky distilleries, we also visited a nano-brewery, and it turns out that The Botanist (a gin) is made at Bruichladdich.
