← back

R for Data Science Exercises

Our task was to complete the exercises from chapters 5 and 7 of the brilliant R for Data Science textbook.

Chapter 5

These exercises were for practising handling data that is not "tidy" - data that doesn't follow these rules.

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

Our task was to calculate tuberculosis rates from tables of "untidy" data, without making it tidy first.

table2 has a column for the value "cases" or "population" - this is "untidy" because these values could be column names.

table2 <- tribble(
  ~country, ~year, ~type, ~count,
  "Afghanistan", 1999, "cases", 745,
  "Afghanistan", 1999, "population", 19987071,
  "Afghanistan", 2000, "cases", 2666,
  "Afghanistan", 2000, "population", 20595360,
  "Brazil", 1999, "cases", 37737,
  "Brazil", 1999, "population", 172006362,
  "Brazil", 2000, "cases", 80488,
  "Brazil", 2000, "population", 174504898,
  "China", 1999, "cases", 212258,
  "China", 1999, "population", 1272915272,
  "China", 2000, "cases", 213766,
  "China", 2000, "population", 1280428583
)

First I extracted the values for "cases" and "population" into vectors.

table2_cases <- table2 |>
  filter(type == "cases") |>
  select(count) |>
  pull(count)

table2_population <- table2 |>
  filter(type == "population") |>
  select(count) |>
  pull(count)

I used these new vectors to create a vector for "rate" - the number of tuberculosis cases for every 10,000 people.

table2_rate <- table2_cases / table2_population * 10000

I looped backwards through table2 adding rows for "rate".

for(i in seq(nrow(table2), 1, by = -2)) {
  table2 <- add_row(table2, country = table2$country[i], year = table2$year[i], type = "rate", count = table2_rate[i/2], .after = i)
}

I used formatC to format the "count" column - format="f" removes scientific format, and drop0trailing = TRUE drops zeros after the decimal point.

table2$count <- formatC(table2$count, format= "f", drop0trailing = TRUE)

Here's the resulting table2.

countryyeartypecount
1Afghanistan1999cases745
2Afghanistan1999population19987071
3Afghanistan1999rate0.3727
4Afghanistan2000cases2666
5Afghanistan2000population20595360
6Afghanistan2000rate1.2945
7Brazil1999cases37737
8Brazil1999population172006362
9Brazil1999rate2.1939
10Brazil2000cases80488
11Brazil2000population174504898
12Brazil2000rate4.6124
13China1999cases212258
14China1999population1272915272
15China1999rate1.6675
16China2000cases213766
17China2000population1280428583
18China2000rate1.6695

table3 has a column for the values "cases" and "population" together - this is "untidy" because these values could be in separate columns.

table3 <- tribble(
  ~country, ~year, ~rate,
  "Afghanistan", 1999, "745/19987071",
  "Afghanistan", 2000, "2666/20595360",
  "Brazil", 1999, "37737/172006362",
  "Brazil", 2000, "80488/174504898",
  "China", 1999, "212258/1272915272",
  "China", 2000, "213766/1280428583"
)

I split the "rate" column into 2 values - cases and population.

cases_population <- str_split(table3$rate, "/")

I replaced the values in the "rate" column with new calculated values - the number of tuberculosis cases for every 10,000 people.

for(i in 1:nrow(table3)) {
  table3$rate[i] <- as.numeric(cases_population[[i]][1])/as.numeric(cases_population[[i]][2]) * 10000
}

Here's the resulting table3 - note that the rates are the same as in table2.

countryyearrate
Afghanistan19990.372740958392553
Afghanistan20001.29446632639585
Brazil19992.19393047799011
Brazil20004.61236337331918
China19991.66749511667419
China20001.6694878795907

Chapter 7

These exercises were for practising importing data with read_csv().

The first exercise has 5 read_csv() statements with errors. We had to work out the intention of each statement and correct it.

1. read_csv("a,b\n1,2,3\n4,5,6")

The intention is to read 3 columns and 2 rows, but column 3 doesn't have a name - I corrected it to this.

read_csv("a,b,c\n1,2,3\n4,5,6")
abc
123
456
2. read_csv("a,b,c\n1,2\n1,2,3,4")

The rows have different numbers of columns - I corrected it to this.

read_csv("a,b,c,d\n1,2,3,4\n1,2,3,4")
abcd
1234
1234
3. read_csv("a,b\n\"1")

The quotes around the value are not closed, and the rows have different numbers of columns - I corrected it to this.

read_csv("a,b\n\"1\",\"2\"")
ab
12
4. read_csv("a,b\n1,2\na,b")

The column names seem to be repeated - I corrected it to this.

read_csv("a,b\n1,2")
ab
12
5. read_csv("a;b\n1;3")

The delimeter is a semicolon - I corrected it to this.

read_delim("a;b\n1;3", delim = ";")
ab
13

The final exercise was practising referring to "nonsyntactic" column names - the column names are integers.

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

Extract the variable called "1".

annoying$`1`
 [1]  1  2  3  4  5  6  7  8  9 10

Plot a scatterplot of "1" versus "2".

plot(annoying$`1`, annoying$`2`)

Create a new column called "3", which is "2" divided by "1".

annoying$`3` <- annoying$`1` / annoying$`2`
annoying$`3`
 [1] 0.4346452 0.5703123 0.4004234 0.5342338 0.4240465 0.5135067 0.5526404 0.4910714
 [9] 0.4889729 0.5659152

Rename the columns to "one", "two" and "three".

annoying <- annoying %>% 
  rename(one = `1`, two = `2`, three = `3`)
onetwothree
112.300727256314710.43464517458788
223.506850264367560.570312345617268
337.492069714818840.400423396230042
447.487359133018890.534233757047955
5511.79116017812610.424046482658725
6611.68436587695250.513506686044045
7712.66646560449150.552640351189824
8816.29091026993250.491071393031075
9918.40592818051460.488972895674333
101017.67049258403380.565915180487698

I thought these were excellent exercises, particularly the ones where we had to think about the intention of incorrect R code before correcting it.

I love using SQL, and R feels like the next logical step for me. These exercises really helped me look more deeply at the subtleties of R functions.