← back

Coronavirus Recovery Activity

Our task was to analyse coronavirus recovery data from India from January 2020 to March 2020 using RStudio.

The data came from here on Kaggle. Here's a sample of the data. (There are actually 270 rows.)

SnoDateState/UnionTerritoryConfirmedIndianNationalConfirmedForeignNationalCuredDeaths
130-01-2020Kerala1000
231-01-2020Kerala1000
301-02-2020Kerala2000
402-02-2020Kerala3000
503-02-2020Kerala3000

First I created a frequency table to count how many daily reports came from each state.

table(Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory`)

Here are the first few results.

State/UnionTerritoryFreq
Andhra Pradesh10
Chandigarh1
Chattisgarh1
Chhattisgarh2
Delhi20

The results show that the states weren't named consistently, for example "Chhattisgarh" was named "Chattisgarh" in one record. I browsed the other records and saw that Union territories were also not named consistently, for example, "Ladakh" was sometimes named "Union Territory of Ladakh".

I used this R code to tidy the data.

library(dplyr)

#rename any states named "Chattisgarh"
Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory` <- which(Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory` == "Chattisgarh") %>% 
  replace(Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory`, ., "Chhattisgarh")

#get row numbers of states starting with "Union Territory of "
ut <- which(substr(Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory`, 1, 19) == "Union Territory of ")

#remove "Union Territory of " from the start of any state names
Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory`[ut] <-
  substr(Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory`[ut], 20, nchar(Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory`[ut]))

I created the frequency table again with the corrected data.

table(Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory`)

Here are the first few results.

State/UnionTerritoryFreq
Andhra Pradesh10
Chandigarh1
Chhattisgarh3
Delhi20
Gujarat2

I put the results in descending order to see which states appeared most frequently.

table(Covid19_India_Jan_20_Mar_20_$`State/UnionTerritory`) %>%
  as.data.frame() %>%
  arrange(desc(Freq))

Here are the first ten results.

State/UnionTerritoryFreq
Kerala52
Delhi20
Telengana20
Rajasthan19
Haryana18
Uttar Pradesh18
Ladakh15
Tamil Nadu15
Jammu and Kashmir13
Karnataka13

I wanted to find the range of dates, but the Date column was "character" data type, so I made a new column date_converted with "Date" data type, then found the difference between the first and last dates.

Covid19_India_Jan_20_Mar_20_$date_converted <- as.Date(Covid19_India_Jan_20_Mar_20_$Date, format = "%d-%m-%Y")

max(Covid19_India_Jan_20_Mar_20_$date_converted) - min(Covid19_India_Jan_20_Mar_20_$date_converted)

Time difference of 51 days

I created a frequency table to count how many daily reports showed recoveries during this period.

table(Covid19_India_Jan_20_Mar_20_$Cured == 0)
  
FALSE  TRUE
   55   215

I also checked this produced the opposite result if I used Cured > 0.

table(Covid19_India_Jan_20_Mar_20_$Cured > 0)

FALSE  TRUE
  215    55

I turned the results into percentages.

table(Covid19_India_Jan_20_Mar_20_$Cured == 0)/nrow(Covid19_India_Jan_20_Mar_20_) * 100

   FALSE     TRUE
20.37037 79.62963

I added a variable has_recovery that indicates whether any recoveries were reported.

Covid19_India_Jan_20_Mar_20_$has_recovery <- Covid19_India_Jan_20_Mar_20_$Cured > 0
SnoDateState/UnionTerritoryConfirmedIndianNationalConfirmedForeignNationalCuredDeathshas_recovery
130-01-2020Kerala1000FALSE
231-01-2020Kerala1000FALSE
301-02-2020Kerala2000FALSE
402-02-2020Kerala3000FALSE
503-02-2020Kerala3000FALSE

I added another variable has_deaths that indicates whether any deaths were reported.

Covid19_India_Jan_20_Mar_20_$has_deaths <- Covid19_India_Jan_20_Mar_20_$Deaths > 0
SnoDateState/UnionTerritoryConfirmedIndianNationalConfirmedForeignNationalCuredDeathshas_recoveryhas_deaths
130-01-2020Kerala1000FALSEFALSE
231-01-2020Kerala1000FALSEFALSE
301-02-2020Kerala2000FALSEFALSE
402-02-2020Kerala3000FALSEFALSE
503-02-2020Kerala3000FALSEFALSE

I created a frequency table to count how many daily reports showed deaths during this period.

table(Covid19_India_Jan_20_Mar_20_$has_deaths)
  
FALSE  TRUE
  245    25

I converted the has_deaths variable to a "factor" variable has_deaths_factor with labels "No Deaths" and "Deaths Reported".

Covid19_India_Jan_20_Mar_20_$has_deaths_factor <- factor(Covid19_India_Jan_20_Mar_20_$has_deaths, labels = c("No Deaths", "Deaths Reported"))
SnoDateState/UnionTerritoryConfirmedIndianNationalConfirmedForeignNationalCuredDeathshas_recoveryhas_deathshas_deaths_factor
130-01-2020Kerala1000FALSEFALSENo Deaths
231-01-2020Kerala1000FALSEFALSENo Deaths
301-02-2020Kerala2000FALSEFALSENo Deaths
402-02-2020Kerala3000FALSEFALSENo Deaths
503-02-2020Kerala3000FALSEFALSENo Deaths

Finally I created a "categorical" variable case_level based on total confirmed cases - Indian nationals and foreign nationals.
0 total cases: "No Cases"
1-5 total cases: "Low Cases"
6-15 total cases: "Medium Cases"
16+ total cases: "High Cases"

Covid19_India_Jan_20_Mar_20_$case_level <- as.factor(ifelse(Covid19_India_Jan_20_Mar_20_$ConfirmedIndianNational + Covid19_India_Jan_20_Mar_20_$ConfirmedForeignNational < 1, "No Cases",
                                                            ifelse(Covid19_India_Jan_20_Mar_20_$ConfirmedIndianNational + Covid19_India_Jan_20_Mar_20_$ConfirmedForeignNational < 6, "Low Cases",
                                                                   ifelse(Covid19_India_Jan_20_Mar_20_$ConfirmedIndianNational + Covid19_India_Jan_20_Mar_20_$ConfirmedForeignNational < 16, "Medium Cases", "High Cases"))))
SnoDateState/UnionTerritoryConfirmedIndianNationalConfirmedForeignNationalCuredDeathshas_recoveryhas_deathshas_deaths_factorcase_level
130-01-2020Kerala1000FALSEFALSENo DeathsLow Cases
231-01-2020Kerala1000FALSEFALSENo DeathsLow Cases
301-02-2020Kerala2000FALSEFALSENo DeathsLow Cases
402-02-2020Kerala3000FALSEFALSENo DeathsLow Cases
503-02-2020Kerala3000FALSEFALSENo DeathsLow Cases

I made a bar chart showing the frequency of COVID-19 reports by state.

Covid19_India_Jan_20_Mar_20_ %>% 
  ggplot(aes(y = fct_rev(`State/UnionTerritory`))) + 
  geom_bar() + 
  labs(y = "state", title = "COVID-19 reports from Indian states Jan-Mar 2020")

I made a pie chart showing the distribution of case severity levels, using total cases of Indian nationals and foreign nationals.

Covid19_India_Jan_20_Mar_20_ %>%
  group_by(case_level) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x = "", y = count, fill = case_level)) + 
  geom_col() + 
  coord_polar(theta = "y") + 
  geom_text(aes(label = count), position = position_stack(vjust = .5)) + 
  labs(x = "", y = "", fill = "", title = "COVID-19 reports in India Jan-Mar 2020") + 
  scale_fill_discrete(breaks = c("High Cases", "Medium Cases", "Low Cases")) + 
  theme(axis.text.x = element_blank())

I made a histogram showing the distribution of recovery numbers.

Covid19_India_Jan_20_Mar_20_ %>% 
  ggplot(aes(Cured)) + 
  geom_histogram(bins = 10) + 
  labs(title = "Distribution of COVID-19 recovery numbers in India Jan-Mar 2020")

I made a line chart showing the trend of total cases over time.

Covid19_India_Jan_20_Mar_20_ %>% 
  arrange(date_converted) %>% 
  group_by(date_converted) %>% 
  summarise(total_cases = sum(ConfirmedIndianNational + ConfirmedForeignNational)) %>% 
  mutate(cum_sum = cumsum(total_cases)) %>% 
  ggplot(aes(date_converted, cum_sum)) + 
  geom_line() + 
  labs(title = "Cumulative total of COVID-19 cases in India Jan-Mar 2020", x = "date", y = "cumulative total")

I made a list of states showing total number of cases (Indian nationals and foreign nationals) in order from highest to lowest.

Covid19_India_Jan_20_Mar_20_ %>% 
  group_by(`State/UnionTerritory`) %>% 
  summarise(total_cases = sum(ConfirmedIndianNational + ConfirmedForeignNational)) %>% 
  arrange(desc(total_cases))

Here are the first ten results.

State/UnionTerritorytotal_cases
Kerala406
Maharashtra355
Uttar Pradesh214
Haryana181
Rajasthan175
Delhi133
Karnataka103
Telengana74
Ladakh71
Jammu and Kashmir30

I made the same list again, but added the total number cured.

Covid19_India_Jan_20_Mar_20_ %>% 
  group_by(`State/UnionTerritory`) %>% 
  summarise(total_cases = sum(ConfirmedIndianNational + ConfirmedForeignNational), total_cured = sum(Cured)) %>% 
  arrange(desc(total_cases))

Here are the first ten results.

State/UnionTerritorytotal_casestotal_cured
Kerala40657
Maharashtra3550
Uttar Pradesh21450
Haryana1810
Rajasthan17522
Delhi13322
Karnataka1032
Telengana747
Ladakh710
Jammu and Kashmir300

We used the data to calculate probabilities using Bayes Theorem.

the probability that a randomly selected report comes from Kerala P(Kerala) = 52 270 = 19.3%
the probability that a randomly selected report shows recoveries P(recoveries) = 55 270 = 20.4%
the probability that a Kerala report shows recoveries P(recoveries|Kerala) = 19 52 = 36.5%
the probability that a Delhi report shows recoveries P(recoveries|Delhi) = 8 20 = 40%
if a report shows recoveries, the probability that it came from Kerala P(Kerala|recoveries) = P(recoveries|Kerala) x P(Kerala) P(recoveries) = 0.365 x 0.193 0.204 = 34.5%

Working through these activities really helped me apply what I've learned about RStudio, and I'm sure I'll be using RStudio a lot in my career as a data scientist!