vignettes/intro_coronavirus_dataset.Rmd
The coronavirus dataset provides a snapshot of the daily confirmed, recovered, and death cases of the 2019 Novel Coronavirus (COVID-19) by geographic location (i.e., country/province). Let’s load the dataset from the coronavirus package:
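The loading step can be sketched as follows, assuming the coronavirus package is already installed. The guard around the call is an addition for illustration only; it prints a message instead of failing when the package is missing.

```r
# Assumption: the coronavirus package is installed (e.g. via install.packages()).
# The requireNamespace() guard is illustrative and not required by the package.
if (requireNamespace("coronavirus", quietly = TRUE)) {
  # Load the `coronavirus` data frame into the current environment
  data("coronavirus", package = "coronavirus")
} else {
  message("Package 'coronavirus' is not installed; skipping the data load.")
}
```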
The dataset has the following fields:
date - The date of the summary
Province.State - The province or state, when applicable
Country.Region - The country or region name
Lat - Latitude point
Long - Longitude point
cases - The number of daily cases (corresponding to the case type)
type - The type of case (i.e., confirmed, recovered, or death)
We can use the head and str functions to see the structure of the dataset:
head(coronavirus)
#>   Province.State Country.Region Lat Long       date cases      type
#> 1                   Afghanistan  33   65 2020-01-22     0 confirmed
#> 2                   Afghanistan  33   65 2020-01-23     0 confirmed
#> 3                   Afghanistan  33   65 2020-01-24     0 confirmed
#> 4                   Afghanistan  33   65 2020-01-25     0 confirmed
#> 5                   Afghanistan  33   65 2020-01-26     0 confirmed
#> 6                   Afghanistan  33   65 2020-01-27     0 confirmed
str(coronavirus)
#> 'data.frame':    64574 obs. of  7 variables:
#>  $ Province.State: chr  "" "" "" "" ...
#>  $ Country.Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#>  $ Lat           : num  33 33 33 33 33 33 33 33 33 33 ...
#>  $ Long          : num  65 65 65 65 65 65 65 65 65 65 ...
#>  $ date          : Date, format: "2020-01-22" "2020-01-23" ...
#>  $ cases         : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ type          : chr  "confirmed" "confirmed" "confirmed" "confirmed" ...
We will use the dplyr and tidyr packages to query, transform, and reshape the data while keeping it tidy, the plotly package to plot the data, and the DT package to view it:
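Attaching those four packages might look like the following, assuming all of them are installed. The comments list only the functions that appear later in this vignette.

```r
# Assumption: dplyr, tidyr, plotly, and DT are already installed.
library(dplyr)   # query and transform: group_by(), summarise(), mutate(), filter()
library(tidyr)   # reshape: pivot_wider()
library(plotly)  # interactive plots: plot_ly(), add_trace(), layout()
library(DT)      # interactive tables: datatable(), formatPercentage()
```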
Let’s start by summarizing the total number of cases by type as of 2020-04-13 and then plot it:
total_cases <- coronavirus %>%
  group_by(type) %>%
  summarise(cases = sum(cases)) %>%
  # list every case type as a level; a type missing from `levels` becomes NA
  mutate(type = factor(type, levels = c("confirmed", "death", "recovered")))
total_cases
#> # A tibble: 3 x 2
#>   type        cases
#>   <fct>       <int>
#> 1 confirmed 1917319
#> 2 death      119482
#> 3 recovered  448655
You can use those numbers to derive the current worldwide death rate (percentage):
round(100 * total_cases$cases[2] / total_cases$cases[1], 2)
#> [1] 6.23
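Indexing by position works only while the rows stay in the order confirmed, death. As a hedged alternative, the sketch below looks the counts up by name instead. It rebuilds a small stand-in table from the confirmed and death totals printed above so that it runs on its own; the names `total_cases_demo` and `cases_by_type` are made up for this example.

```r
# Stand-in for the summary table, using the confirmed and death totals printed above
total_cases_demo <- data.frame(
  type  = c("confirmed", "death"),
  cases = c(1917319, 119482)
)

# Hypothetical helper: a named vector lets us look up counts by case type
cases_by_type <- setNames(total_cases_demo$cases, total_cases_demo$type)

# Death rate as a percentage of confirmed cases
death_rate <- round(100 * cases_by_type[["death"]] / cases_by_type[["confirmed"]], 2)
death_rate
```

Looking values up by name keeps the calculation correct even if the row order of the summary changes.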
plot_ly(data = total_cases,
        x = ~ type,
        y = ~ cases,
        type = 'bar',
        text = ~ paste(type, cases, sep = ": "),
        hoverinfo = 'text') %>%
  layout(title = "Coronavirus - Cases Distribution",
         yaxis = list(title = "Number of Cases"),
         xaxis = list(title = "Case Type"),
         hovermode = "compare")
The next plot represents the daily number of new cases worldwide:
coronavirus %>%
  group_by(date, type) %>%
  summarise(total = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total) %>%
  plot_ly(x = ~ date, y = ~ confirmed,
          name = "Confirmed",
          type = "scatter",
          mode = "none",
          stackgroup = "one",
          fillcolor = "#4C74C9") %>%
  add_trace(y = ~ death,
            name = "Death",
            fillcolor = "#9E0003") %>%
  layout(title = "Covid19 Daily Cases Worldwide",
         legend = list(x = 0.1, y = 0.9),
         yaxis = list(title = "Number of Cases"),
         xaxis = list(title = "Source: Johns Hopkins University Center for Systems Science and Engineering"))
The next table provides an overview of the ten countries with the highest number of confirmed cases. We will use the datatable function from the DT package to view the table:
confirmed_country <- coronavirus %>%
  filter(type == "confirmed") %>%
  group_by(Country.Region) %>%
  summarise(total_cases = sum(cases)) %>%
  mutate(perc = total_cases / sum(total_cases)) %>%
  arrange(-total_cases)
confirmed_country %>%
  head(10) %>%
  datatable(rownames = FALSE,
            colnames = c("Country", "Cases", "Perc of Total")) %>%
  formatPercentage("perc", 2)
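The perc column is each country's share of all confirmed cases. The toy example below uses made-up country labels and counts (not real data) to show the same mutate step in isolation; the names `toy_totals` and `toy_share` are hypothetical.

```r
library(dplyr)

# Made-up totals for three hypothetical countries (not real data)
toy_totals <- tibble(
  Country.Region = c("A", "B", "C"),
  total_cases    = c(50, 30, 20)
)

toy_share <- toy_totals %>%
  # sum(total_cases) is taken over the whole, ungrouped table
  mutate(perc = total_cases / sum(total_cases)) %>%
  arrange(desc(total_cases))

toy_share
```

Because the table is ungrouped at this point, the shares add up to one across all rows.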
The next plot summarizes the distribution of confirmed cases outside of China:
coronavirus %>%
  filter(type == "confirmed",
         Country.Region != "China") %>%
  group_by(Country.Region) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases) %>%
  mutate(country = factor(Country.Region, levels = Country.Region)) %>%
  ungroup() %>%
  plot_ly(labels = ~ country,
          values = ~ total_cases,
          type = "pie",
          textposition = 'inside',
          textinfo = 'label+percent',
          insidetextfont = list(color = '#FFFFFF'),
          hoverinfo = 'text',
          text = ~ paste(country, "<br />",
                         "Number of confirmed cases: ", total_cases, sep = "")) %>%
  layout(title = "Coronavirus - Confirmed Cases (Excluding China)")
Similarly, we can use the pivot_wider function from the tidyr package (in addition to the dplyr functions we used above) to get an overview of the three types of cases (confirmed, recovered, and death). We will then use it to derive the death rate by country. Because most countries do not yet have enough information about the outcomes of their confirmed cases, we will keep only countries with at least 25 confirmed cases:
coronavirus %>%
  filter(Country.Region != "Others") %>%
  group_by(Country.Region, type) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(- confirmed) %>%
  filter(confirmed >= 25) %>%
  mutate(death_rate = death / confirmed) %>%
  # keep only the four columns named in the table header below
  select(Country.Region, confirmed, death, death_rate) %>%
  datatable(rownames = FALSE,
            colnames = c("Country", "Confirmed", "Death", "Death Rate")) %>%
  formatPercentage("death_rate", 2)
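To see what the reshaping step does, here is a small sketch on invented numbers (two hypothetical countries, not real data). pivot_wider turns one row per country and case type into one row per country with a column per type, after which the rate is a simple column division. The names `toy_long` and `toy_wide` are made up for this example.

```r
library(dplyr)
library(tidyr)

# Invented long-format totals: one row per country and case type
toy_long <- tibble(
  Country.Region = rep(c("A", "B"), each = 3),
  type           = rep(c("confirmed", "death", "recovered"), times = 2),
  total_cases    = c(200, 10, 50, 20, 1, 5)
)

toy_wide <- toy_long %>%
  # one column per case type: confirmed, death, recovered
  pivot_wider(names_from = type, values_from = total_cases) %>%
  # keep countries with at least 25 confirmed cases (drops "B")
  filter(confirmed >= 25) %>%
  mutate(death_rate = death / confirmed)

toy_wide
```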
Note that it would be misleading to draw any conclusions about the recovery and death rates from these figures, as the dataset does not include the detailed information needed to interpret them.
The following plot describes the overall distribution of the total confirmed cases in China by province:
coronavirus %>%
  filter(Country.Region == "China",
         type == "confirmed") %>%
  group_by(Province.State, type) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(- confirmed) %>%
  plot_ly(labels = ~ Province.State,
          values = ~ confirmed,
          type = 'pie',
          textposition = 'inside',
          textinfo = 'label+percent',
          insidetextfont = list(color = '#FFFFFF'),
          hoverinfo = 'text',
          text = ~ paste(Province.State, "<br />",
                         "Number of confirmed cases: ", confirmed, sep = "")) %>%
  layout(title = "Total China Confirmed Cases Dist. by Province")