Scraping Online Table With Info on R datasets

This is a very quick post on how to get a list of the datasets available from within R, along with a basic description of each (which package it lives in, and its number of observations and variables). It always takes me some time to find the right dataset to showcase whatever process or method I’m working with, so this was really to make my life easier. So! I’m going to scrape the table listing R datasets from the Rdatasets index page (the URL is in the code below) using the rvest and xml2 packages:

# loading packages
library(rvest)
library(xml2)
library(dplyr)
library(knitr)

# URL to scrape
url <- "https://vincentarelbundock.github.io/Rdatasets/datasets.html"

# scrape the table with relevant info
r_datasets <- read_html(url) %>% # read url
  html_nodes("table") %>% # extract all the tables
  .[[2]] %>% # it's the second table we want
  html_table() # convert it to a usable format (data.frame)
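
Picking the table by position works for the page as I found it, but it would quietly grab the wrong table if the page layout ever changed. As a rough sketch, you could select the table by its column names instead. The helper name `pick_table` and the toy HTML page are my own assumptions for illustration; they are not part of rvest or of the Rdatasets page.

```r
library(rvest)

# Hypothetical helper (not part of rvest): return the first table on a page
# whose header contains all of the expected column names.
pick_table <- function(page, expected_cols) {
  # html_table() on a set of <table> nodes returns a list of data frames
  tables <- html_table(html_nodes(page, "table"))
  has_cols <- vapply(
    tables,
    function(tbl) all(expected_cols %in% names(tbl)),
    logical(1)
  )
  if (!any(has_cols)) {
    stop("No table with the expected columns was found")
  }
  tables[[which(has_cols)[1]]]
}

# Toy page standing in for the real one, so this runs without network access
toy_page <- read_html("<html><body>
  <table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
  <table>
    <tr><th>Package</th><th>Item</th><th>Rows</th></tr>
    <tr><td>boot</td><td>acme</td><td>60</td></tr>
  </table>
</body></html>")

picked <- pick_table(toy_page, c("Package", "Item", "Rows"))
names(picked)
```

On the real page, something like `pick_table(read_html(url), c("Package", "Item"))` should do the same job as `.[[2]]`, assuming the column names are still the ones shown in this post.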

As a result, we get a tidy data frame…

str(r_datasets)
## 'data.frame':    1162 obs. of  11 variables:
##  $ Package      : chr  "boot" "boot" "boot" "boot" ...
##  $ Item         : chr  "acme" "aids" "aircondit" "aircondit7" ...
##  $ Title        : chr  "Monthly Excess Returns" "Delay in AIDS Reporting in England and Wales" "Failures of Air-conditioning Equipment" "Failures of Air-conditioning Equipment" ...
##  $ Rows         : int  60 570 12 24 8437 23 100 49 823 10 ...
##  $ Cols         : int  3 6 1 1 4 3 4 2 3 5 ...
##  $ has_logical  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ has_binary   : logi  FALSE TRUE FALSE FALSE TRUE TRUE ...
##  $ has_numeric  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ has_character: logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ CSV          : chr  "CSV" "CSV" "CSV" "CSV" ...
##  $ Doc          : chr  "DOC" "DOC" "DOC" "DOC" ...
r_datasets %>% 
  select(-c(CSV, Doc)) %>% 
  head() 
##   Package       Item                                           Title Rows
## 1    boot       acme                          Monthly Excess Returns   60
## 2    boot       aids    Delay in AIDS Reporting in England and Wales  570
## 3    boot  aircondit          Failures of Air-conditioning Equipment   12
## 4    boot aircondit7          Failures of Air-conditioning Equipment   24
## 5    boot       amis                  Car Speeding and Warning Signs 8437
## 6    boot        aml Remission Times for Acute Myelogenous Leukaemia   23
##   Cols has_logical has_binary has_numeric has_character
## 1    3       FALSE      FALSE        TRUE          TRUE
## 2    6       FALSE       TRUE        TRUE         FALSE
## 3    1       FALSE      FALSE        TRUE         FALSE
## 4    1       FALSE      FALSE        TRUE         FALSE
## 5    4       FALSE       TRUE        TRUE         FALSE
## 6    3       FALSE       TRUE        TRUE         FALSE

… that we can filter freely, according to our needs:

r_datasets %>% filter(Rows >= 1000 & Cols >= 50) %>% 
  kable()
|Package    |Item     |Title                                                                                       | Rows| Cols|has_logical |has_binary |has_numeric |has_character |CSV |Doc |
|:----------|:--------|:-------------------------------------------------------------------------------------------|----:|----:|:-----------|:----------|:-----------|:-------------|:---|:---|
|Ecdat      |Car      |Stated Preferences for Car Choice                                                           | 4654|   70|FALSE       |TRUE       |TRUE        |FALSE         |CSV |DOC |
|ISLR       |Caravan  |The Insurance Company (TIC) Benchmark                                                       | 5822|   86|FALSE       |TRUE       |TRUE        |FALSE         |CSV |DOC |
|mosaicData |HELPfull |Health Evaluation and Linkage to Primary Care                                               | 1472|  788|FALSE       |TRUE       |TRUE        |FALSE         |CSV |DOC |
|psych      |epi      |Eysenck Personality Inventory (EPI) data for 3570 participants                              | 3570|   57|FALSE       |TRUE       |FALSE       |FALSE         |CSV |DOC |
|psych      |msq      |75 mood items from the Motivational State Questionnaire for 3896 participants               | 3896|   92|FALSE       |TRUE       |TRUE        |FALSE         |CSV |DOC |
|psych      |msqR     |75 mood items from the Motivational State Questionnaire for 3032 unique participants        | 6411|   79|FALSE       |TRUE       |TRUE        |TRUE          |CSV |DOC |
|psych      |spi      |A sample from the SAPA Personality Inventory including an item dictionary and scoring keys. | 4000|  145|FALSE       |TRUE       |TRUE        |FALSE         |CSV |DOC |

r_datasets %>% filter(grepl("cat", Item)) %>% kable()

|Package    |Item      |Title                                      | Rows| Cols|has_logical |has_binary |has_numeric |has_character |CSV |Doc |
|:----------|:---------|:------------------------------------------|----:|----:|:-----------|:----------|:-----------|:-------------|:---|:---|
|boot       |catsM     |Weight Data for Domestic Cats              |   97|    3|FALSE       |FALSE      |TRUE        |FALSE         |CSV |DOC |
|MASS       |cats      |Anatomical Data from Domestic Cats         |  144|    3|FALSE       |TRUE       |TRUE        |FALSE         |CSV |DOC |
|psych      |cattell   |12 cognitive variables from Cattell (1963) |   12|   12|FALSE       |FALSE      |TRUE        |FALSE         |CSV |DOC |
|robustbase |education |Education Expenditure Data                 |   50|    6|FALSE       |FALSE      |FALSE       |FALSE         |CSV |DOC |
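
Note that `grepl("cat", Item)` matches "cat" anywhere in the name, which is why "education" sneaks into the results. Here is a small sketch of how the search could be tightened with an anchored pattern, and how a hit can then be loaded by name. The `cat_hits` data frame is a hand-typed copy of the four rows from the search (a subset of columns), so the example runs without scraping anything.

```r
library(dplyr)

# Hand-typed copy of the four rows returned by the "cat" search
cat_hits <- data.frame(
  Package = c("boot", "MASS", "psych", "robustbase"),
  Item    = c("catsM", "cats", "cattell", "education"),
  Rows    = c(97L, 144L, 12L, 50L),
  Cols    = c(3L, 3L, 12L, 6L),
  stringsAsFactors = FALSE
)

# "^cat" only matches names that start with "cat", so "education" drops out
starts_with_cat <- cat_hits %>% filter(grepl("^cat", Item))
starts_with_cat$Item

# Load one of the hits by name from its package (MASS ships with R)
data(list = "cats", package = "MASS")
dim(cats) # should agree with the Rows and Cols listed for MASS cats
```

The same `data(list = ..., package = ...)` call works for any Package/Item pair in the table, as long as that package is installed.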

This totally made my life easier, so I hope it will help you, too!

 