This is a very quick post on how to get a list of the datasets available from within R, along with a basic description of each (which package it can be found in, and its number of observations and variables). It always takes me some time to find the right dataset to showcase whatever process or method I’m working with, so this was really to make my life easier. So! I’m going to scrape the table listing R datasets from the Rdatasets index page using the rvest and xml2 packages:
# loading packages
library(rvest)
library(xml2)
library(dplyr)
library(knitr)
# URL to scrape
url <- "https://vincentarelbundock.github.io/Rdatasets/datasets.html"
# scrape the table with relevant info
r_datasets <- read_html(url) %>% # read url
  html_nodes("table") %>% # extract all the tables
  .[[2]] %>% # it's the second table we want
  html_table() # convert it to a usable format (data.frame)
As a result, we get a tidy data frame…
str(r_datasets)
## 'data.frame': 1162 obs. of 11 variables:
## $ Package : chr "boot" "boot" "boot" "boot" ...
## $ Item : chr "acme" "aids" "aircondit" "aircondit7" ...
## $ Title : chr "Monthly Excess Returns" "Delay in AIDS Reporting in England and Wales" "Failures of Air-conditioning Equipment" "Failures of Air-conditioning Equipment" ...
## $ Rows : int 60 570 12 24 8437 23 100 49 823 10 ...
## $ Cols : int 3 6 1 1 4 3 4 2 3 5 ...
## $ has_logical : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ has_binary : logi FALSE TRUE FALSE FALSE TRUE TRUE ...
## $ has_numeric : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ has_character: logi TRUE FALSE FALSE FALSE FALSE FALSE ...
## $ CSV : chr "CSV" "CSV" "CSV" "CSV" ...
## $ Doc : chr "DOC" "DOC" "DOC" "DOC" ...
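One caveat about the scraping step: `.[[2]]` picks the table by position, which would quietly grab the wrong one if the page layout ever changed. A minimal sketch of a sturdier alternative is to pick the table by the column names seen in the structure above. The inline HTML here is a made-up stand-in for the real page so the example runs offline, and `has_cols` and `datasets` are illustrative names, not part of the original code:

```r
library(rvest)
library(xml2)

# Made-up stand-in for the real page: a decoy table followed by the one we want
page <- read_html('
  <html><body>
  <table><tr><th>Note</th></tr><tr><td>navigation</td></tr></table>
  <table>
    <tr><th>Package</th><th>Item</th><th>Rows</th></tr>
    <tr><td>boot</td><td>acme</td><td>60</td></tr>
    <tr><td>MASS</td><td>cats</td><td>144</td></tr>
  </table>
  </body></html>')

# Convert every table on the page to a data frame
tables <- html_table(html_nodes(page, "table"))

# Flag the tables that carry the columns we expect
has_cols <- vapply(
  tables,
  function(tbl) all(c("Package", "Item") %in% names(tbl)),
  logical(1)
)

# Keep the first table that matches
datasets <- tables[[which(has_cols)[1]]]
datasets
```

With the real page, `page` would simply be `read_html(url)`; the rest should carry over as long as the header names stay the same.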
r_datasets %>%
  select(-c(CSV, Doc)) %>%
  head()
## Package Item Title Rows
## 1 boot acme Monthly Excess Returns 60
## 2 boot aids Delay in AIDS Reporting in England and Wales 570
## 3 boot aircondit Failures of Air-conditioning Equipment 12
## 4 boot aircondit7 Failures of Air-conditioning Equipment 24
## 5 boot amis Car Speeding and Warning Signs 8437
## 6 boot aml Remission Times for Acute Myelogenous Leukaemia 23
## Cols has_logical has_binary has_numeric has_character
## 1 3 FALSE FALSE TRUE TRUE
## 2 6 FALSE TRUE TRUE FALSE
## 3 1 FALSE FALSE TRUE FALSE
## 4 1 FALSE FALSE TRUE FALSE
## 5 4 FALSE TRUE TRUE FALSE
## 6 3 FALSE TRUE TRUE FALSE
…that we can filter freely, according to our needs:
r_datasets %>%
  filter(Rows >= 1000 & Cols >= 50) %>%
  kable()
Package | Item | Title | Rows | Cols | has_logical | has_binary | has_numeric | has_character | CSV | Doc |
---|---|---|---|---|---|---|---|---|---|---|
Ecdat | Car | Stated Preferences for Car Choice | 4654 | 70 | FALSE | TRUE | TRUE | FALSE | CSV | DOC |
ISLR | Caravan | The Insurance Company (TIC) Benchmark | 5822 | 86 | FALSE | TRUE | TRUE | FALSE | CSV | DOC |
mosaicData | HELPfull | Health Evaluation and Linkage to Primary Care | 1472 | 788 | FALSE | TRUE | TRUE | FALSE | CSV | DOC |
psych | epi | Eysenck Personality Inventory (EPI) data for 3570 participants | 3570 | 57 | FALSE | TRUE | FALSE | FALSE | CSV | DOC |
psych | msq | 75 mood items from the Motivational State Questionnaire for 3896 participants | 3896 | 92 | FALSE | TRUE | TRUE | FALSE | CSV | DOC |
psych | msqR | 75 mood items from the Motivational State Questionnaire for 3032 unique participants | 6411 | 79 | FALSE | TRUE | TRUE | TRUE | CSV | DOC |
psych | spi | A sample from the SAPA Personality Inventory including an item dictionary and scoring keys. | 4000 | 145 | FALSE | TRUE | TRUE | FALSE | CSV | DOC |
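Once a filter like this turns up a candidate, the `Package` and `Item` columns are what you need to load it. A quick sketch using `acme` from `boot`, the first row of the scraped table; it assumes the `boot` package is installed, which is normally the case because it ships with R as a recommended package:

```r
# Load a dataset by name from the package listed in the Package column
# (assumption: the boot package is installed)
data("acme", package = "boot")

# Compare against the Rows and Cols values listed in the scraped table
dim(acme)
head(acme)
```

For packages you don't have yet, an `install.packages()` call comes first.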
r_datasets %>% filter(grepl("cat", Item)) %>% kable()
Package | Item | Title | Rows | Cols | has_logical | has_binary | has_numeric | has_character | CSV | Doc |
---|---|---|---|---|---|---|---|---|---|---|
boot | catsM | Weight Data for Domestic Cats | 97 | 3 | FALSE | FALSE | TRUE | FALSE | CSV | DOC |
MASS | cats | Anatomical Data from Domestic Cats | 144 | 3 | FALSE | TRUE | TRUE | FALSE | CSV | DOC |
psych | cattell | 12 cognitive variables from Cattell (1963) | 12 | 12 | FALSE | FALSE | TRUE | FALSE | CSV | DOC |
robustbase | education | Education Expenditure Data | 50 | 6 | FALSE | FALSE | FALSE | FALSE | CSV | DOC |
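Note the last row: `education` shows up because `grepl("cat", Item)` matches the pattern anywhere in the name, not just at the start. If only names beginning with "cat" are wanted, anchoring the pattern with `^` is one option. A small sketch on the four rows above, rebuilt by hand as `cat_matches` (an illustrative name) so it runs without scraping:

```r
# The four matches from the table above, typed in by hand
cat_matches <- data.frame(
  Package = c("boot", "MASS", "psych", "robustbase"),
  Item = c("catsM", "cats", "cattell", "education"),
  stringsAsFactors = FALSE
)

# "^" anchors the pattern to the start of the string,
# so "education" no longer matches
starts_with_cat <- cat_matches[grepl("^cat", cat_matches$Item), ]
starts_with_cat
```

The same anchored pattern drops straight into the dplyr version: `filter(grepl("^cat", Item))`.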
This totally made my life easier, so I hope it will help you, too!