This blog post announces and briefly describes the Python package “ExampleDatasets” for obtaining example datasets.
Currently, this package contains only dataset metadata; the datasets themselves are downloaded on demand from the repository Rdatasets, [VAB1].
This package follows the design of the Raku package “Data::ExampleDatasets”; see [AAr1].
Usage examples
Setup
Here we load this package and pandas (the package time is imported later, for the timings below):
from ExampleDatasets import *
import pandas
Get a dataset by using an identifier
Here we get a dataset by using an identifier and display part of the obtained dataset:
tbl = example_dataset(itemSpec = 'Baumann')
tbl.head()
   Unnamed: 0  group  pretest.1  pretest.2  post.test.1  post.test.2  post.test.3
0           1  Basal          4          3            5            4           41
1           2  Basal          6          5            9            5           41
2           3  Basal          9          4            5            3           43
3           4  Basal         12          6            8            5           46
4           5  Basal         16          5           10            9           46
Here we summarize the dataset obtained above:
tbl.describe()
| | Unnamed: 0 | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|---|---|---|---|---|---|---|
| count | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 |
| mean | 33.500000 | 9.787879 | 5.106061 | 8.075758 | 6.712121 | 44.015152 |
| std | 19.196354 | 3.020520 | 2.212752 | 3.393707 | 2.635644 | 6.643661 |
| min | 1.000000 | 4.000000 | 1.000000 | 1.000000 | 0.000000 | 30.000000 |
| 25% | 17.250000 | 8.000000 | 3.250000 | 5.000000 | 5.000000 | 40.000000 |
| 50% | 33.500000 | 9.000000 | 5.000000 | 8.000000 | 6.000000 | 45.000000 |
| 75% | 49.750000 | 12.000000 | 6.000000 | 11.000000 | 8.000000 | 49.000000 |
| max | 66.000000 | 16.000000 | 13.000000 | 15.000000 | 13.000000 | 57.000000 |
Remark: The values of the arguments itemSpec and packageSpec correspond to the values of the columns “Item” and “Package”, respectively, in the metadata dataset from the GitHub repository “Rdatasets”, [VAB1]. See the section “Datasets metadata” below.
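Presumably, example_dataset resolves the (itemSpec, packageSpec) pair to a CSV URL through exactly those metadata columns. Here is a minimal sketch of that lookup over a toy metadata frame; both the frame and the resolve_csv_url helper are illustrative assumptions, not part of the package:

```python
import pandas

# Toy stand-in for the Rdatasets metadata (illustrative rows only),
# mirroring its "Package", "Item", and "CSV" columns.
dfToy = pandas.DataFrame({
    "Package": ["COUNT", "COUNT", "carData"],
    "Item": ["titanic", "titanicgrp", "Baumann"],
    "CSV": [
        "https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv",
        "https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv",
        "https://vincentarelbundock.github.io/Rdatasets/csv/carData/Baumann.csv",
    ],
})

def resolve_csv_url(meta, item_spec, package_spec=None):
    """Return the CSV URL for a given item (and, optionally, package)."""
    sel = meta[meta["Item"] == item_spec]
    if package_spec is not None:
        sel = sel[sel["Package"] == package_spec]
    if len(sel) != 1:
        raise ValueError("expected exactly one match, got " + str(len(sel)))
    return sel["CSV"].iloc[0]

print(resolve_csv_url(dfToy, "titanic", "COUNT"))
```

When an item name occurs in several packages, packageSpec disambiguates; without it the toy helper raises an error rather than picking one silently.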
Get a dataset by using a URL
Here we find the URLs of datasets whose titles match a regex:
dfMeta = load_datasets_metadata()
print(dfMeta[dfMeta.Title.str.contains('^tita')][["Package", "Item", "CSV"]].to_string())
Package Item CSV
288 COUNT titanic https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv
289 COUNT titanicgrp https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv
Here we get a dataset through pandas by using a URL and display the head of the obtained dataset:
import pandas
url = 'https://raw.githubusercontent.com/antononcube/Raku-Data-Reshapers/main/resources/dfTitanic.csv'
tbl2 = pandas.read_csv(url)
tbl2.head()
| | id | passengerClass | passengerAge | passengerSex | passengerSurvival |
|---|---|---|---|---|---|
| 0 | 1 | 1st | 30 | female | survived |
| 1 | 2 | 1st | 0 | male | survived |
| 2 | 3 | 1st | 0 | female | died |
| 3 | 4 | 1st | 30 | male | died |
| 4 | 5 | 1st | 20 | female | died |
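Once obtained, the dataset is an ordinary pandas frame, so standard operations apply. As a quick illustration, here is a survival-by-sex cross-tabulation over a small hand-made frame with the same column names; the rows below are made up, not actual Titanic records:

```python
import pandas

# Small made-up frame with the same columns as dfTitanic above.
df = pandas.DataFrame({
    "passengerSex": ["female", "male", "female", "male", "female"],
    "passengerSurvival": ["survived", "survived", "died", "died", "died"],
})

# Contingency table of survival counts per sex.
ct = pandas.crosstab(df["passengerSex"], df["passengerSurvival"])
print(ct)
```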
Datasets metadata
Here we:
- Get the dataset of the datasets metadata
- Keep only the columns “Item”, “Title”, “Rows”, and “Cols”
- Filter it to rows describing datasets with exactly 13 rows
- Display it
tblMeta = load_datasets_metadata()
tblMeta = tblMeta[["Item", "Title", "Rows", "Cols"]]
tblMeta = tblMeta[tblMeta["Rows"] == 13]
tblMeta
| | Item | Title | Rows | Cols |
|---|---|---|---|---|
| 805 | Snow.pumps | John Snow’s Map and Data on the 1854 London Ch… | 13 | 4 |
| 820 | BCG | BCG Vaccine Data | 13 | 7 |
| 935 | cement | Heat Evolved by Setting Cements | 13 | 5 |
| 1354 | kootenay | Waterflow Measurements of Kootenay River in Li… | 13 | 2 |
| 1644 | Newhouse77 | Medical-Care Expenditure: A Cross-National Sur… | 13 | 5 |
| 1735 | Saxony | Families in Saxony | 13 | 2 |
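The same select-then-filter chain can be tried offline on a toy frame; DataFrame.query is an equivalent, arguably more readable, way to express the row filter (the frame below is illustrative, not real metadata):

```python
import pandas

# Toy stand-in for the metadata frame (made-up rows).
tblToy = pandas.DataFrame({
    "Item": ["Saxony", "cement", "iris"],
    "Title": ["Families in Saxony",
              "Heat Evolved by Setting Cements",
              "Edgar Anderson's Iris Data"],
    "Rows": [13, 13, 150],
    "Cols": [2, 5, 5],
})

# Keep the selected columns, then filter to 13-row datasets.
res = tblToy[["Item", "Title", "Rows", "Cols"]].query("Rows == 13")
print(res)
```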
Keeping downloaded data
By default the data is obtained over the web from Rdatasets, but example_dataset has an option to keep the data locally. (The data is saved under XDG_DATA_HOME; see [SS1].)
This can be demonstrated with the following timings for a dataset with ~1300 rows:
import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data first time took " + str( endTime - startTime ) + " seconds")
Getting the data first time took 0.003923892974853516 seconds
import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data second time took " + str( endTime - startTime ) + " seconds")
Getting the data second time took 0.003058910369873047 seconds
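A minimal version of such local keeping can be sketched with pathlib and pandas: read from a cache file when present, otherwise download and save. The cache-path scheme and the helper name cached_csv here are assumptions for illustration; the package itself relies on the xdg package, see [SS1]. The demonstration pre-seeds the cache so no network access is needed:

```python
import pathlib
import tempfile

import pandas

def cached_csv(url, cache_dir):
    """Read a CSV from cache_dir if already saved there, else fetch and keep it."""
    path = pathlib.Path(cache_dir) / url.rsplit("/", 1)[-1]
    if path.exists():
        return pandas.read_csv(path)    # cache hit: no network access
    df = pandas.read_csv(url)           # cache miss: fetch over the web
    df.to_csv(path, index=False)        # keep for next time
    return df

# Demonstrate the cache hit with a pre-seeded file.
cacheDir = tempfile.mkdtemp()
pathlib.Path(cacheDir, "titanic.csv").write_text("a,b\n1,2\n")
data = cached_csv("https://example.com/titanic.csv", cacheDir)
print(data)
```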
References
Functions, packages, repositories
[AAf1] Anton Antonov, ExampleDataset, (2020), Wolfram Function Repository.
[AAr1] Anton Antonov, Data::ExampleDatasets Raku package, (2021), GitHub/antononcube.
[VAB1] Vincent Arel-Bundock, Rdatasets, (2020), GitHub/vincentarelbundock.
[SS1] Scott Stevenson, xdg Python package, (2016-2021), PyPI.org.
Interactive interfaces
[AAi1] Anton Antonov, Example datasets recommender interface, (2021), Shinyapps.io.