Raku package for (obtaining) example datasets.
Currently, this repository contains only dataset metadata. The datasets themselves are downloaded from the repository Rdatasets, [VAB1].
Here we load the Raku modules Data::Reshapers, Data::Summarizers, and this module, Data::ExampleDatasets:
use Data::Reshapers;
use Data::Summarizers;
use Data::ExampleDatasets;
# (Any)
Here we get a dataset by using an identifier and display part of the obtained dataset:
my @tbl = example-dataset('Baumann', :headers);
say to-pretty-table(@tbl[^6]);
# +-------+-------------+-----------+-------------+----------+-----------+-------------+
# | group | post.test.2 | pretest.1 | post.test.3 | rownames | pretest.2 | post.test.1 |
# +-------+-------------+-----------+-------------+----------+-----------+-------------+
# | Basal | 4 | 4 | 41 | 1 | 3 | 5 |
# | Basal | 5 | 6 | 41 | 2 | 5 | 9 |
# | Basal | 3 | 9 | 43 | 3 | 4 | 5 |
# | Basal | 5 | 12 | 46 | 4 | 6 | 8 |
# | Basal | 9 | 16 | 46 | 5 | 5 | 10 |
# | Basal | 8 | 15 | 45 | 6 | 13 | 9 |
# +-------+-------------+-----------+-------------+----------+-----------+-------------+
Here we summarize the dataset obtained above:
records-summary(@tbl);
# +-------------+--------------------+--------------------+----------------+--------------------+--------------------+---------------------+
# | group | pretest.1 | post.test.1 | rownames | pretest.2 | post.test.2 | post.test.3 |
# +-------------+--------------------+--------------------+----------------+--------------------+--------------------+---------------------+
# | Basal => 22 | Min => 4 | Min => 1 | Min => 1 | Min => 1 | Min => 0 | Min => 30 |
# | Strat => 22 | 1st-Qu => 8 | 1st-Qu => 5 | 1st-Qu => 17 | 1st-Qu => 3 | 1st-Qu => 5 | 1st-Qu => 40 |
# | DRTA => 22 | Mean => 9.787879 | Mean => 8.075758 | Mean => 33.5 | Mean => 5.106061 | Mean => 6.712121 | Mean => 44.015152 |
# | | Median => 9 | Median => 8 | Median => 33.5 | Median => 5 | Median => 6 | Median => 45 |
# | | 3rd-Qu => 12 | 3rd-Qu => 11 | 3rd-Qu => 50 | 3rd-Qu => 6 | 3rd-Qu => 8 | 3rd-Qu => 49 |
# | | Max => 16 | Max => 15 | Max => 66 | Max => 13 | Max => 13 | Max => 57 |
# +-------------+--------------------+--------------------+----------------+--------------------+--------------------+---------------------+
Remark: The values for the first argument of example-dataset correspond to the values of the columns "Item" and "Package" in the metadata dataset from the GitHub repository "Rdatasets", [VAB1].
See the datasets metadata sub-section below.
The first argument of example-dataset can take as values:
- Strings that correspond to the column "Item" of the metadata dataset
  - E.g. example-dataset("mtcars")
- Strings that correspond to the columns "Package" and "Item" of the metadata dataset
  - E.g. example-dataset("COUNT::titanic")
- Regexes
  - E.g. example-dataset(/ .* mann $ /)
- Whatever or WhateverCode
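To preview which names a regex specification would pick up, the pattern can be tried on plain strings first. The sketch below only illustrates Raku regex matching: the list @names is a hand-picked assumption built from identifiers mentioned in this document, not the actual metadata, and it does not show how example-dataset itself applies the regex.

```raku
# Hypothetical list of identifiers (hand-picked for illustration, not the real metadata)
my @names = 'mtcars', 'COUNT::titanic', 'Baumann';

# Names ending in "mann"
my @mann = @names.grep(/ .* mann $ /);
say @mann.join(', ');       # Baumann

# Names ending in the literal string "COUNT::titanic"
my @titanic = @names.grep(/ 'COUNT::titanic' $ /);
say @titanic.join(', ');    # COUNT::titanic
```

Checking a pattern this way needs neither the package nor network access, so an overly broad regex can be caught before any download is attempted.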
Here we get a dataset by using a URL and display a summary of the obtained dataset:
my $url = 'https://raw.githubusercontent.com/antononcube/Raku-Data-Reshapers/main/resources/dfTitanic.csv';
my @tbl2 = example-dataset($url, :headers);
records-summary(@tbl2, field-names => <id passengerSex passengerClass passengerAge passengerSurvival>);
# +-----------------+---------------+----------------+---------------------+-------------------+
# | id | passengerSex | passengerClass | passengerAge | passengerSurvival |
# +-----------------+---------------+----------------+---------------------+-------------------+
# | Min => 1 | male => 843 | 3rd => 709 | Min => -1 | died => 809 |
# | 1st-Qu => 327.5 | female => 466 | 1st => 323 | 1st-Qu => 10 | survived => 500 |
# | Mean => 655 | | 2nd => 277 | Mean => 23.550038 | |
# | Median => 655 | | | Median => 20 | |
# | 3rd-Qu => 982.5 | | | 3rd-Qu => 40 | |
# | Max => 1309 | | | Max => 80 | |
# +-----------------+---------------+----------------+---------------------+-------------------+
Here we:
- Get the dataset of the datasets metadata
- Filter it to have only datasets with 13 rows
- Keep only the columns "Item", "Title", "Rows", and "Cols"
- Display it in "pretty table" format
my @tblMeta = get-datasets-metadata();
@tblMeta = @tblMeta.grep({ $_<Rows> == 13}).map({ $_.grep({ $_.key (elem) <Item Title Rows Cols>}).Hash });
say to-pretty-table(@tblMeta, field-names => <Item Title Rows Cols>);
# +------------+--------------------------------------------------------------------+------+------+
# | Item | Title | Rows | Cols |
# +------------+--------------------------------------------------------------------+------+------+
# | Snow.pumps | John Snow's Map and Data on the 1854 London Cholera Outbreak | 13 | 4 |
# | BCG | BCG Vaccine Data | 13 | 7 |
# | cement | Heat Evolved by Setting Cements | 13 | 5 |
# | kootenay | Waterflow Measurements of Kootenay River in Libby and Newgate | 13 | 2 |
# | Newhouse77 | Medical-Care Expenditure: A Cross-National Survey (Newhouse, 1977) | 13 | 5 |
# | Saxony | Families in Saxony | 13 | 2 |
# +------------+--------------------------------------------------------------------+------+------+
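The filter-and-project step above can also be tried offline on a few hand-written records. This is a minimal sketch in plain Raku: the array @meta is a toy stand-in (an assumption for illustration) with two rows copied from the table above plus one mtcars row, and it is not the output of get-datasets-metadata.

```raku
# Toy stand-in for the metadata: an array of hashes (illustrative values only)
my @meta =
    { Item => 'Snow.pumps', Title => "John Snow's Map and Data on the 1854 London Cholera Outbreak", Rows => 13, Cols => 4 },
    { Item => 'BCG',        Title => 'BCG Vaccine Data',                                              Rows => 13, Cols => 7 },
    { Item => 'mtcars',     Title => 'Motor Trend Car Road Tests',                                    Rows => 32, Cols => 11 };

# Keep only the records with 13 rows, then keep only the keys Item, Rows, and Cols
my @small = @meta
    .grep({ $_<Rows> == 13 })
    .map({ $_.grep({ $_.key (elem) <Item Rows Cols> }).Hash });

say @small.elems;                    # 2
say @small.map(*<Item>).join(', ');  # Snow.pumps, BCG
```

The outer grep selects rows, while the inner grep runs over the key-value pairs of each hash and selects columns; the trailing .Hash turns the surviving pairs back into a hash.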
By default the data is obtained over the web from Rdatasets, but example-dataset has an option, keep, to store the data "locally." (The data is saved in XDG_DATA_HOME, see [JS1].)
This can be demonstrated with the following timings of a dataset with ~1300 rows:
my $startTime = now;
my $data = example-dataset( / 'COUNT::titanic' $ / ):keep;
my $endTime = now;
say "Getting the data first time took { $endTime - $startTime } seconds";
# Getting the data first time took 0.76011044 seconds
$startTime = now;
$data = example-dataset( / 'COUNT::titanic' $ / ):keep;
$endTime = now;
say "Getting the data second time took { $endTime - $startTime } seconds";
# Getting the data second time took 0.764633055 seconds
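To see where locally kept data may end up, the XDG data directory can be resolved by hand. The sketch below assumes only the XDG Base Directory convention (the XDG_DATA_HOME environment variable, falling back to .local/share under the home directory); the variable name $data-home is hypothetical, and the sub-directory and file names the package uses inside that directory are not stated here.

```raku
# Resolve the XDG data directory: use the environment variable if it is set
# and non-empty, otherwise fall back to the conventional default under $HOME.
my $data-home = (%*ENV<XDG_DATA_HOME> || $*HOME.add('.local/share')).IO;

say $data-home;
say $data-home.d ?? 'directory exists' !! 'directory does not exist yet';
```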
[AAf1] Anton Antonov, ExampleDataset, (2020), Wolfram Function Repository.
[VAB1] Vincent Arel-Bundock, Rdatasets, (2020), GitHub/vincentarelbundock.
[JS1] Jonathan Stowe, XDG::BaseDirectory, (last updated on 2021-03-31), Raku Modules.
[AAi1] Anton Antonov, Example datasets recommender interface, (2021), Shinyapps.io.