This blog post announces and briefly describes the Python package “ExampleDatasets” for obtaining example datasets.
Currently, this package contains only dataset metadata; the datasets themselves are downloaded on demand from the repository Rdatasets, [VAB1].
This package follows the design of the Raku package “Data::ExampleDatasets”; see [AAr1].
Usage examples
Setup
Here we load this package and pandas (the package time is imported later, for the timings below):
from ExampleDatasets import *
import pandas
Get a dataset by using an identifier
Here we get a dataset by using an identifier and display part of the obtained dataset:
tbl = example_dataset(itemSpec = 'Baumann')
tbl.head()
   Unnamed: 0  group  pretest.1  pretest.2  post.test.1  post.test.2  post.test.3
0           1  Basal          4          3            5            4           41
1           2  Basal          6          5            9            5           41
2           3  Basal          9          4            5            3           43
3           4  Basal         12          6            8            5           46
4           5  Basal         16          5           10            9           46
Here we summarize the dataset obtained above:
tbl.describe()
| | Unnamed: 0 | pretest.1 | pretest.2 | post.test.1 | post.test.2 | post.test.3 |
|---|---|---|---|---|---|---|
| count | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 | 66.000000 |
| mean | 33.500000 | 9.787879 | 5.106061 | 8.075758 | 6.712121 | 44.015152 |
| std | 19.196354 | 3.020520 | 2.212752 | 3.393707 | 2.635644 | 6.643661 |
| min | 1.000000 | 4.000000 | 1.000000 | 1.000000 | 0.000000 | 30.000000 |
| 25% | 17.250000 | 8.000000 | 3.250000 | 5.000000 | 5.000000 | 40.000000 |
| 50% | 33.500000 | 9.000000 | 5.000000 | 8.000000 | 6.000000 | 45.000000 |
| 75% | 49.750000 | 12.000000 | 6.000000 | 11.000000 | 8.000000 | 49.000000 |
| max | 66.000000 | 16.000000 | 13.000000 | 15.000000 | 13.000000 | 57.000000 |
Remark: The values of the arguments itemSpec and packageSpec correspond to the values of the columns “Item” and “Package”, respectively, in the metadata dataset from the GitHub repository “Rdatasets”, [VAB1]. See the section “Datasets metadata” below.
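Presumably, example_dataset resolves the (itemSpec, packageSpec) pair to a CSV URL through exactly those metadata columns. Here is a minimal sketch of that lookup over a toy metadata frame; both the frame and the resolve_csv_url helper are illustrative assumptions, not part of the package:

```python
import pandas

# Toy stand-in for the Rdatasets metadata (illustrative rows only),
# mirroring its "Package", "Item", and "CSV" columns.
dfToy = pandas.DataFrame({
    "Package": ["COUNT", "COUNT", "carData"],
    "Item": ["titanic", "titanicgrp", "Baumann"],
    "CSV": [
        "https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv",
        "https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv",
        "https://vincentarelbundock.github.io/Rdatasets/csv/carData/Baumann.csv",
    ],
})

def resolve_csv_url(meta, item_spec, package_spec=None):
    """Return the CSV URL for a given item (and, optionally, package)."""
    sel = meta[meta["Item"] == item_spec]
    if package_spec is not None:
        sel = sel[sel["Package"] == package_spec]
    if len(sel) != 1:
        raise ValueError("expected exactly one match, got " + str(len(sel)))
    return sel["CSV"].iloc[0]

print(resolve_csv_url(dfToy, "titanic", "COUNT"))
```

When an item name occurs in several packages, packageSpec disambiguates; without it the toy helper raises an error rather than picking one silently.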
Get a dataset by using a URL
Here we find the URLs of datasets whose titles match a regex:
dfMeta = load_datasets_metadata()
print(dfMeta[dfMeta.Title.str.contains('^tita')][["Package", "Item", "CSV"]].to_string())
Package Item CSV
288 COUNT titanic https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanic.csv
289 COUNT titanicgrp https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/titanicgrp.csv
Here we get a dataset through pandas by using a URL and display the head of the obtained dataset:
import pandas
url = 'https://raw.githubusercontent.com/antononcube/Raku-Data-Reshapers/main/resources/dfTitanic.csv'
tbl2 = pandas.read_csv(url)
tbl2.head()
| | id | passengerClass | passengerAge | passengerSex | passengerSurvival |
|---|---|---|---|---|---|
| 0 | 1 | 1st | 30 | female | survived |
| 1 | 2 | 1st | 0 | male | survived |
| 2 | 3 | 1st | 0 | female | died |
| 3 | 4 | 1st | 30 | male | died |
| 4 | 5 | 1st | 20 | female | died |
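Once obtained, the dataset is an ordinary pandas frame, so standard operations apply. As a quick illustration, here is a survival-by-sex cross-tabulation over a small hand-made frame with the same column names; the rows below are made up, not actual Titanic records:

```python
import pandas

# Small made-up frame with the same columns as dfTitanic above.
df = pandas.DataFrame({
    "passengerSex": ["female", "male", "female", "male", "female"],
    "passengerSurvival": ["survived", "survived", "died", "died", "died"],
})

# Contingency table of survival counts per sex.
ct = pandas.crosstab(df["passengerSex"], df["passengerSurvival"])
print(ct)
```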
Datasets metadata
Here we:
- Get the dataset of the datasets metadata
- Keep only the columns “Item”, “Title”, “Rows”, and “Cols”
- Filter it to rows describing datasets with exactly 13 rows
- Display it
tblMeta = load_datasets_metadata()
tblMeta = tblMeta[["Item", "Title", "Rows", "Cols"]]
tblMeta = tblMeta[tblMeta["Rows"] == 13]
tblMeta
| | Item | Title | Rows | Cols |
|---|---|---|---|---|
| 805 | Snow.pumps | John Snow’s Map and Data on the 1854 London Ch… | 13 | 4 |
| 820 | BCG | BCG Vaccine Data | 13 | 7 |
| 935 | cement | Heat Evolved by Setting Cements | 13 | 5 |
| 1354 | kootenay | Waterflow Measurements of Kootenay River in Li… | 13 | 2 |
| 1644 | Newhouse77 | Medical-Care Expenditure: A Cross-National Sur… | 13 | 5 |
| 1735 | Saxony | Families in Saxony | 13 | 2 |
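The same select-then-filter chain can be tried offline on a toy frame; DataFrame.query is an equivalent, arguably more readable, way to express the row filter (the frame below is illustrative, not real metadata):

```python
import pandas

# Toy stand-in for the metadata frame (made-up rows).
tblToy = pandas.DataFrame({
    "Item": ["Saxony", "cement", "iris"],
    "Title": ["Families in Saxony",
              "Heat Evolved by Setting Cements",
              "Edgar Anderson's Iris Data"],
    "Rows": [13, 13, 150],
    "Cols": [2, 5, 5],
})

# Keep the selected columns, then filter to 13-row datasets.
res = tblToy[["Item", "Title", "Rows", "Cols"]].query("Rows == 13")
print(res)
```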
Keeping downloaded data
By default the data is obtained over the web from Rdatasets, but example_dataset has an option to keep the data locally. (The data is saved under XDG_DATA_HOME; see [SS1].)
This can be demonstrated with the following timings for a dataset with ~1300 rows:
import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data first time took " + str( endTime - startTime ) + " seconds")
Getting the data first time took 0.003923892974853516 seconds
import time
startTime = time.time()
data = example_dataset(itemSpec = 'titanic', packageSpec = 'COUNT', keep = True)
endTime = time.time()
print("Getting the data second time took " + str( endTime - startTime ) + " seconds")
Getting the data second time took 0.003058910369873047 seconds
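A minimal version of such local keeping can be sketched with pathlib and pandas: read from a cache file when present, otherwise download and save. The cache-path scheme and the helper name cached_csv here are assumptions for illustration; the package itself relies on the xdg package, see [SS1]. The demonstration pre-seeds the cache so no network access is needed:

```python
import pathlib
import tempfile

import pandas

def cached_csv(url, cache_dir):
    """Read a CSV from cache_dir if already saved there, else fetch and keep it."""
    path = pathlib.Path(cache_dir) / url.rsplit("/", 1)[-1]
    if path.exists():
        return pandas.read_csv(path)    # cache hit: no network access
    df = pandas.read_csv(url)           # cache miss: fetch over the web
    df.to_csv(path, index=False)        # keep for next time
    return df

# Demonstrate the cache hit with a pre-seeded file.
cacheDir = tempfile.mkdtemp()
pathlib.Path(cacheDir, "titanic.csv").write_text("a,b\n1,2\n")
data = cached_csv("https://example.com/titanic.csv", cacheDir)
print(data)
```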
References
Functions, packages, repositories
[AAf1] Anton Antonov, ExampleDataset, (2020), Wolfram Function Repository.
[AAr1] Anton Antonov, Data::ExampleDatasets Raku package, (2021), GitHub/antononcube.
[VAB1] Vincent Arel-Bundock, Rdatasets, (2020), GitHub/vincentarelbundock.
[SS1] Scott Stevenson, xdg Python package, (2016-2021), PyPI.org.
Interactive interfaces
[AAi1] Anton Antonov, Example datasets recommender interface, (2021), Shinyapps.io.