DataTypeSystem

This blog post announces and briefly describes the Python package “DataTypeSystem”, which provides a type system for different data structures that are coercible into full arrays. The package is a Python translation of the Raku package “Data::TypeSystem”, [AAp1].

Installation

Install from GitHub

pip install -e git+https://github.com/antononcube/Python-packages.git#egg=DataTypeSystem-antononcube\&subdirectory=DataTypeSystem

Install from PyPI

pip install DataTypeSystem

Usage examples

The type system conventions follow those of Mathematica’s Dataset; see the presentation “Dataset improvements”.

Here we get the Titanic dataset, keep four of its columns, rename the “pclass” column to “class”, and show the dataset’s dimensions:

import pandas
dfTitanic = pandas.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
dfTitanic = dfTitanic[["sex", "age", "pclass", "survived"]]
dfTitanic = dfTitanic.rename(columns={"pclass": "class"})
dfTitanic.shape

(891, 4)

Here is a sample of the dataset’s records:

from DataTypeSystem import *
dfTitanic.sample(3)

	sex	age	class
555	male	62.0	1
278	male	7.0	3
266	male	16.0	3

Here is the type of a single record:

deduce_type(dfTitanic.iloc[12].to_dict())

Struct([age, class, sex, survived], [float, int, str, int])
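The Struct notation above can be illustrated with a minimal pure-Python sketch. This is not the package's implementation: the function name sketch_deduce_record_type is hypothetical, the record values are made up for illustration, and sorting the keys alphabetically is an assumption suggested by the output shown above.

```python
def sketch_deduce_record_type(record):
    """Describe a dict as Struct([keys], [type names]), keys sorted alphabetically.

    Hypothetical helper for illustration only; it is not part of DataTypeSystem.
    """
    keys = sorted(record)
    type_names = [type(record[k]).__name__ for k in keys]
    return "Struct([{}], [{}])".format(", ".join(keys), ", ".join(type_names))


# A made-up record with the same columns as the Titanic data frame above.
record = {"sex": "male", "age": 62.0, "class": 1, "survived": 0}
print(sketch_deduce_record_type(record))
```

The sketch only looks at Python's built-in type names; the package itself may treat missing values, nested structures, and NumPy scalar types differently.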

Here is the type of a single record’s values:

deduce_type(dfTitanic.iloc[12].to_dict().values())

Tuple([Atom(<class 'str'>), Atom(<class 'float'>), Atom(<class 'int'>), Atom(<class 'int'>)])

Here is the type of the whole dataset:

deduce_type(dfTitanic.to_dict())

Assoc(Atom(<class 'str'>), Assoc(Atom(<class 'int'>), Atom(<class 'str'>), 891), 4)

Here is the type of “values only” records:

valArr = dfTitanic.transpose().to_dict().values()
deduce_type(valArr)

Vector(Struct([age, class, sex, survived], [float, int, str, int]), 891)
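One way to read the Vector result: when every record has the same deduced type, the collection can be summarized by that single element type plus a length. The sketch below illustrates the idea in plain Python. The helper names are hypothetical, the records are made up, and the fallback to a Tuple listing for mixed records is an assumption rather than documented package behavior.

```python
def sketch_struct(record):
    """Hypothetical helper: describe a dict as Struct([keys], [type names])."""
    keys = sorted(record)
    type_names = [type(record[k]).__name__ for k in keys]
    return "Struct([{}], [{}])".format(", ".join(keys), ", ".join(type_names))


def sketch_deduce_vector_type(records):
    """Summarize homogeneous records as Vector(element type, length)."""
    descriptions = [sketch_struct(r) for r in records]
    if descriptions and len(set(descriptions)) == 1:
        return "Vector({}, {})".format(descriptions[0], len(descriptions))
    # Assumption: heterogeneous records are listed one by one.
    return "Tuple([{}])".format(", ".join(descriptions))


# Made-up records with the same columns as the Titanic data frame.
records = [
    {"sex": "male", "age": 62.0, "class": 1, "survived": 0},
    {"sex": "female", "age": 7.0, "class": 3, "survived": 1},
    {"sex": "male", "age": 16.0, "class": 3, "survived": 0},
]
print(sketch_deduce_vector_type(records))
```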

References

[AAp1] Anton Antonov, Data::TypeSystem Raku package, (2023), GitHub/antononcube.

Tries with frequencies

Introduction

This blog post introduces the Machine Learning (ML) data structure Tries with frequencies, [AA1], and gives examples of creating and using it through the Python package “TriesWithFrequencies”.

For the original Trie (or Prefix tree) data structure see the Wikipedia article “Trie”.

Setup

from TriesWithFrequencies import *

Creation examples

In this section we show a few ways to create tries with frequencies.

Consider a trie (prefix tree) created over a list of words:

tr = trie_create_by_split( ["bar", "bark", "bars", "balm", "cert", "cell"] )
trie_form(tr)

TRIEROOT => 6.0
├─b => 4.0
│ └─a => 4.0
│   ├─r => 3.0
│   │ ├─k => 1.0
│   │ └─s => 1.0
│   └─l => 1.0
│     └─m => 1.0
└─c => 2.0
  └─e => 2.0
    ├─r => 1.0
    │ └─t => 1.0
    └─l => 1.0
      └─l => 1.0
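To make the counts in the tree above concrete, here is a minimal sketch of a trie with frequencies built from nested dictionaries. It is an illustration under assumptions, not the package's internal representation: the function name sketch_trie_create and the "value"/"children" keys are hypothetical.

```python
def sketch_trie_create(words):
    """Build a nested-dict trie; each node counts the words passing through it."""
    root = {"value": 0.0, "children": {}}
    for word in words:
        node = root
        node["value"] += 1  # every word passes through the root
        for ch in word:
            node = node["children"].setdefault(ch, {"value": 0.0, "children": {}})
            node["value"] += 1
    return root


tr_sketch = sketch_trie_create(["bar", "bark", "bars", "balm", "cert", "cell"])
print(tr_sketch["value"])                   # 6.0
print(tr_sketch["children"]["b"]["value"])  # 4.0
print(tr_sketch["children"]["c"]["value"])  # 2.0
```

The three printed counts agree with the TRIEROOT, b, and c nodes in the trie form above.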

Here we convert the trie with frequencies above into a trie with probabilities:

ptr = trie_node_probabilities( tr )
trie_form(ptr)

TRIEROOT => 1.0
├─b => 0.6666666666666666
│ └─a => 1.0
│   ├─r => 0.75
│   │ ├─k => 0.3333333333333333
│   │ └─s => 0.3333333333333333
│   └─l => 0.25
│     └─m => 1.0
└─c => 0.3333333333333333
  └─e => 1.0
    ├─r => 0.5
    │ └─t => 1.0
    └─l => 0.5
      └─l => 1.0
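The probabilities above are consistent with dividing each node's count by its parent's count, for example 4/6 for b and 3/4 for r. The following sketch shows that conversion on a nested-dict trie. The function names and the node layout are hypothetical and only illustrate the arithmetic; the package may compute the values differently.

```python
def sketch_trie_create(words):
    """Hypothetical nested-dict trie with per-node word counts."""
    root = {"value": 0.0, "children": {}}
    for word in words:
        node = root
        node["value"] += 1
        for ch in word:
            node = node["children"].setdefault(ch, {"value": 0.0, "children": {}})
            node["value"] += 1
    return root


def sketch_node_probabilities(node, parent_value=None):
    """Replace each count with count / parent count; the root gets 1.0."""
    value = node["value"]
    prob = 1.0 if parent_value is None else value / parent_value
    children = {k: sketch_node_probabilities(c, value)
                for k, c in node["children"].items()}
    return {"value": prob, "children": children}


ptr_sketch = sketch_node_probabilities(
    sketch_trie_create(["bar", "bark", "bars", "balm", "cert", "cell"]))
b_node = ptr_sketch["children"]["b"]
r_node = b_node["children"]["a"]["children"]["r"]
print(b_node["value"])  # 4/6
print(r_node["value"])  # 3/4
```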

Shrinking

Here we shrink the trie with probabilities above:

trie_form(trie_shrink(ptr))

TRIEROOT => 1.0
└─ba => 1.0
  └─r => 0.75
    └─k => 0.3333333333333333
    └─s => 0.3333333333333333
  └─lm => 1.0
└─ce => 1.0
  └─rt => 1.0
  └─ll => 1.0

Here we shrink the frequencies trie using a separator:

trie_form(trie_shrink(tr, sep="~"))

TRIEROOT => 6.0
└─b~a => 4.0
  └─r => 3.0
    └─k => 1.0
    └─s => 1.0
  └─l~m => 1.0
└─c~e => 2.0
  └─r~t => 1.0
  └─l~l => 1.0
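For the frequencies trie, the shrunk form above can be reproduced by merging every chain in which a node has exactly one child and that child carries the same count. The sketch below implements that rule on a nested-dict trie. The merge criterion, the function names, and the node layout are assumptions made for illustration; the package's actual shrinking rule may differ, in particular for tries with probabilities.

```python
def sketch_trie_create(words):
    """Hypothetical nested-dict trie with per-node word counts."""
    root = {"value": 0.0, "children": {}}
    for word in words:
        node = root
        node["value"] += 1
        for ch in word:
            node = node["children"].setdefault(ch, {"value": 0.0, "children": {}})
            node["value"] += 1
    return root


def sketch_shrink_node(key, node, sep=""):
    """Merge a node with its only child while both carry the same count."""
    while len(node["children"]) == 1:
        [(child_key, child)] = node["children"].items()
        if child["value"] != node["value"]:
            break
        key, node = key + sep + child_key, child
    children = dict(sketch_shrink_node(k, c, sep)
                    for k, c in node["children"].items())
    return key, {"value": node["value"], "children": children}


def sketch_trie_shrink(root, sep=""):
    """Shrink every branch below the root; the root itself is never merged."""
    children = dict(sketch_shrink_node(k, c, sep)
                    for k, c in root["children"].items())
    return {"value": root["value"], "children": children}


shrunk = sketch_trie_shrink(
    sketch_trie_create(["bar", "bark", "bars", "balm", "cert", "cell"]), sep="~")
print(sorted(shrunk["children"]))                    # ['b~a', 'c~e']
print(sorted(shrunk["children"]["b~a"]["children"]))  # ['l~m', 'r']
```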

Retrieval and sub-tries

Here we retrieve a sub-trie with a key:

trie_form(trie_sub_trie(tr, list("bar")))

r => 3.0
└─k => 1.0
└─s => 1.0

Classification

Create a trie:

words = [*(["bar"] * 6), *(["bark"] * 3), *(["bare"] * 2), *(["cam"] * 3), "came", *(["camelia"] * 4)]
tr = trie_create_by_split(words)
tr = trie_node_probabilities(tr)

Show node counts:

trie_node_counts(tr)

{'total': 13, 'internal': 10, 'leaves': 3}

Show the trie form:

trie_form(tr)

TRIEROOT => 1.0
├─b => 0.5789473684210527
│ └─a => 1.0
│   └─r => 1.0
│     ├─k => 0.2727272727272727
│     └─e => 0.18181818181818182
└─c => 0.42105263157894735
  └─a => 1.0
    └─m => 1.0
      └─e => 0.625
        └─l => 0.8
          └─i => 1.0
            └─a => 1.0

Classify with the letters of the word “cam”:

trie_classify(tr, list("cam"), prop="Probabilities")

{'a': 0.5, 'm': 0.375, 'e': 0.12499999999999997}
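The result above can be cross-checked directly from the word list, without a trie. The sketch below assumes that each label is the final letter of a word starting with the given prefix, which is consistent with the output shown; the function name sketch_classify is hypothetical and this is not how the package computes the result.

```python
from collections import Counter


def sketch_classify(words, prefix):
    """Relative frequencies of the final letters of words that start with prefix."""
    matching = [w for w in words if w.startswith(prefix)]
    counts = Counter(w[-1] for w in matching)
    total = len(matching)
    # most_common() orders the labels from most to least frequent.
    return {label: n / total for label, n in counts.most_common()}


words = ["bar"] * 6 + ["bark"] * 3 + ["bare"] * 2 + ["cam"] * 3 + ["came"] + ["camelia"] * 4
print(sketch_classify(words, "cam"))
```

Of the eight words starting with “cam”, four end in “a”, three end in “m”, and one ends in “e”, which gives 0.5, 0.375, and 0.125. These agree with the trie-based result up to floating-point rounding.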

References

Articles

[AA1] Anton Antonov, “Tries with frequencies for data mining”, (2013), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “Removal of sub-trees in tries”, (2013), MathematicaForPrediction at WordPress.

[AA3] Anton Antonov, “Tries with frequencies in Java”, (2017), MathematicaForPrediction at WordPress. GitHub Markdown.

[WK1] Wikipedia entry, Trie.

Packages

[AAp1] Anton Antonov, Tries with frequencies Mathematica Version 9.0 package, (2013), MathematicaForPrediction at GitHub.

[AAp2] Anton Antonov, Tries with frequencies Mathematica package, (2013-2018), MathematicaForPrediction at GitHub.

[AAp3] Anton Antonov, Tries with frequencies in Java, (2017), MathematicaForPrediction at GitHub.

[AAp4] Anton Antonov, Java tries with frequencies Mathematica package, (2017), MathematicaForPrediction at GitHub.

[AAp5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), MathematicaForPrediction at GitHub.

[AAp6] Anton Antonov, ML::TriesWithFrequencies Raku package, (2021), GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “Prefix Trees with Frequencies for Data Analysis and Machine Learning”, (2017), Wolfram Technology Conference 2017, Wolfram channel at YouTube.

Facing data with Chernoff faces

Introduction

This blog post announces the Python package “ChernoffFace” and outlines and exemplifies its function chernoff_face, which generates Chernoff diagrams.

The design, implementation strategy, and unit tests closely resemble those of the Wolfram Function Repository (WFR) function ChernoffFace, [AAf1], and the original Mathematica package “ChernoffFaces.m”, [AAp1].

Installation

To install from GitHub use the shell command:

python -m pip install git+https://github.com/antononcube/Python-packages.git#egg=ChernoffFace\&subdirectory=ChernoffFace

To install from PyPI:

python -m pip install ChernoffFace

Usage examples

Setup

from ChernoffFace import *
import numpy
import matplotlib.cm

Random data

# Generate data
numpy.random.seed(32)
data = numpy.random.rand(16, 12)

# Make Chernoff faces
fig = chernoff_face(data=data,
                    titles=[str(x) for x in list(range(len(data)))],
                    color_mapper=matplotlib.cm.Pastel1)

Employee attitude data

Get Employee attitude data:

dfData = load_employee_attitude_data_frame()
dfData.head()

	Rating	Complaints	Privileges	Learning	Raises	Critical	Advancement
0	43	51	30	39	61	92	45
1	63	64	51	54	63	73	47
2	71	70	68	69	76	86	48
3	61	63	45	47	54	84	35
4	81	78	56	66	71	83	47

Rescale the variables:

dfData2 = variables_rescale(dfData)
dfData2.head()

	Rating	Complaints	Privileges	Learning	Raises	Critical	Advancement
0	0.066667	0.264151	0.000000	0.121951	0.400000	1.000000	0.425532
1	0.511111	0.509434	0.396226	0.487805	0.444444	0.558140	0.468085
2	0.688889	0.622642	0.716981	0.853659	0.733333	0.860465	0.489362
3	0.466667	0.490566	0.283019	0.317073	0.244444	0.813953	0.212766
4	0.911111	0.773585	0.490566	0.780488	0.622222	0.790698	0.468085
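The rescaled values above are consistent with per-column min-max scaling, (x - min) / (max - min). The sketch below shows that formula in plain Python. The function name is hypothetical, and the extremes 40 and 85 are assumptions chosen because they reproduce the “Rating” values shown above; the source does not state them, and variables_rescale itself may work differently.

```python
def sketch_min_max_rescale(values):
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# First five "Rating" values from the table, plus the assumed extremes 40 and 85.
ratings = [43, 63, 71, 61, 81, 40, 85]
rescaled = sketch_min_max_rescale(ratings)
print([round(x, 6) for x in rescaled[:5]])
```

Rescaling every variable to the unit interval puts the columns on a common scale before they are mapped to facial features.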

Make the corresponding Chernoff faces:

fig = chernoff_face(data=dfData2,
                    n_columns=5,
                    long_face=False,
                    color_mapper=matplotlib.cm.tab20b,
                    figsize=(8, 8), dpi=200)

USA arrests data

Get USA arrests data:

dfData = load_usa_arrests_data_frame()
dfData.head()

	StateName	Murder	Assault	UrbanPopulation	Rape
0	Alabama	13.2	236	58	21.2
1	Alaska	10.0	263	48	44.5
2	Arizona	8.1	294	80	31.0
3	Arkansas	8.8	190	50	19.5
4	California	9.0	276	91	40.6

Rescale the variables:

dfData2 = variables_rescale(dfData)
dfData2.head()

	StateName	Murder	Assault	UrbanPopulation	Rape
0	Alabama	0.746988	0.654110	0.440678	0.359173
1	Alaska	0.554217	0.746575	0.271186	0.961240
2	Arizona	0.439759	0.852740	0.813559	0.612403
3	Arkansas	0.481928	0.496575	0.305085	0.315245
4	California	0.493976	0.791096	1.000000	0.860465

Make the corresponding Chernoff faces using USA state names as titles:

fig = chernoff_face(data=dfData2,
                    n_columns=5,
                    long_face=False,
                    color_mapper=matplotlib.cm.tab20c_r,
                    figsize=(12, 12), dpi=200)

References

Articles

[AA1] Anton Antonov, “Making Chernoff faces for data visualization”, (2016), MathematicaForPrediction at WordPress.

Functions and packages

[AAf1] Anton Antonov, ChernoffFace, (2019), Wolfram Function Repository.

[AAp1] Anton Antonov, Chernoff faces implementation in Mathematica, (2016), MathematicaForPrediction at GitHub.

Python for Prediction

Python compared to Mathematica and R.

