Image

Rapid specification of regression workflows

… aka Regressionizer demo

Introduction

This blog post provides examples of specifying different regression workflows using the class Regressionizer of the Python package “Regressionizer”, [AAp1].

The primary focus of Regressionizer is Quantile Regression (QR), [RK1, RK2]. It closely follows the monadic pipeline design explained in detail in the document “A monad for Quantile Regression workflows”, [AA1].

For introduction and overview of Quantile Regression see the video “Boston useR! QuantileRegression Workflows 2019-04-18”.

Summary of Regressionizer features

  • The class Regressionizer facilitates rapid specifications of regressions workflows.
    • To quickly specify:
      • data rescaling and summary
      • regression computations
      • outliers finding
      • conditional Cumulative Distribution Functions (CDFs) reconstruction
      • plotting of data, fits, residual errors, outliers, CDFs
  • Regressionizer works with data frames, numpy arrays, lists of numbers, and lists of numeric pairs.

Details and arguments

  • The curves computed with Quantile Regression are called regression quantiles.
  • Regressionizer has three regression methods:
    • quantile_regression
    • quantile_regression_fit
    • least_squares_fit
  • The regression quantiles computed with the methods quantile_regression and quantile_regression_fit correspond to probabilities specified with the argument probs.
  • The methodquantile_regression computes fits using a B-spline functions basis.
    • The basis is specified with the arguments knots and order.
    • order is 3 by default.
  • The methods quantile_regession_fit and least_squares_fit use lists of basis functions to fit with specified with the argument funcs.

Workflows flowchart

The following flowchart summarizes the workflows that are supported by Regressionizer:

Image

Previous work

Roger Koenker implemented the R package “quantreg”, [RKp1]. Anton Antonov implemented the R package “QRMon-R” for the specification of monadic pipelines for doing QR, [AAp1].

Several Wolfram Language (aka Mathematica) packages are implemented by Anton Antonov, see [AAp1, AAp2, AAf1].

Remark: The paclets at the Wolfram Language Paclet Repository were initially Mathematica packages hosted at GitHub. The Wolfram Function Repository function QuantileRegression, [AAf1] does only B-spline fitting.


Setup

Load the “Regressionizer” and other “standard” packages:

from Regressionizer import *

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
template='plotly'

Generate input data

Generate random data:

np.random.seed(0)
x = np.linspace(0, 2, 300)
y = np.sin(2 * np.pi * x) + np.random.normal(0, 0.4, x.shape)
data = np.column_stack((x, y))

Plot the generated data:

fig = px.scatter(x=data[:, 0], y=data[:, 1], labels={'x': 'X-axis', 'y': 'Y-axis'}, template=template, width = 800, height = 600)
fig.show()
Image

Fit given functions

Define a list of functions:

funcs = [lambda x: 1, lambda x: x, lambda x: np.cos(x), lambda x: np.cos(3 * x), lambda x: np.cos(6 * x)]
def chebyshev_t_polynomials(n):
if n == 0:
return lambda x: 1
elif n == 1:
return lambda x: x
else:
T0 = lambda x: 1
T1 = lambda x: x
for i in range(2, n + 1):
Tn = lambda x, T0=T0, T1=T1: 2 * x * T1(x) - T0(x)
T0, T1 = T1, Tn
return Tn

chebyshev_polynomials = [chebyshev_t_polynomials(i) for i in range(10)]

Define regression quantile probabilities:

probs = [0.1, 0.5, 0.9]

Perform Quantile Regression and (non-linear) Least Squares Fit:

obj2 = (
Regressionizer(data)
.echo_data_summary()
.quantile_regression_fit(funcs=chebyshev_polynomials, probs=probs)
.least_squares_fit(funcs=chebyshev_polynomials)
.plot(title = "Quantile Regression and Least Squares fitting using Chebyshev polynomials", template=template)
)
Statistic    Regressor | Value
------------ --------------------
min                 0.0 | -2.0324132316043735
25%                 0.5 | -0.6063257640389526
median              1.0 | -0.0042185202753221695
75%                 1.5 | 0.6300535444986601
max                 2.0 | 1.757964402499859

Plot the obtained regression quantilies and least squares fit:

obj2.take_value().show()
Image

Fit B-splines

Instead of coming-up with basis functions we can use B-spline basis:

obj = Regressionizer(data).quantile_regression(knots=8, probs=[0.2, 0.5, 0.8]).plot(title="B-splines fit", template=template)

Show the obtained plot:

obj.take_value().show()
Image

Here is a dictionary of the found regression quantiles:

obj.take_regression_quantiles()
{0.2: <function QuantileRegression.QuantileRegression._make_combined_function.<locals>.<lambda>(x)>,
 0.5: <function QuantileRegression.QuantileRegression._make_combined_function.<locals>.<lambda>(x)>,
 0.8: <function QuantileRegression.QuantileRegression._make_combined_function.<locals>.<lambda>(x)>}


Weather temperature data

Load weather data:

import pandas as pd

url = "https://raw.githubusercontent.com/antononcube/MathematicaVsR/master/Data/MathematicaVsR-Data-Atlanta-GA-USA-Temperature.csv"
dfTemperature = pd.read_csv(url)
dfTemperature['DateObject'] = pd.to_datetime(dfTemperature['Date'], format='%Y-%m-%d')
dfTemperature = dfTemperature[(dfTemperature['DateObject'].dt.year >= 2020) & (dfTemperature['DateObject'].dt.year <= 2023)]
dfTemperature
DateAbsoluteTimeTemperatureDateObject
25552020-01-0137868256007.562020-01-01
25562020-01-0237869120007.282020-01-02
25572020-01-03378699840012.282020-01-03
25582020-01-04378708480012.782020-01-04
25592020-01-0537871712004.832020-01-05
40112023-12-27391262400011.672023-12-27
40122023-12-2839127104007.442023-12-28
40132023-12-2939127968003.782023-12-29
40142023-12-3039128832004.832023-12-30
40152023-12-3139129696001.172023-12-31

1461 rows × 4 columns

Convert to “numpy” array:

temp_data = dfTemperature[['AbsoluteTime', 'Temperature']].to_numpy()
temp_data.shape
(1461, 2)

Here is pipeline for Quantile Regression computation and making of a corresponding plot:

obj = (
Regressionizer(temp_data)
.echo_data_summary()
.quantile_regression(knots=20, probs=[0.2, 0.5, 0.8])
.date_list_plot(title="Atlanta, Georgia, USA, Temperature, ℃", template=template, data_color="darkgray", width = 1200)
)
Statistic    Regressor | Value
------------ --------------------
min          3786825600.0 |     -11.89
25%          3818361600.0 |      10.06
median       3849897600.0 |      16.94
75%          3881433600.0 |      22.56
max          3912969600.0 |      32.39

Show the obtained plot:

obj.take_value().show()
Image

Fitting errors

Errors

Here the absolute fitting errors are computed and the average is for each is computed:

{ k : np.mean(np.array(d)[:,1]) for k, d in obj.errors(relative_errors=False).take_value().items() }
{0.2: 3.331223347420249, 0.5: 0.020191754857989016, 0.8: -3.3960272281557753}

Error plots

Here we give the fitting errors (residuals) for the regression quantiles found and plotted above:

obj.error_plots(relative_errors=False, date_plot=True, template=template, width=1200, height=300).take_value().show()
Image

Outliers

One way to find contextual outliers in time series is to find regression quantiles at low- and high enough probabilities, and then select the points “outside” of those curves:

obj = (
Regressionizer(temp_data)
.quantile_regression(knots=20, probs=[0.01, 0.99], order=3)
.outliers()
)

obj.take_value()
{'bottom': [array([ 3.7885536e+09, -3.1100000e+00]),
  array([3.7919232e+09, 3.2800000e+00]),
  array([3.795552e+09, 7.390000e+00]),
  array([3.7977984e+09, 9.2800000e+00]),
  array([3.7982304e+09, 1.0220000e+01]),
  array([3.8068704e+09, 2.0110000e+01]),
  array([3.8097216e+09, 1.2390000e+01]),
  array([ 3.8225088e+09, -4.7200000e+00]),
  array([3.8298528e+09, 1.0220000e+01]),
  array([3.8333952e+09, 1.8720000e+01]),
  array([3.8458368e+09, 3.5000000e+00]),
  array([ 3.8524896e+09, -2.3900000e+00])],
 'top': [array([3.7944288e+09, 2.2390000e+01]),
  array([3.802896e+09, 2.756000e+01]),
  array([3.8040192e+09, 2.7940000e+01]),
  array([3.8129184e+09, 2.3000000e+01]),
  array([3.814128e+09, 2.128000e+01]),
  array([3.820608e+09, 1.778000e+01]),
  array([3.8258784e+09, 2.3500000e+01]),
  array([3.8326176e+09, 2.7060000e+01]),
  array([3.839184e+09, 2.617000e+01]),
  array([3.8420352e+09, 2.2780000e+01]),
  array([3.8641536e+09, 2.9830000e+01]),
  array([3.8727072e+09, 2.5610000e+01]),
  array([3.8816928e+09, 1.8060000e+01])]}

Here we plot the outliers (using a “narrower band” than above):

obj = (
Regressionizer(temp_data)
.quantile_regression(knots=20, probs=[0.05, 0.95], order=3)
.outliers_plot(
title="Outliers of Atlanta, Georgia, USA, Temperature, ℃",
data_color="darkgray",
date_plot=True,
template=template,
width = 1200)
)

obj.take_value().show()
Image

Conditional CDF

Here is a list of probabilities to be used to reconstruct Cumulative Distribution Functions (CDFs):

probs = np.sort(np.concatenate((np.arange(0.1, 1.0, 0.1), [0.01, 0.99])))
probs
array([0.01, 0.1 , 0.2 , 0.3 , 0.4 , 0.5 , 0.6 , 0.7 , 0.8 , 0.9 , 0.99])

Here we find the regression quantiles for those probabilities:

obj=(
Regressionizer(temp_data)
.quantile_regression(knots=20,probs=probs)
.date_list_plot(template=template, data_color="darkgray", width=1200)
)

Here we show the plot obtained above:

obj.take_value().show()
Image

Get CDF function

Here we take a date in ISO format and convert to number of seconds since 1900-01-01:

from datetime import datetime

iso_date = "2022-01-01"
date_object = datetime.fromisoformat(iso_date)
epoch = datetime(1900, 1, 1)

focusPoint = int((date_object - epoch).total_seconds())
print(focusPoint)
3849984000

Here the conditional CDF at that date is computed:

aCDFs = obj.conditional_cdf(focusPoint).take_value()
aCDFs
Image
{3849984000: <scipy.interpolate._interpolate.interp1d at 0x135c2c460>}

Plot the obtained CDF function:

xs = np.linspace(obj.take_regression_quantiles()[0.01](focusPoint), obj.take_regression_quantiles()[0.99](focusPoint), 20)
cdf_values = [aCDFs[focusPoint](x) for x in xs]

fig = go.Figure(data=[go.Scatter(x=xs, y=cdf_values, mode='lines')])
# Update layout
fig.update_layout(
title='Temperature Data CDF at ' + str(focusPoint),
xaxis_title='Temperature',
yaxis_title='Probability',
template=template,
legend=dict(title='Legend'),
height=300,
width=800
)
fig.show()

Plot multiple CDFs

Here are few dates converted into number of seconds since 1990-01-01:

pointsForCDFs = [focusPoint + i * 365 * 24 * 3600 for i in range(-1,2)]
pointsForCDFs
[3818448000, 3849984000, 3881520000]

Here are the plots of CDF at those dates:

obj.conditional_cdf_plot(pointsForCDFs, title = 'CDFs', template=template).take_value().show()
Image

References

Articles, books

[RK1] Roger Koenker, Quantile Regression, Cambridge University Press, 2005.

[RK2] Roger Koenker, “Quantile Regression in R: a vignette”, (2006), CRAN.

[AA1] Anton Antonov, “A monad for Quantile Regression workflows”, (2018), MathematicaForPrediction at GitHub.

Packages, paclets

[AAp1] Anton Antonov, Quantile Regression Python package, (2024), GitHub/antononcube.

[AAp2] Anton Antonov, QRMon-R, (2019), GitHub/antononcube.

[AAp3] Anton Antonov, Quantile Regression WL paclet, (2014-2023), GitHub/antononcube.

[AAp4] Anton Antonov, Monadic Quantile Regression WL paclet, (2018-2024), GitHub/antononcube.

[AAf1] Anton Antonov, QuantileRegression, (2019), Wolfram Function Repository.

[RKp1] Roger Koenker, quantregCRAN.

Repositories

[AAr1] Anton Antonov, DSL::English::QuantileRegressionWorkflows in Raku, (2020), GitHub/antononcube.

Videos

[AAv1] Anton Antonov, “Boston useR! QuantileRegression Workflows 2019-04-18”, (2019), Anton Antonov at YouTube.

[AAv2] Anton Antonov, “useR! 2020: How to simplify Machine Learning workflows specifications”, (2020), R Consortium at YouTube.