ValidMLInference is a Python package for correcting bias and performing valid inference in regressions that include variables generated by AI/ML methods. The bias-correction methods are described in Battaglia, Christensen, Hansen & Sacher (2024).
ValidMLInference runs on Python 3.8 and requires standard numerical packages: numpy, scipy, jax, jaxopt, and numdifftools.
To install the package, run
pip install ValidMLInference
in your terminal.
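After installation, one quick way to confirm that the dependencies listed above are present is the sketch below. The helper missing_dependencies is hypothetical and not part of ValidMLInference. It assumes each package's import name matches the name given in the requirements list.

```python
from importlib.util import find_spec

# Assumption: import names match the package names listed in the requirements.
REQUIRED = ["numpy", "scipy", "jax", "jaxopt", "numdifftools"]


def missing_dependencies(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_dependencies()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies were found.")
```

The check only locates the packages without importing them, so it stays fast even when jax is installed.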
To get started, we recommend looking at the following examples and resources:
- Remote Work: This notebook estimates the association between working from home and salaries using real-world job postings data (Hansen et al., 2023). It illustrates how the functions ols_bca, ols_bcm and one_step can be used to correct bias from regressing on AI/ML-generated labels. The notebook reproduces results from Table 1 of Battaglia, Christensen, Hansen & Sacher (2024).
- Topic Models: This notebook estimates the association between CEO time allocation and firm performance (Bandiera et al., 2020). It illustrates how the functions ols_bca_topic and ols_bcm_topic can be used to correct bias from estimated topic model shares. The notebook reproduces results from Table 2 of Battaglia, Christensen, Hansen & Sacher (2024).
- Synthetic Example: A synthetic example comparing the performance of different bias-correction methods in the context of AI/ML-generated labels.
- Functionality: A detailed reference describing all available functions, optional arguments, and usage tips.
The code below compares coefficients estimated by ordinary least squares with those from the one_step approach when the regressor is subject to classification error. The 95% confidence interval produced by one_step contains the true parameter value of 2, whereas the interval from standard OLS does not.
import numpy as np
import pandas as pd
from ValidMLInference import ols, one_step
# Set random seed for reproducibility
np.random.seed(42)
# Generate synthetic data with mislabeling
n = 1000
true_effect = 2.0
# True treatment assignment
X_true = np.random.binomial(1, 0.5, n)
# Observed (mislabeled) treatment with 20% error rate
mislabel_prob = 0.2
X_obs = X_true.copy()
mislabel_mask = np.random.binomial(1, mislabel_prob, n).astype(bool)
X_obs[mislabel_mask] = 1 - X_obs[mislabel_mask]
# Generate outcome with true treatment effect
Y = 1.0 + true_effect * X_true + np.random.normal(0, 1, n)
# Create DataFrame
data = pd.DataFrame({'Y': Y, 'X_obs': X_obs})
# Naive OLS using mislabeled data
ols_result = ols(formula="Y ~ X_obs", data=data)
print("OLS Results (using mislabeled data):")
print(ols_result.summary())
# One-step estimator that corrects for mislabeling
one_step_result = one_step(formula="Y ~ X_obs", data=data)
print("\nOne-Step Results (correcting for mislabeling):")
print(one_step_result.summary())
ols_ci = ols_result.summary().loc['X_obs', ['2.5%', '97.5%']]
one_step_ci = one_step_result.summary().loc['X_obs', ['2.5%', '97.5%']]
print(f"\nTrue treatment effect: {true_effect}")
print(f"OLS 95% CI contains true value: {ols_ci['2.5%'] <= true_effect <= ols_ci['97.5%']}")
print(f"One-step 95% CI contains true value: {one_step_ci['2.5%'] <= true_effect <= one_step_ci['97.5%']}")

OLS Results (using mislabeled data):
           Estimate  Std. Error    z value  P>|z|      2.5%     97.5%
Intercept  1.392265    0.055828  24.938313    0.0  1.282843  1.501687
X_obs      1.207589    0.078643  15.355267    0.0  1.053451  1.361727
One-Step Results (correcting for mislabeling):
           Estimate  Std. Error    z value  P>|z|      2.5%     97.5%
X_obs      1.828638    0.108976  16.780127    0.0  1.615048  2.042228
Intercept  1.092510    0.107082  10.202534    0.0  0.882633  1.302387
True treatment effect: 2.0
OLS 95% CI contains true value: False
One-step 95% CI contains true value: True
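
The naive OLS estimate of about 1.21 is close to what attenuation bias predicts in this setup. With a balanced binary regressor and symmetric mislabeling at rate e that is independent of everything else, the OLS slope on the observed label converges to (1 - 2e) times the true effect, here (1 - 0.4) * 2 = 1.2. The sketch below checks this with plain numpy on a larger sample. It is a standalone illustration under those assumptions and does not use or describe ValidMLInference internals; the function name naive_slope is hypothetical.

```python
import numpy as np


def naive_slope(n=200_000, true_effect=2.0, mislabel_prob=0.2, seed=0):
    """OLS slope of Y on a mislabeled binary regressor (simulation)."""
    rng = np.random.default_rng(seed)
    # True binary regressor with P(X = 1) = 0.5
    x_true = rng.binomial(1, 0.5, n)
    # Flip each label independently with probability mislabel_prob
    flip = rng.random(n) < mislabel_prob
    x_obs = np.where(flip, 1 - x_true, x_true)
    # Outcome depends on the true regressor, not the observed one
    y = 1.0 + true_effect * x_true + rng.normal(0.0, 1.0, n)
    # Simple-regression slope = Cov(x_obs, y) / Var(x_obs)
    cov = np.cov(x_obs, y)
    return cov[0, 1] / cov[0, 0]


slope = naive_slope()
predicted = (1 - 2 * 0.2) * 2.0  # attenuation factor times true effect
print(f"Simulated naive slope: {slope:.3f}")
print(f"Predicted by attenuation formula: {predicted:.3f}")
```

Because the bias does not shrink as the sample grows, collecting more data narrows the naive confidence interval around the wrong value, which is why a correction such as one_step is needed.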