Data Detective

Data Detective is an open-source, modular, extensible validation framework for identifying potential issues with heterogeneous, multimodal data.

Examples of issues that are in scope for Data Detective to detect

Do splits used for model training come from the same distribution?
Are there any anomalies present in the dataset?
Are the conditional independences we expect in the data obeyed?
Are the datapoints seen at inference drawn from the same distribution as the data used to train and test the model?
Are there near or exact duplicates present within the dataset?
Are there mislabeled samples present within the dataset?
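To make the first question concrete, here is a minimal sketch of one common way to compare two splits on a single numeric feature: the two-sample Kolmogorov-Smirnov statistic. This is an illustration only, not Data Detective's API or necessarily the method it uses. The function names `ks_statistic` and `ks_critical_value` are hypothetical, and the threshold is the standard large-sample approximation.

```python
from bisect import bisect_right
from math import log, sqrt


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to compare them at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def ks_critical_value(n, m, alpha=0.05):
    """Approximate large-sample rejection threshold for the statistic."""
    c_alpha = sqrt(-0.5 * log(alpha / 2))
    return c_alpha * sqrt((n + m) / (n * m))


# Hypothetical feature values from a train split and a shifted test split.
train = list(range(100))
test = list(range(50, 150))
stat = ks_statistic(train, test)
threshold = ks_critical_value(len(train), len(test))
print("splits look different:", stat > threshold)
```

A statistic above the threshold suggests the two splits differ on that feature. For multimodal or high-dimensional data, a per-feature test like this is only a starting point.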

Workflow

Installation Steps

# install packages supporting rank aggregation
git clone https://github.com/thelahunginjeet/pyrankagg.git
git clone https://github.com/thelahunginjeet/kbutil.git

pip install -r requirements.txt

If you plan to use Data Detective in a Jupyter notebook, make sure the kernel is switched to the appropriate virtual environment.
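One quick way to confirm which interpreter a notebook kernel is using is to run a check like the following in a cell. This is a minimal sketch using only the standard library. The helper name `in_virtual_environment` is hypothetical, and the check covers environments created with `venv` or `virtualenv`; conda environments are not detected this way, so inspect the printed interpreter path in that case.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when running inside a venv/virtualenv environment."""
    # Inside such an environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the installation it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
```

If the printed interpreter path is not the one inside your environment, the kernel is running against a different Python installation than the one where the requirements were installed.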

If you are planning to make use of the pretrained transform library for high dimensional inputs, follow the additional install steps outlined below.

# for huggingface hosted models
pip install transformers
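After the optional install, a short check like the one below can confirm that the package is visible to the current interpreter. It is a sketch only: `is_installed` is a hypothetical helper, not part of Data Detective.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


if is_installed("transformers"):
    print("transformers is available")
else:
    print("transformers is missing; run `pip install transformers`")
```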

Examples and Guide Notebooks

notebook	description
Tutorial	To get started on our tutorial dataset and step through each part of an investigation, see the tutorial.
Quickstart	To get started as quickly as possible on your own data, please see Quickstart.ipynb in the examples folder.
Extending DD	To extend Data Detective with your own custom validators or transforms, see ExtendingDD.ipynb.

Contributing

To contribute to Data Detective, please first complete the ExtendingDD Jupyter notebook to learn how to add new validator methods, validators, and transforms to the Data Detective ecosystem. To submit a contribution, open a pull request containing the contribution along with any relevant tests that verify the implementation works. All pull requests must be approved by at least one Data Detective administrator before being merged into the master branch.

There should be at least one test attached to each validator method / transform. All submitted code should be well-documented and follow the PEP-8 standard.

Acknowledgements

Zhang et al. for KCI/FCIT, used in validator methods
Zhao et al. for pyod
Kevin Brown for pyrankagg
All interviewed members of Genentech/Roche for continued feedback during development
