Data Detective is an open-source, modular, extensible validation framework for identifying potential issues with heterogeneous, multimodal data.
- Do splits used for model training come from othe same distribution?
- Are there any anomalies present in the dataset?
- Are the conditional independences we expect in the data obeyed?
- Are the datapoints at inference in the same distribution as what we have used to train/test the model?
- Are there near or exact duplicates present within the dataset?
- Are there mislabeled samples present within the dataset?
# install packages supporting rank aggregation
git clone https://github.com/thelahunginjeet/pyrankagg.git
git clone https://github.com/thelahunginjeet/kbutil.git
pip install -r requirements.txtIf you are planning on using Data Detective in a jupyter notebook, please ensure that the kernel is switched to the appropriate virtual environoment.
If you are planning to make use of the pretrained transform library for high dimensional inputs, follow the additional install steps outlined below.
# for huggingface hosted models
pip install transformers| notebook | description |
|---|---|
| Tutorial | To get started on our tutorial dataset and step through each part of an investigation, see the tutorial. |
| Quickstart | To get started as quickly as possible on your own data, please see Quickstart.ipynb in the examples folder. |
| Extending DD | To extend the capacity of Data Detective to your custom validation or transform needs, see ExtendingDD.ipynb |
To contribute to Data Detective, please first complete the ExtendingDD jupyter notebook to learn more about
how to extend Data Detective to add new validator methods, validators, and validator methods to the Data Detective
ecosystem. To submit a contribution, issue a pull request containing the contribution as well as any relevant
tests guaranteeing functionality of the implementation. All pull requests must be approved by at least one Data Detective
administrator before being merged into the master branch.
There should be at least one test attached to each validator method / transform. All submitted code should be well-documented and follow the PEP-8 standard.
- Zhang et. al for KCI/FCIT, used in validator method
- Zhao et. al for pyod
- Kevin Brown for pyrankagg
- all interviewed members of Genentech/Roche for continued feedback during development


