20news_test_set_pyx.npy
README.md
amazon_pyx.part1_of_3.npy
amazon_pyx.part2_of_3.npy
amazon_pyx.part3_of_3.npy
audioset_eval_set_pyx.npy
caltech256_pyx.npy
cifar100_test_set_pyx.npy
cifar10_test_set_pyx.npy
imagenet_val_set_pyx.part1_of_4.npy
imagenet_val_set_pyx.part2_of_4.npy
imagenet_val_set_pyx.part3_of_4.npy
imagenet_val_set_pyx.part4_of_4.npy
imdb_test_set_pyx.npy
mnist_test_set_pyx.npy

Cross-Validated Predicted Probabilities for each of the ten datasets

Each file is a numpy.array of shape (num_test_examples x num_classes). We DO NOT quantize the predicted probabilities to np.float16 to reduce file size. Quantization does not work with confident learning / cleanlab because the ordering of label errors depends on the rank of the predicted probabilities: if many examples have a probability near 1, quantizing collapses them to the same value and the ranking over those examples is lost. Thus, we upload the files at full float64 precision.
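As a small illustration of the ranking problem described above, the sketch below casts three made-up probabilities near 1 to np.float16. The values are hypothetical and chosen only to show the effect; they do not come from any of the datasets.

```python
import numpy as np

# Three hypothetical, distinct predicted probabilities close to 1.
probs = np.array([0.9999, 0.99995, 0.99999], dtype=np.float64)

# float16 has a spacing of about 0.00049 just below 1.0,
# so all three values round to exactly 1.0.
quantized = probs.astype(np.float16)

n_distinct_before = len(np.unique(probs))      # 3 distinct values
n_distinct_after = len(np.unique(quantized))   # 1 distinct value

print(probs, "->", quantized)
print("distinct values:", n_distinct_before, "->", n_distinct_after)
```

Once the values tie, any ordering among those examples is arbitrary, which is the loss of ranking the paragraph refers to.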

Predicted probabilities are computed out-of-sample using cross-validation. In cases where the dataset has a separate train and test set, we first pre-trained on the train set, then fine-tuned using cross-validation to obtain the out-of-sample predicted probabilities on the test set.
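To make "out-of-sample" concrete, here is a minimal NumPy-only sketch of k-fold cross-validation in which every example's probabilities come from a model that never saw that example. The nearest-centroid softmax model, the function names, and the toy data are all assumptions made for illustration; this is not the training pipeline used to produce the files in this directory.

```python
import numpy as np

def centroid_softmax(X_train, y_train, X_holdout, n_classes):
    """Toy stand-in model (hypothetical): softmax over negative squared
    distances to the class centroids of the training split."""
    centroids = np.stack(
        [X_train[y_train == k].mean(axis=0) for k in range(n_classes)]
    )
    sq_dist = ((X_holdout[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def out_of_sample_pyx(X, y, n_classes, n_folds=5, seed=0):
    """Return an (n_examples x n_classes) float64 array where each row is
    predicted by a model fit only on the other folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    pyx = np.zeros((len(X), n_classes), dtype=np.float64)
    for holdout_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[holdout_idx] = False  # exclude the held-out fold from training
        pyx[holdout_idx] = centroid_softmax(
            X[train_mask], y[train_mask], X[holdout_idx], n_classes
        )
    return pyx

# Toy data: three well-separated classes with 20 examples each.
rng = np.random.default_rng(1)
y_toy = np.repeat(np.arange(3), 20)
X_toy = rng.normal(size=(60, 2)) + 5.0 * y_toy[:, None]
pyx_toy = out_of_sample_pyx(X_toy, y_toy, n_classes=3)
print(pyx_toy.shape)
```

For the datasets with a separate train set, the model fit inside each fold would start from the pre-trained weights rather than from scratch, as described above.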

ImageNet and Amazon Reviews Predicted Probabilities

The pyx.npy predicted probabilities files for these two datasets each exceed GitHub's 100MB per-file limit. To get around this, we uploaded these pyx files in parts.

To combine the parts for ImageNet, you can run:

import numpy as np
n_parts = 4
fn = 'imagenet_val_set_pyx.part{}_of_{}.npy'
parts = [np.load(fn.format(i + 1, n_parts)) for i in range(n_parts)]
# Combine the parts using np.vstack like this 
imagenet_pyx = np.vstack(parts)

amazon_pyx works similarly.
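For completeness, a sketch of the same approach for Amazon Reviews, wrapped in a small helper. The helper name `combine_parts` and its `directory` parameter are introduced here for illustration; the file name pattern and the part count of 3 come from the file names in this directory.

```python
import os
import numpy as np

def combine_parts(pattern, n_parts, directory="."):
    """Load part files named by `pattern` (1-indexed) and stack them row-wise."""
    paths = [
        os.path.join(directory, pattern.format(i + 1, n_parts))
        for i in range(n_parts)
    ]
    return np.vstack([np.load(path) for path in paths])

# Usage, assuming the three Amazon part files are in the current directory:
# amazon_pyx = combine_parts('amazon_pyx.part{}_of_{}.npy', n_parts=3)
```

The same helper covers ImageNet with `'imagenet_val_set_pyx.part{}_of_{}.npy'` and `n_parts=4`.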

Quickdraw Predicted Probabilities

quickdraw_pyx.npy is not included here because it is enormous (33GB): the dataset has over 50 million examples. We provide quickdraw_pyx.npy as its own release here: https://github.com/cleanlab/label-errors/releases/tag/quickdraw-pyx-v1

Although it affects confident learning's ability to rank examples, we quantize quickdraw_pyx.npy to np.float16 to make the size more manageable.
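Because a file of this size may not fit in memory, one option is to memory-map it and read only the rows you need. The sketch below uses NumPy's `mmap_mode` argument to `np.load`; the helper name `load_rows` is hypothetical, and it assumes the file has already been downloaded from the release.

```python
import numpy as np

def load_rows(path, start, stop):
    """Read rows [start, stop) from a large .npy file without loading all of it."""
    pyx = np.load(path, mmap_mode="r")  # memory-mapped, read-only view of the file
    # Copy just the requested slice into memory. Upcasting to float64 makes
    # later arithmetic convenient but does NOT recover precision lost to float16.
    return np.asarray(pyx[start:stop], dtype=np.float64)

# Usage, assuming quickdraw_pyx.npy is in the current directory:
# first_million = load_rows('quickdraw_pyx.npy', 0, 1_000_000)
```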
