This repository provides Jupyter notebooks for evaluating closed-source, text-based AI safety moderation (ASM) classifiers. The focus is on analyzing how these classifiers handle content related to various protected groups.
- `preprocessing.ipynb`: 📥 Downloads a large dataset and creates sub-datasets for protected groups.
- `augmentations.ipynb`: 🔄 Demonstrates backtranslation of texts from English to German and back to English for data augmentation or robustness testing.
- `uniformRandomClassifier.ipynb`: ⚖️ Provides a fairness baseline by assigning safe or unsafe outcomes with equal probability.
- `moderationGPT.ipynb`: Obtains moderation results using the OpenAI ASM.
- `clarifaiModeration.ipynb`: Obtains moderation results using the Clarifai ASM.
- `perspectiveModeration.ipynb`: Obtains moderation results using the Google Perspective ASM.
- `googleModeration.ipynb`: Obtains moderation results using the Google PaLM2-based ASM.
- `fairnessComputation.ipynb`: ⚖️ Performs a comparative fairness analysis of the ASMs using demographic parity and conditional statistical parity metrics.
- `robustness.ipynb`: 🛠️ Performs robustness analysis of the ASMs using input perturbation techniques such as backtranslation and paraphrasing.
- `process_raw_robustness_results.ipynb`: 🧹 Processes moderation outputs to obtain binary results (safe/unsafe). If moderation results are unavailable, processed outputs can be loaded directly from the results folder.
- `microrobustness.ipynb`: Conducts a deeper robustness analysis, computing the percentage of safe-to-unsafe and unsafe-to-safe transitions for each ASM.
- `regard.ipynb`: Performs regard sentiment analysis, classifying input texts into "positive," "negative," "neutral," and "other" categories.
- `voyage.ipynb`: Obtains text embeddings using the voyage-large-2-instruct model.
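For intuition, the uniform random fairness baseline can be sketched as a content-blind coin flip. This is a minimal illustration, not the notebook's actual code; the function name and seed handling are assumptions:

```python
import random

def uniform_random_baseline(texts, seed=42):
    """Hypothetical sketch of a fairness baseline: flag each text as
    "safe" or "unsafe" with equal probability, ignoring the content
    entirely. Any real classifier's fairness can be compared against
    this content-blind reference."""
    rng = random.Random(seed)
    return [rng.choice(["safe", "unsafe"]) for _ in texts]

labels = uniform_random_baseline(["text a", "text b", "text c"])
# labels contains one label per input, each drawn uniformly
# from {"safe", "unsafe"}
```

Because the labels are independent of the input, this baseline satisfies demographic parity in expectation, which makes it a useful reference point for the fairness metrics below.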
- Data Preprocessing: Run `preprocessing.ipynb` to download and prepare the datasets.
- Data Augmentation: Use `augmentations.ipynb` for backtranslations.
- Fairness Baseline: Execute `uniformRandomClassifier.ipynb` for a fairness baseline.
- Text Embeddings: Run `voyage.ipynb` to obtain embeddings using the voyage-large-2-instruct model.
- Moderation Results: Use the following notebooks to obtain moderation results:
  - `moderationGPT.ipynb` for the OpenAI ASM
  - `clarifaiModeration.ipynb` for the Clarifai ASM
  - `perspectiveModeration.ipynb` for the Google Perspective ASM
  - `googleModeration.ipynb` for the Google PaLM2-based ASM
- Fairness Analysis: Use `fairnessComputation.ipynb` to perform a comparative fairness analysis of the ASMs.
- Robustness Analysis:
  - Use `robustness.ipynb` to analyze the robustness of the ASMs under input perturbation techniques.
  - Use `microrobustness.ipynb` for a deeper analysis of safe-to-unsafe and unsafe-to-safe transitions.
- Processing Results: Use `process_raw_robustness_results.ipynb` to convert moderation outputs into binary safe/unsafe results. If moderation outputs are not available, load the processed results directly from the results folder.
- Sentiment Analysis: Run `regard.ipynb` to classify input texts into "positive," "negative," "neutral," and "other" sentiment classes.
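As a rough illustration of the fairness analysis, demographic parity compares unsafe-flag rates across protected groups. The sketch below is an assumption about the general metric, not the implementation in `fairnessComputation.ipynb`; the function name and data layout are hypothetical:

```python
def demographic_parity_gap(labels_by_group):
    """Given binary moderation outcomes per protected group (1 = unsafe,
    0 = safe), return the largest difference in unsafe-flag rates between
    any two groups. A gap of 0.0 means the classifier flags all groups
    at the same rate, i.e. it satisfies demographic parity."""
    rates = {g: sum(ys) / len(ys) for g, ys in labels_by_group.items()}
    return max(rates.values()) - min(rates.values())

gap = demographic_parity_gap({
    "group_a": [1, 1, 0, 0],  # 50% flagged unsafe
    "group_b": [1, 0, 0, 0],  # 25% flagged unsafe
})
print(gap)  # 0.25
```

Conditional statistical parity follows the same idea but computes the rates only within a conditioning subset (e.g. texts with the same ground-truth label) before comparing groups.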
The repository includes considerations for:
- 🧠 Ideology
- 🚺 Gender
- 🌍 Race
- ♿ Disability
- 🌈 Sexual Orientation
Contributions are welcome! Please fork the repository and submit a pull request.
Licensed under the MIT License - see the LICENSE file for details.
If you use our code for your research, please cite our paper:
```bibtex
@article{achara2025watching,
  title={Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers},
  author={Achara, Akshit and Chhabra, Anshuman},
  journal={arXiv preprint arXiv:2501.13302},
  year={2025}
}
```