This repository provides Jupyter notebooks for evaluating closed-source, text-based AI safety moderation (ASM) classifiers. The focus is on analyzing how these classifiers handle content related to various protected groups.
- `preprocessing.ipynb`: 📥 Downloads a large dataset and creates sub-datasets for protected groups.
- `augmentations.ipynb`: 🔄 Demonstrates backtranslation of texts from English to German and back to English for data augmentation or robustness testing.
- `uniformRandomClassifier.ipynb`: ⚖️ Provides a fairness baseline by assigning safe or unsafe outcomes with equal probability.
- `moderationGPT.ipynb`: Obtains moderation results using the OpenAI ASM.
- `clarifaiModeration.ipynb`: Obtains moderation results using the Clarifai ASM.
- `perspectiveModeration.ipynb`: Obtains moderation results using the Google Perspective ASM.
- `googleModeration.ipynb`: Obtains moderation results using the Google PaLM2-based ASM.
- `fairnessComputation.ipynb`: ⚖️ Performs a comparative fairness analysis of the ASMs using demographic parity and conditional statistical parity metrics.
- `robustness.ipynb`: 🛠️ Performs robustness analysis of the ASMs using input perturbation techniques such as backtranslation and paraphrasing.
- `process_raw_robustness_results.ipynb`: 🧹 Processes moderation outputs to obtain binary results (safe/unsafe). If moderation results are unavailable, processed outputs can be loaded directly from the results folder.
- `microrobustness.ipynb`: Conducts a deeper robustness analysis, computing the percentage of safe-to-unsafe and unsafe-to-safe transitions for each ASM.
- `regard.ipynb`: Performs regard sentiment analysis, classifying input texts into "positive," "negative," "neutral," and "other" categories.
- `voyage.ipynb`: Obtains text embeddings using the voyage-large-2-instruct model.
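For intuition, the uniform random fairness baseline can be sketched as a content-blind coin flip. This is a minimal illustration, not the notebook's actual code; the function name and seed handling are assumptions:

```python
import random

def uniform_random_baseline(texts, seed=42):
    """Hypothetical sketch of a fairness baseline: flag each text as
    "safe" or "unsafe" with equal probability, ignoring the content
    entirely. Any real classifier's fairness can be compared against
    this content-blind reference."""
    rng = random.Random(seed)
    return [rng.choice(["safe", "unsafe"]) for _ in texts]

labels = uniform_random_baseline(["text a", "text b", "text c"])
# labels contains one label per input, each drawn uniformly
# from {"safe", "unsafe"}
```

Because the labels are independent of the input, this baseline satisfies demographic parity in expectation, which makes it a useful reference point for the fairness metrics below.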
- Data Preprocessing: Run `preprocessing.ipynb` to download and prepare the datasets.
- Data Augmentation: Use `augmentations.ipynb` for backtranslations.
- Fairness Baseline: Execute `uniformRandomClassifier.ipynb` for a fairness baseline.
- Text Embeddings: Run `voyage.ipynb` to obtain embeddings using the voyage-large-2-instruct model.
- Moderation Results: Use the following notebooks to obtain moderation results:
  - `moderationGPT.ipynb` for the OpenAI ASM
  - `clarifaiModeration.ipynb` for the Clarifai ASM
  - `perspectiveModeration.ipynb` for the Google Perspective ASM
  - `googleModeration.ipynb` for the Google PaLM2-based ASM
- Fairness Analysis: Use `fairnessComputation.ipynb` to perform a comparative fairness analysis of the ASMs.
- Robustness Analysis:
  - Use `robustness.ipynb` to analyze the robustness of the ASMs under input perturbation techniques.
  - Use `microrobustness.ipynb` for a deeper analysis of safe-to-unsafe and unsafe-to-safe transitions.
- Processing Results: Use `process_raw_robustness_results.ipynb` to convert moderation outputs into binary safe/unsafe results. If moderation outputs are not available, load the processed results directly from the results folder.
- Sentiment Analysis: Run `regard.ipynb` to classify input texts into "positive," "negative," "neutral," and "other" sentiment classes.
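As a rough illustration of the fairness analysis, demographic parity compares unsafe-flag rates across protected groups. The sketch below is an assumption about the general metric, not the implementation in `fairnessComputation.ipynb`; the function name and data layout are hypothetical:

```python
def demographic_parity_gap(labels_by_group):
    """Given binary moderation outcomes per protected group (1 = unsafe,
    0 = safe), return the largest difference in unsafe-flag rates between
    any two groups. A gap of 0.0 means the classifier flags all groups
    at the same rate, i.e. it satisfies demographic parity."""
    rates = {g: sum(ys) / len(ys) for g, ys in labels_by_group.items()}
    return max(rates.values()) - min(rates.values())

gap = demographic_parity_gap({
    "group_a": [1, 1, 0, 0],  # 50% flagged unsafe
    "group_b": [1, 0, 0, 0],  # 25% flagged unsafe
})
print(gap)  # 0.25
```

Conditional statistical parity follows the same idea but computes the rates only within a conditioning subset (e.g. texts with the same ground-truth label) before comparing groups.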
The repository includes considerations for:
- 🧠 Ideology
- 🚺 Gender
- 🌍 Race
- ♿ Disability
- 🌈 Sexual Orientation
Contributions are welcome! Please fork the repository and submit a pull request.
Licensed under the MIT License - see the LICENSE file for details.
If you use our code for your research, please cite our paper:
```bibtex
@article{achara2025watching,
  title={Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers},
  author={Achara, Akshit and Chhabra, Anshuman},
  journal={arXiv preprint arXiv:2501.13302},
  year={2025}
}
```