ABCDE (Affect, Body, Cognition, Demographics, and Emotion) is a Python library for processing large textual datasets to extract demographic self-identification (e.g., age, gender, occupation, location, religion) and compute linguistic features (e.g., emotions, valence-arousal-dominance, warmth-competence, tense, body part mentions).
- Self-Identification Detection: Uses regex patterns to detect statements like "I am 25 years old" or "I live in London". Includes mappings (e.g., city to country, religion to category); see the illustrative sketch after this list.
- Linguistic Feature Extraction: Computes features from various lexicons (e.g., NRC emotions, VAD, worry words) and custom lists (e.g., body parts, tenses).
- Data Processing Pipelines: Scripts for downloading, extracting, and processing datasets in parallel (e.g., via SLURM for HPC environments).
- Output: Generates TSV files for users (aggregated demographics) and posts (per-entry features with demographics).
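The exact detection patterns live in src/abcde/core/detector.py; the snippet below is only an illustrative sketch of the regex approach described in the list above, with placeholder patterns rather than the library's own:

```python
import re

# Placeholder patterns: the real detector covers far more phrasings and fields.
PATTERNS = {
    "age": re.compile(r"\bI am (\d{1,2}) years? old\b", re.IGNORECASE),
    "city": re.compile(r"\bI live in ([A-Za-z][A-Za-z ]+)"),
}

def sketch_detect(text: str) -> dict[str, list[str]]:
    """Return all matches per demographic field found in `text`."""
    return {field: pattern.findall(text) for field, pattern in PATTERNS.items()}

print(sketch_detect("I am 25 years old and I live in London."))
# {'age': ['25'], 'city': ['London']}
```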
- Reddit (2010-2022): Posts and comments from Pushshift dumps (see the streaming sketch after this list)
- TUSC (Twitter): Geolocated tweets from the TUSC dataset
- Google Books Ngrams: English fiction 5-grams
- AI-Generated Text: WildChat, LMSYS, PIPPA, HH-RLHF, RAID, and more
- Blog Spinner Data: XML-based blog posts
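The Pushshift Reddit dumps are zstandard-compressed, newline-delimited JSON. As a minimal sketch, assuming the third-party zstandard package (the extraction scripts in bin/download/download-reddit/ may read the dumps differently), one dump file can be streamed like this:

```python
import io
import json

import zstandard  # third-party: pip install zstandard

def iter_pushshift(path: str):
    """Yield one JSON object per line from a Pushshift .zst dump."""
    # Recent Pushshift dumps use a long zstd window, so raise the limit.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

# Hypothetical file name:
# for post in iter_pushshift("RS_2022-01.zst"):
#     print(post.get("author"), post.get("created_utc"))
```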
abcde/
├── src/abcde/ # Main Python package
│ ├── __init__.py
│ ├── core/ # Core detection and feature extraction
│ │ ├── detector.py # SelfIdentificationDetector class
│ │ ├── features.py # Linguistic feature computation
│ │ └── pii.py # PII detection
│ ├── lexicons/ # Lexicon loading utilities
│ │ └── loaders.py
│ ├── io/ # I/O utilities
│ │ ├── csv_utils.py # CSV/TSV reading/writing
│ │ ├── jsonl_utils.py # JSONL file handling
│ │ └── demographics.py # Demographics aggregation
│ └── utils/ # Utility functions
│ ├── banner.py
│ ├── dates.py
│ └── text.py
├── scripts/ # Processing scripts (CLI tools)
│ ├── process_reddit.py
│ ├── process_tusc.py
│ ├── process_ngrams.py
│ ├── process_ai_text.py
│ ├── process_spinner.py
│ └── merge_dataset.py
├── bin/ # Shell scripts
│ ├── download/ # Data download scripts
│ │ ├── download-reddit/
│ │ └── download-google-books-ngrams/
│ └── run/ # Pipeline execution scripts
│ ├── run_reddit_pipeline.sh
│ ├── run_tusc_pipeline_full.sh
│ └── ...
├── lexicons/ # Lexicon data files
│ ├── NRC-*.txt # NRC emotion/VAD/warmth lexicons
│ ├── DMG-*.txt # Demographic data (countries, cities, etc.)
│ ├── BPM-*.txt # Body part mentions
│ ├── COG-*.json # Cognitive/thinking words
│ └── TIME-*.txt # Tense data
├── tests/ # Test files
├── helpers.py # Backward-compatible imports
├── pyproject.toml # Package configuration
├── README.md # This file
└── DATASET.md # Dataset documentation
# Clone the repository
git clone https://github.com/abcde-project/abcde.git
cd abcde
# Install with uv (recommended)
pip install uv
uv sync
# Or install with pip
pip install -e .
- Python 3.10+
- Dependencies are managed via pyproject.toml
python -c "import nltk; nltk.download('stopwords')"from abcde import SelfIdentificationDetector, apply_linguistic_features
# Detect demographic self-identification
detector = SelfIdentificationDetector()
text = "I am 25 years old and I live in London. I'm a software engineer."
matches = detector.detect(text)
print(matches)
# {'age': ['25'], 'city': ['london'], 'occupation': ['software engineer']}
# Get mappings (city -> country, etc.)
detailed = detector.detect_with_mappings(text)
print(detailed['city'])
# {'raw': ['london'], 'country_mapped': ['United Kingdom']}
# Extract linguistic features
features = apply_linguistic_features(text)
print(features['NRCAvgValence']) # Average valence score
print(features['WordCount'])  # Word count
# Process Reddit data
python scripts/process_reddit.py \
--input_dir /path/to/reddit/extracted \
--output_dir /path/to/output \
--stages both
# Process TUSC data
python scripts/process_tusc.py \
--input_file /path/to/tusc-city.parquet \
--output_dir /path/to/output \
--stages both
# Process AI-generated text
python scripts/process_ai_text.py \
--input_file /path/to/wildchat.csv \
--output_dir /path/to/output \
--dataset_name wildchat
# Process Google Books Ngrams
python scripts/process_ngrams.py \
--input_path /path/to/ngrams \
--output_path /path/to/output
# Submit Reddit pipeline
sbatch bin/run/run_reddit_pipeline.sh
# Submit TUSC pipeline
sbatch bin/run/run_tusc_pipeline_full.sh
# Submit AI text pipeline
sbatch bin/run/run_ai_text_pipeline.sh
# Generate URLs and download
cd bin/download/download-reddit
./create-reddit-urls.sh
./download-reddit.sh
# Extract
./extract-reddit.sh
cd bin/download/download-google-books-ngrams
./create-google-books-ngram-urls.sh
./download-google-books-ngrams.sh
./extract-google-books-ngrams.sh
Download from https://github.com/Priya22/EmotionDynamics and place the parquet files in your data directory.
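If these are the parquet files consumed by scripts/process_tusc.py above, a quick schema check with pandas can catch path or format problems before launching the full pipeline; this assumes pandas plus a parquet engine such as pyarrow is installed, and the path below is a placeholder:

```python
import pandas as pd

# Placeholder path: point this at wherever the parquet files were placed.
df = pd.read_parquet("/path/to/tusc-city.parquet")

print(df.shape)              # number of rows and columns
print(df.columns.tolist())   # column names expected by the pipeline
print(df.head(3))            # a few sample rows
```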
All lexicons are stored in lexicons/ and include:
| Lexicon | Description | Source |
|---|---|---|
| NRC-Emotion-Lexicon.txt | Word-emotion associations | NRC |
| NRC-VAD-Lexicon.txt | Valence-Arousal-Dominance scores | NRC |
| NRC-WorryWords-Lexicon.txt | Anxiety/calmness words | NRC |
| NRC-*Warmth-Lexicon.txt | Social warmth scores | NRC |
| DMG-country-list.txt | Country names | Wikipedia |
| DMG-geonames-*.csv | City data with countries | GeoNames |
| DMG-religion-list.csv | Religion mappings | Wikipedia |
| TIME-eng-word-tenses.txt | English verb tenses | UniMorph |
| BPM-bodywords-full.txt | Body part words | Collins/Enchanted Learning |
| COG-thinking-words-categorized.json | Cognitive verbs | Custom |
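The loaders in src/abcde/lexicons/loaders.py handle the real file formats; the sketch below only illustrates how a simple tab-separated word-to-score lexicon of this kind might be read (the two-column layout is an assumption, so check the actual files, which may have headers or extra columns):

```python
import csv

def load_word_scores(path: str) -> dict[str, float]:
    """Read a tab-separated `word<TAB>score` lexicon into a dict."""
    scores: dict[str, float] = {}
    with open(path, encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            try:
                scores[row[0]] = float(row[1])
            except ValueError:
                continue  # skip header rows or non-numeric values
    return scores

# Assumed layout; verify against the file before relying on it:
# valence = load_word_scores("lexicons/NRC-VAD-Lexicon.txt")
```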
The users file contains aggregated demographics per user:
- `Author`: User ID
- `DMGMajorityBirthyear`: Resolved birth year
- `DMGRawExtractedCity`, `DMGCountryMappedFromExtractedCity`
- `DMGRawExtractedReligion`, `DMGMainReligionMappedFromExtractedReligion`
- `DMGRawExtractedOccupation`, `DMGSOCTitleMappedFromExtractedOccupation`
The posts file contains per-post data with linguistic features:
- Post metadata (ID, text, timestamp, etc.)
- `DMGAgeAtPost`: Age when the post was created
- NRC features (`NRCAvgValence`, `NRCHasJoyWord`, etc.)
- Pronoun features (`PRNHasI`, `PRNHasWe`, etc.)
- Tense features (`TIMEHasPastVerb`, `TIMECountPresentVerbs`, etc.)
- Cognitive features (`COGHasAnalyzingEvaluatingWord`, etc.)
- Body part mentions (`MyBPM`, `HasBPM`, etc.)
See DATASET.md for complete feature documentation.
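As a short sketch of working with the per-post output: the file name below is a placeholder for the posts TSV produced by the pipeline, while DMGAgeAtPost and NRCAvgValence are columns listed above:

```python
import pandas as pd

# Placeholder path: use the posts TSV written by the pipeline.
posts = pd.read_csv("/path/to/output/posts.tsv", sep="\t")

# Average valence by the author's age at posting time; rows without a
# resolved age are dropped first.
by_age = (
    posts.dropna(subset=["DMGAgeAtPost"])
         .groupby("DMGAgeAtPost")["NRCAvgValence"]
         .mean()
)
print(by_age.head())
```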
# Run tests
uv run pytest
# Run with timeout
uv run pytest --timeout=120
# Install dev dependencies
uv sync --group dev
# Run linter
uv run ruff check .
# Run formatter
uv run ruff format .
If you use this dataset or code in your research, please cite:
@misc{abcde2024,
title={ABCDE: Age-Based Corpus of Demographic Expressions},
author={ABCDE Team},
year={2024},
url={https://github.com/abcde-project/abcde}
}
This project is licensed under the MIT License - see the LICENSE file for details.
- NRC lexicons by Dr. Saif Mohammad
- UniMorph project for tense data
- GeoNames for city data
- Pushshift for Reddit archives