ABCDE (Affect, Body, Cognition, Demographics, and Emotion) is a Python library for processing large textual datasets to extract demographic self-identification (e.g., age, gender, occupation, location, religion) and compute linguistic features (e.g., emotions, valence-arousal-dominance, warmth-competence, tense, body part mentions).
- Self-Identification Detection: Uses regex patterns to detect statements like "I am 25 years old" or "I live in London". Includes mappings (e.g., city to country, religion to category); see the illustrative sketch after this list.
- Linguistic Feature Extraction: Computes features from various lexicons (e.g., NRC emotions, VAD, worry words) and custom lists (e.g., body parts, tenses).
- Data Processing Pipelines: Scripts for downloading, extracting, and processing datasets in parallel (e.g., via SLURM for HPC environments).
- Output: Generates TSV files for users (aggregated demographics) and posts (per-entry features with demographics).
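The exact detection patterns live in src/abcde/core/detector.py; the snippet below is only an illustrative sketch of the regex approach described in the list above, with placeholder patterns rather than the library's own:

```python
import re

# Placeholder patterns: the real detector covers far more phrasings and fields.
PATTERNS = {
    "age": re.compile(r"\bI am (\d{1,2}) years? old\b", re.IGNORECASE),
    "city": re.compile(r"\bI live in ([A-Za-z][A-Za-z ]+)"),
}

def sketch_detect(text: str) -> dict[str, list[str]]:
    """Return all matches per demographic field found in `text`."""
    return {field: pattern.findall(text) for field, pattern in PATTERNS.items()}

print(sketch_detect("I am 25 years old and I live in London."))
# {'age': ['25'], 'city': ['London']}
```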
- Reddit (2010-2022): Posts and comments from Pushshift dumps (see the streaming sketch after this list)
- TUSC (Twitter): Geolocated tweets from the TUSC dataset
- Google Books Ngrams: English fiction 5-grams
- AI-Generated Text: WildChat, LMSYS, PIPPA, HH-RLHF, RAID, and more
- Blog Spinner Data: XML-based blog posts
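The Pushshift Reddit dumps are zstandard-compressed, newline-delimited JSON. As a minimal sketch, assuming the third-party zstandard package (the extraction scripts in bin/download/download-reddit/ may read the dumps differently), one dump file can be streamed like this:

```python
import io
import json

import zstandard  # third-party: pip install zstandard

def iter_pushshift(path: str):
    """Yield one JSON object per line from a Pushshift .zst dump."""
    # Recent Pushshift dumps use a long zstd window, so raise the limit.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

# Hypothetical file name:
# for post in iter_pushshift("RS_2022-01.zst"):
#     print(post.get("author"), post.get("created_utc"))
```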
abcde/
├── src/abcde/ # Main Python package
│ ├── __init__.py
│ ├── core/ # Core detection and feature extraction
│ │ ├── detector.py # SelfIdentificationDetector class
│ │ ├── features.py # Linguistic feature computation
│ │ └── pii.py # PII detection
│ ├── lexicons/ # Lexicon loading utilities
│ │ └── loaders.py
│ ├── io/ # I/O utilities
│ │ ├── csv_utils.py # CSV/TSV reading/writing
│ │ ├── jsonl_utils.py # JSONL file handling
│ │ └── demographics.py # Demographics aggregation
│ └── utils/ # Utility functions
│ ├── banner.py
│ ├── dates.py
│ └── text.py
├── scripts/ # Processing scripts (CLI tools)
│ ├── process_reddit.py
│ ├── process_tusc.py
│ ├── process_ngrams.py
│ ├── process_ai_text.py
│ ├── process_spinner.py
│ └── merge_dataset.py
├── bin/ # Shell scripts
│ ├── download/ # Data download scripts
│ │ ├── download-reddit/
│ │ └── download-google-books-ngrams/
│ └── run/ # Pipeline execution scripts
│ ├── run_reddit_pipeline.sh
│ ├── run_tusc_pipeline_full.sh
│ └── ...
├── lexicons/ # Lexicon data files
│ ├── NRC-*.txt # NRC emotion/VAD/warmth lexicons
│ ├── DMG-*.txt # Demographic data (countries, cities, etc.)
│ ├── BPM-*.txt # Body part mentions
│ ├── COG-*.json # Cognitive/thinking words
│ └── TIME-*.txt # Tense data
├── tests/ # Test files
├── helpers.py # Backward-compatible imports
├── pyproject.toml # Package configuration
├── README.md # This file
└── DATASET.md # Dataset documentation
# Clone the repository
git clone https://github.com/abcde-project/abcde.git
cd abcde
# Install with uv (recommended)
pip install uv
uv sync
# Or install with pip
pip install -e .
- Python 3.10+
- Dependencies are managed via pyproject.toml
python -c "import nltk; nltk.download('stopwords')"from abcde import SelfIdentificationDetector, apply_linguistic_features
# Detect demographic self-identification
detector = SelfIdentificationDetector()
text = "I am 25 years old and I live in London. I'm a software engineer."
matches = detector.detect(text)
print(matches)
# {'age': ['25'], 'city': ['london'], 'occupation': ['software engineer']}
# Get mappings (city -> country, etc.)
detailed = detector.detect_with_mappings(text)
print(detailed['city'])
# {'raw': ['london'], 'country_mapped': ['United Kingdom']}
# Extract linguistic features
features = apply_linguistic_features(text)
print(features['NRCAvgValence']) # Average valence score
print(features['WordCount'])  # Word count
# Process Reddit data
python scripts/process_reddit.py \
--input_dir /path/to/reddit/extracted \
--output_dir /path/to/output \
--stages both
# Process TUSC data
python scripts/process_tusc.py \
--input_file /path/to/tusc-city.parquet \
--output_dir /path/to/output \
--stages both
# Process AI-generated text
python scripts/process_ai_text.py \
--input_file /path/to/wildchat.csv \
--output_dir /path/to/output \
--dataset_name wildchat
# Process Google Books Ngrams
python scripts/process_ngrams.py \
--input_path /path/to/ngrams \
--output_path /path/to/output
# Submit Reddit pipeline
sbatch bin/run/run_reddit_pipeline.sh
# Submit TUSC pipeline
sbatch bin/run/run_tusc_pipeline_full.sh
# Submit AI text pipeline
sbatch bin/run/run_ai_text_pipeline.sh
# Generate URLs and download
cd bin/download/download-reddit
./create-reddit-urls.sh
./download-reddit.sh
# Extract
./extract-reddit.sh
cd bin/download/download-google-books-ngrams
./create-google-books-ngram-urls.sh
./download-google-books-ngrams.sh
./extract-google-books-ngrams.sh
Download from https://github.com/Priya22/EmotionDynamics and place the parquet files in your data directory.
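If these are the parquet files consumed by scripts/process_tusc.py above, a quick schema check with pandas can catch path or format problems before launching the full pipeline; this assumes pandas plus a parquet engine such as pyarrow is installed, and the path below is a placeholder:

```python
import pandas as pd

# Placeholder path: point this at wherever the parquet files were placed.
df = pd.read_parquet("/path/to/tusc-city.parquet")

print(df.shape)              # number of rows and columns
print(df.columns.tolist())   # column names expected by the pipeline
print(df.head(3))            # a few sample rows
```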
All lexicons are stored in lexicons/ and include:
| Lexicon | Description | Source |
|---|---|---|
| NRC-Emotion-Lexicon.txt | Word-emotion associations | NRC |
| NRC-VAD-Lexicon.txt | Valence-Arousal-Dominance scores | NRC |
| NRC-WorryWords-Lexicon.txt | Anxiety/calmness words | NRC |
| NRC-*Warmth-Lexicon.txt | Social warmth scores | NRC |
| DMG-country-list.txt | Country names | Wikipedia |
| DMG-geonames-*.csv | City data with countries | GeoNames |
| DMG-religion-list.csv | Religion mappings | Wikipedia |
| TIME-eng-word-tenses.txt | English verb tenses | UniMorph |
| BPM-bodywords-full.txt | Body part words | Collins/Enchanted Learning |
| COG-thinking-words-categorized.json | Cognitive verbs | Custom |
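The loaders in src/abcde/lexicons/loaders.py handle the real file formats; the sketch below only illustrates how a simple tab-separated word-to-score lexicon of this kind might be read (the two-column layout is an assumption, so check the actual files, which may have headers or extra columns):

```python
import csv

def load_word_scores(path: str) -> dict[str, float]:
    """Read a tab-separated `word<TAB>score` lexicon into a dict."""
    scores: dict[str, float] = {}
    with open(path, encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            try:
                scores[row[0]] = float(row[1])
            except ValueError:
                continue  # skip header rows or non-numeric values
    return scores

# Assumed layout; verify against the file before relying on it:
# valence = load_word_scores("lexicons/NRC-VAD-Lexicon.txt")
```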
The users file contains aggregated demographics per user:
- `Author`: User ID
- `DMGMajorityBirthyear`: Resolved birth year
- `DMGRawExtractedCity`, `DMGCountryMappedFromExtractedCity`
- `DMGRawExtractedReligion`, `DMGMainReligionMappedFromExtractedReligion`
- `DMGRawExtractedOccupation`, `DMGSOCTitleMappedFromExtractedOccupation`
The posts file contains per-post data with linguistic features:
- Post metadata (ID, text, timestamp, etc.)
- `DMGAgeAtPost`: Age when the post was created
- NRC features (`NRCAvgValence`, `NRCHasJoyWord`, etc.)
- Pronoun features (`PRNHasI`, `PRNHasWe`, etc.)
- Tense features (`TIMEHasPastVerb`, `TIMECountPresentVerbs`, etc.)
- Cognitive features (`COGHasAnalyzingEvaluatingWord`, etc.)
- Body part mentions (`MyBPM`, `HasBPM`, etc.)
See DATASET.md for complete feature documentation.
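As a short sketch of working with the per-post output: the file name below is a placeholder for the posts TSV produced by the pipeline, while DMGAgeAtPost and NRCAvgValence are columns listed above:

```python
import pandas as pd

# Placeholder path: use the posts TSV written by the pipeline.
posts = pd.read_csv("/path/to/output/posts.tsv", sep="\t")

# Average valence by the author's age at posting time; rows without a
# resolved age are dropped first.
by_age = (
    posts.dropna(subset=["DMGAgeAtPost"])
         .groupby("DMGAgeAtPost")["NRCAvgValence"]
         .mean()
)
print(by_age.head())
```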
# Run tests
uv run pytest
# Run with timeout
uv run pytest --timeout=120
# Install dev dependencies
uv sync --group dev
# Run linter
uv run ruff check .
# Run formatter
uv run ruff format .
If you use this dataset or code in your research, please cite:
@misc{abcde2024,
title={ABCDE: Age-Based Corpus of Demographic Expressions},
author={ABCDE Team},
year={2024},
url={https://github.com/abcde-project/abcde}
}
This project is licensed under the MIT License - see the LICENSE file for details.
- NRC lexicons by Dr. Saif Mohammad
- UniMorph project for tense data
- GeoNames for city data
- Pushshift for Reddit archives