sanand0/discourse-scrape

Discourse Thread Sync

scrape.py keeps a local archive of Discourse topics. It talks to the Discourse API with your API key, fetches every post in each topic, and writes one JSON file per thread organised by category. md.py turns those JSON exports into lightweight, LLM-friendly Markdown summaries.

Setup

Place your Discourse credentials in .env. For example:

DISCOURSE_API_KEY=...
DISCOURSE_API_USERNAME=s.anand
DISCOURSE_BASE=https://discourse.onlinedegree.iitm.ac.in
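As a rough illustration of how these values might be consumed, the sketch below parses a simple `.env` file and builds the `Api-Key` and `Api-Username` headers that the Discourse API uses for authentication. The helper names `load_env` and `api_headers` are hypothetical; scrape.py may load its credentials differently.

```python
from pathlib import Path


def load_env(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_headers(env: dict[str, str]) -> dict[str, str]:
    """Build the two headers Discourse expects on authenticated API calls."""
    return {
        "Api-Key": env["DISCOURSE_API_KEY"],
        "Api-Username": env["DISCOURSE_API_USERNAME"],
    }
```

Requests would then go to URLs under `DISCOURSE_BASE` with these headers attached.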

Usage

Fetch threads updated since the most recent sync (on a first run, with no saved sync state, it falls back to seven days ago):

uv run scrape.py
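The fallback described above could look roughly like the following. This is a minimal sketch, not the script's actual code: the function name `resolve_since` and the `last_sync` key are assumptions, since the layout of `.settings.json` is not documented here.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path


def resolve_since(settings_path, now=None):
    """Return the saved sync time, or seven days before `now` if none exists."""
    now = now or datetime.now(timezone.utc)
    path = Path(settings_path)
    if path.exists():
        state = json.loads(path.read_text(encoding="utf-8"))
        # "last_sync" is a hypothetical key, assumed to hold an ISO 8601
        # timestamp with an explicit offset such as "+00:00".
        last_sync = state.get("last_sync")
        if last_sync:
            return datetime.fromisoformat(last_sync)
    # Never synced before: look back one week.
    return now - timedelta(days=7)
```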

Fetch only topics under category 34 (including its subcategories):

uv run scrape.py --category 34

Fetch topics updated since a specific UTC date:

uv run scrape.py --since 2025-11-01

You can combine options:

uv run scrape.py --since 2025-11-01 --category 34
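For readers curious how two independent, combinable options like these are typically wired up, here is a small `argparse` sketch. The flag names come from the commands above; the parser structure and the `parse_utc_date` helper are assumptions rather than scrape.py's real implementation.

```python
import argparse
from datetime import datetime, timezone


def parse_utc_date(text):
    """Interpret YYYY-MM-DD as midnight UTC on that date."""
    return datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)


def build_parser():
    parser = argparse.ArgumentParser(prog="scrape.py")
    # Both options are optional, so they can be used alone or together.
    parser.add_argument("--since", type=parse_utc_date, default=None,
                        help="only fetch topics updated since this UTC date")
    parser.add_argument("--category", type=int, default=None,
                        help="only fetch topics under this category id")
    return parser
```

With no arguments both values stay `None`, which leaves room for the default behaviour of syncing from the last run.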

Generate Markdown mirrors for every JSON thread (skips files that are already up to date):

uv run md.py
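One common way to skip up-to-date files is to compare modification times. The sketch below assumes that approach; `needs_regeneration` is a hypothetical helper, and md.py may decide freshness differently.

```python
from pathlib import Path


def needs_regeneration(json_path):
    """True when the .md mirror is missing or older than its JSON source."""
    json_path = Path(json_path)
    # Only the final ".json" suffix is swapped, so "topic.123.json"
    # maps to "topic.123.md".
    md_path = json_path.with_suffix(".md")
    if not md_path.exists():
        return True
    return json_path.stat().st_mtime > md_path.stat().st_mtime
```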

Output

  • Threads are saved under discourse/<category-slug>.<category-id>/<topic-slug>.<topic-id>.json.
  • Each file contains an array of posts (topic plus replies) with the full raw body and metadata.
  • .settings.json keeps incremental state so subsequent runs only download new or updated topics.
  • Markdown mirrors live next to the JSON exports (.md instead of .json) and only regenerate when the JSON source is newer.
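The naming scheme in the first bullet can be expressed as a small path builder. This is an illustrative sketch only: `thread_path` is a hypothetical helper, and accepting a chain of categories (parent first) is an assumption based on the nested `courses.9/mad1-kb.29/...` heading shown in the Markdown example.

```python
from pathlib import Path


def thread_path(root, categories, topic_slug, topic_id):
    """Build <root>/<slug>.<id>/.../<topic-slug>.<topic-id>.json.

    `categories` is a sequence of (slug, id) pairs, parent first.
    """
    path = Path(root)
    for slug, category_id in categories:
        path = path / f"{slug}.{category_id}"
    return path / f"{topic_slug}.{topic_id}.json"
```

Keeping the numeric id in each path component means a renamed category or topic can still be matched to its original id.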

Output examples

Example JSON thread (a topic post and one reply) captured by scrape.py:

[
  {
    "id": 190606,
    "topic_id": 190606,
    "post_number": 1,
    "username": "Mantasa",
    "created_at": "2025-11-02T12:40:05.239Z",
    "raw": "Why is the TA session recording (dated 01/11/2025) not available on the Google Spreadsheet and Calendar?\n\n@22f3000877"
  },
  {
    "id": 190907,
    "topic_id": 190606,
    "post_number": 2,
    "username": "22f3000877",
    "created_at": "2025-11-02T15:29:30.047Z",
    "raw": "The sheet has been updated now."
  }
]

The corresponding Markdown produced by md.py:

# courses.9/mad1-kb.29/about-ta-session-recording.190606

## Post 1

- User: Mantasa
- Time: 2025-11-02T12:40:05.239Z

Why is the TA session recording (dated 01/11/2025) not available on the Google Spreadsheet and Calendar?

@22f3000877

## Post 2

- User: 22f3000877
- Time: 2025-11-02T15:29:30.047Z

The sheet has been updated now.
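The mapping from the JSON array to the Markdown layout above is simple enough to sketch in a few lines. The function below is an approximation inferred from the two examples, not md.py's actual code; `posts_to_markdown` is a hypothetical name, and the title is passed in by the caller.

```python
def posts_to_markdown(title, posts):
    """Render a list of post dicts as a heading plus one section per post."""
    parts = [f"# {title}", ""]
    for post in posts:
        parts += [
            f"## Post {post['post_number']}",
            "",
            f"- User: {post['username']}",
            f"- Time: {post['created_at']}",
            "",
            post["raw"],  # the raw body is emitted unchanged
            "",
        ]
    return "\n".join(parts)
```

Only `post_number`, `username`, `created_at` and `raw` are used here, which keeps the output compact.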

About

Scrape Discourse API and convert to Markdown
