scrape.py keeps a local archive of Discourse topics. It talks to the Discourse API with your API key, fetches every post in each topic, and writes one JSON file per thread organised by category. md.py turns those JSON exports into lightweight, LLM-friendly Markdown summaries.
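As a rough illustration of the kind of request involved, the sketch below builds an authenticated GET for a single topic. It assumes Discourse's standard `Api-Key` and `Api-Username` request headers and the `/t/<topic-id>.json` endpoint. The function name `build_topic_request` is hypothetical, and this README does not show how scrape.py itself issues requests.

```python
from urllib.request import Request


def build_topic_request(base_url, topic_id, api_key, api_username):
    """Build (but do not send) an authenticated GET for one topic's JSON."""
    # Assumed endpoint shape: <base>/t/<topic-id>.json
    url = f"{base_url.rstrip('/')}/t/{topic_id}.json"
    # Assumed header names for Discourse API authentication
    headers = {"Api-Key": api_key, "Api-Username": api_username}
    return Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen`. In Discourse's topic response the posts sit under `post_stream`, and long topics typically return only a first batch, so a real scraper needs follow-up requests to collect every post.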
Place your Discourse credentials in .env. For example:
DISCOURSE_API_KEY=...
DISCOURSE_API_USERNAME=s.anand
DISCOURSE_BASE=https://discourse.onlinedegree.iitm.ac.in

Fetch threads newer than the most recent sync (falls back to seven days ago if you have never run the script):
uv run scrape.py

Fetch only topics under category 34 (including its subcategories):
uv run scrape.py --category 34

Fetch topics updated since a specific UTC date:
uv run scrape.py --since 2025-11-01

You can combine options:
uv run scrape.py --since 2025-11-01 --category 34

Generate Markdown mirrors for every JSON thread (skips files that are already up to date):
uv run md.py

- Threads are saved under discourse/<category-slug>.<category-id>/<topic-slug>.<topic-id>.json.
- Each file contains an array of posts (topic plus replies) with the full raw body and metadata.
- .settings.json keeps incremental state so subsequent runs only download new or updated topics.
- Markdown mirrors live next to the JSON exports (.md instead of .json) and only regenerate when the JSON source is newer.
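The layout and freshness rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the scripts' actual code: the function names are hypothetical, the `last_sync` key inside .settings.json is an assumed schema, and nesting subcategories as extra directories is an inference from the sample heading `courses.9/mad1-kb.29/...` shown in the Markdown example in this README.

```python
import json
from datetime import datetime, timedelta
from pathlib import Path


def thread_path(root, categories, topic_slug, topic_id):
    """Build <root>/<category-slug>.<category-id>/.../<topic-slug>.<topic-id>.json.

    `categories` is a list of (slug, id) pairs, outermost category first.
    """
    path = Path(root)
    for slug, cat_id in categories:
        path = path / f"{slug}.{cat_id}"
    return path / f"{topic_slug}.{topic_id}.json"


def needs_regeneration(json_path):
    """Return True when the .md mirror is missing or older than its JSON source."""
    md_path = json_path.with_suffix(".md")
    if not md_path.exists():
        return True
    return json_path.stat().st_mtime > md_path.stat().st_mtime


def sync_start(settings_path, now):
    """Return the last sync time, or seven days before `now` if none is recorded."""
    try:
        state = json.loads(Path(settings_path).read_text())
        # "last_sync" is a hypothetical key; the real schema is not documented here.
        return datetime.fromisoformat(state["last_sync"])
    except (FileNotFoundError, KeyError, ValueError):
        return now - timedelta(days=7)
```

Comparing modification times keeps the check cheap, but it relies on the file system preserving timestamps; a content hash would be a sturdier alternative if files are copied between machines.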
Example JSON thread (an array of posts) captured by scrape.py:
[
  {
    "id": 190606,
    "topic_id": 190606,
    "post_number": 1,
    "username": "Mantasa",
    "created_at": "2025-11-02T12:40:05.239Z",
    "raw": "Why is the TA session recording (dated 01/11/2025) not available on the Google Spreadsheet and Calendar?\n\n@22f3000877"
  },
  {
    "id": 190907,
    "topic_id": 190606,
    "post_number": 2,
    "username": "22f3000877",
    "created_at": "2025-11-02T15:29:30.047Z",
    "raw": "The sheet has been updated now."
  }
]

The corresponding Markdown produced by md.py:
# courses.9/mad1-kb.29/about-ta-session-recording.190606
## Post 1
- User: Mantasa
- Time: 2025-11-02T12:40:05.239Z
Why is the TA session recording (dated 01/11/2025) not available on the Google Spreadsheet and Calendar?
@22f3000877
## Post 2
- User: 22f3000877
- Time: 2025-11-02T15:29:30.047Z
The sheet has been updated now.
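The mapping from a JSON thread to this Markdown layout can be sketched as follows. This is an illustration based only on the two samples above, not md.py's actual code: `render_thread` is a hypothetical name, and separating blocks with blank lines is an assumption, since the sample does not show the exact spacing.

```python
def render_thread(title, posts):
    """Render a list of post dicts as a heading, then one section per post."""
    parts = [f"# {title}"]
    for post in posts:
        parts.append(f"## Post {post['post_number']}")
        parts.append(f"- User: {post['username']}\n- Time: {post['created_at']}")
        # The raw body is copied through unchanged apart from outer whitespace.
        parts.append(post["raw"].strip())
    # Assumption: blocks are separated by one blank line.
    return "\n\n".join(parts) + "\n"
```

Here `title` would be the thread's path without the .json suffix, as in the sample heading, and `posts` would be the array loaded from the JSON export with `json.load`.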