TDS Datasets

Utilities to create datasets for the Tools in Data Science course.

Crawl HTML

crawl_html.py generates a connected web of HTML files with random paths and cross-links for testing crawlers, web scrapers, and link analysis tools. It creates a hierarchical structure of HTML files (0-3 levels deep) with:

Random English words for file and directory names, seeded by the environment variable TDS_RANDOM_SEED for reproducible generation
Multiple files at each level
Cross-links between files, with every file reachable from index.html

How it works:

Structure Generation: Creates file paths using Faker (random English words), building a tree structure with varying depths
Connectivity: Ensures graph connectivity by giving every file at least one incoming link, plus random cross-links (1-3 per file)
HTML Generation: Creates minimal HTML files with titles and navigation links using relative paths
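
The three steps above can be sketched roughly as follows. This is an illustration only, not the code in crawl_html.py: it uses the standard library's random module and a small hard-coded word list in place of Faker so that it runs without dependencies, and the names make_paths, make_links, relative_href, and reachable are hypothetical. Reading the seed with os.environ.get and the default of "0" are also assumptions. The sketch keeps every file reachable by linking each new file from one that is already in the list, which is one possible way to satisfy the "at least one incoming link" rule.

```python
import os
import posixpath
import random

# Stand-in for Faker's random English words (assumption, keeps the sketch dependency-free).
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "harbor"]


def make_paths(rng, count=8, max_depth=3):
    """Return index.html plus `count` unique paths, each 0 to max_depth directories deep."""
    paths = ["index.html"]
    seen = set(paths)
    while len(paths) < count + 1:
        depth = rng.randint(0, max_depth)
        parts = [rng.choice(WORDS) for _ in range(depth)] + [rng.choice(WORDS) + ".html"]
        path = "/".join(parts)
        if path not in seen:
            seen.add(path)
            paths.append(path)
    return paths


def make_links(rng, paths):
    """Map each path to the set of paths it links to; paths[0] is the entry point."""
    links = {p: set() for p in paths}
    for i in range(1, len(paths)):
        # Link from an earlier file, which is itself already reachable from paths[0].
        links[rng.choice(paths[:i])].add(paths[i])
    for p in paths:
        # Add 1-3 random cross-links to other files.
        others = [q for q in paths if q != p]
        links[p].update(rng.sample(others, k=min(len(others), rng.randint(1, 3))))
    return links


def relative_href(source, target):
    """Relative URL from one generated file to another."""
    return posixpath.relpath(target, start=posixpath.dirname(source) or ".")


def reachable(links, start="index.html"):
    """Return every path that can be reached from `start` by following links."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in links[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


if __name__ == "__main__":
    rng = random.Random(os.environ.get("TDS_RANDOM_SEED", "0"))
    paths = make_paths(rng)
    links = make_links(rng, paths)
    for source in paths:
        print(source, "->", sorted(relative_href(source, t) for t in links[source]))
```

Passing the same seed to random.Random yields the same paths and links on every run, which is the property the seed variable is meant to provide.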

Usage:

# Generate HTML files in crawl_html/ directory
TDS_RANDOM_SEED=... python crawl_html.py

# List file paths without creating files
TDS_RANDOM_SEED=... python crawl_html.py --list

HTML Table

html_table.py generates an HTML file with 30 tables containing random data for testing table parsing, scraping, and data extraction tools. Each table has:

A numbered title (Table 1, Table 2, etc.)
50 rows and 10 columns (Col 1 through Col 10)
Random English words in each cell, generated with Faker
Content seeded by the environment variable TDS_RANDOM_SEED for reproducible generation
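
The layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in html_table.py: it swaps Faker for the standard library's random module and a small hard-coded word list, the function name build_tables is hypothetical, and the choice of an h2 element for the table title is a guess.

```python
import html
import os
import random

# Stand-in for Faker's random English words (assumption, keeps the sketch dependency-free).
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "harbor"]


def build_tables(rng, tables=30, rows=50, cols=10):
    """Return one HTML page with `tables` titled tables of `rows` x `cols` random words."""
    parts = ["<!DOCTYPE html>", "<html><body>"]
    for t in range(1, tables + 1):
        parts.append(f"<h2>Table {t}</h2>")
        parts.append("<table>")
        header = "".join(f"<th>Col {c}</th>" for c in range(1, cols + 1))
        parts.append(f"<tr>{header}</tr>")
        for _ in range(rows):
            cells = "".join(
                f"<td>{html.escape(rng.choice(WORDS))}</td>" for _ in range(cols)
            )
            parts.append(f"<tr>{cells}</tr>")
        parts.append("</table>")
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Reading the seed this way, and the default of "0", are assumptions.
    rng = random.Random(os.environ.get("TDS_RANDOM_SEED", "0"))
    page = build_tables(rng)
    print(page.count("<table>"), "tables generated")
```

Because all randomness flows through a single seeded generator, two runs with the same seed produce byte-identical pages, which makes expected answers for parsing exercises stable.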

Usage:

# Generate HTML table file in html_table/ directory
TDS_RANDOM_SEED=... python html_table.py

License

MIT
