# IOC Web Scraper

Professional-grade web scraping tool with a modern Flask interface.

Built by Intelligence on Chain - Blockchain and OSINT Investigations.
## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [API Reference](#api-reference)
- [Contributing](#contributing)
- [License](#license)
- [About Intelligence on Chain](#about-intelligence-on-chain)
## Overview

The IOC Web Scraper is a web-based application that lets users download entire websites, with all their assets, through an intuitive browser interface. Built around reliable, ethical scraping practices, it is well suited to archival, analysis, and offline content storage.
- **Ethical by Design** - Respects robots.txt and implements proper crawling etiquette
- **Complete Asset Capture** - Downloads HTML, CSS, JavaScript, images, and media files
- **Real-time Progress** - Live updates with detailed statistics during scraping
- **Modern Interface** - Professional Bootstrap UI with responsive design
- **Highly Configurable** - Adjustable depth, delays, and crawling parameters
- **Easy Export** - Packaged downloads in convenient ZIP format
## Features

### Core Features

- **Full Website Downloads** - Capture complete websites for offline viewing
- **Real-time Progress Tracking** - Monitor pages scraped, files downloaded, and current activity
- **Configurable Crawling** - Set depth limits, request delays, and respect directives
- **ZIP Packaging** - Automatic compression and organization of scraped content
- **Background Processing** - Non-blocking scraping with job management (see the sketch after this list)
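The background-processing model can be pictured as a simple in-memory job registry. The following is only an illustrative sketch; the function name `start_job` and the job dict layout are assumptions, not the app's actual internals:

```python
import threading
import uuid

# Hypothetical in-memory job registry; the real app may track jobs differently.
jobs = {}

def start_job(url, max_depth=2, delay=1.0):
    """Launch a scrape in a background thread and return its job ID immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "pages_scraped": 0, "files_downloaded": 0}

    def run():
        # ... crawl `url` here, updating jobs[job_id] as pages complete ...
        jobs[job_id]["status"] = "completed"

    threading.Thread(target=run, daemon=True).start()
    return job_id
```

Because the thread runs in the background, the web request that started the job returns at once, and clients can poll for progress by job ID.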
### Technical Foundation

- **Python Flask Backend** - Robust server-side processing
- **BeautifulSoup Integration** - Advanced HTML parsing and link extraction (illustrated below)
- **Bootstrap 5 Frontend** - Modern, responsive user interface
- **Comprehensive Logging** - Detailed logs for debugging and monitoring
- **Error Handling** - Graceful failure management and recovery
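As an illustration of the BeautifulSoup role described above, link and asset extraction typically looks like the minimal sketch below (not the project's actual code; `extract_links` is a hypothetical helper):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_links(page_url):
    """Return absolute URLs for page links and common embedded assets."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Follow <a href> for crawling; <img>/<script src> and <link href> for assets.
    links = [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]
    assets = [urljoin(page_url, t["src"]) for t in soup.find_all(["img", "script"], src=True)]
    styles = [urljoin(page_url, l["href"]) for l in soup.find_all("link", href=True)]
    return links, assets + styles
```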
### Responsible Scraping

- **Robots.txt Compliance** - Automatic respect for website crawling directives (see the sketch after this list)
- **Rate Limiting** - Configurable delays to prevent server overload
- **Job Management** - Track multiple scraping operations
- **Persistent Storage** - Organized file structure for easy access
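Robots.txt compliance of this kind is commonly built on Python's standard `urllib.robotparser`, paired with a sleep-based delay for rate limiting. A minimal sketch under those assumptions (the user agent string and function name are illustrative):

```python
import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(base_url, page_url, user_agent="IOC-WebScraper"):
    """Check a URL against the site's robots.txt before fetching it."""
    parser = RobotFileParser(urljoin(base_url, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, page_url)

# Rate limiting between requests can be as simple as:
time.sleep(1.0)  # matches the DEFAULT_DELAY of 1.0 s shown under Configuration
```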
## Quick Start

Get up and running in under 2 minutes:
```bash
# Clone the repository
gh repo clone IOCOfficial/webScraper
cd webScraper

# Run the automated installer
chmod +x webscraper.sh
./webscraper.sh
```

Open your browser to http://localhost:8080 and start scraping!
## Installation

### Prerequisites

- Python 3.8 or higher
- pip package manager
- Virtual environment (recommended)
### Automated Installation

```bash
git clone https://github.com/IOCOfficial/webScraper.git
cd webScraper
./install_and_run.sh
```

### Manual Installation

```bash
# Clone repository
git clone https://github.com/IOCOfficial/webScraper.git
cd webScraper

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```
## Usage

1. **Enter Target URL** - for example, `https://example.com`
2. **Configure Options**
   - **Crawl Depth**: 1-4 levels (how deep to follow links)
   - **Request Delay**: 0.1-10.0 seconds (politeness interval)
   - **Respect robots.txt**: enable or disable robots.txt compliance
3. **Start Scraping**
   - Click "Start Professional Scraping"
   - Monitor real-time progress
   - Download the ZIP when complete
## Configuration

### Environment Variables

```bash
export FLASK_HOST=0.0.0.0
export FLASK_PORT=8080
export FLASK_DEBUG=False
export SECRET_KEY=your-secret-key-here
```

### Scraper Defaults

```python
# In scraper.py, modify these defaults:
DEFAULT_DELAY = 1.0        # Seconds between requests
DEFAULT_TIMEOUT = 30       # Request timeout
DEFAULT_USER_AGENT = "..." # Custom user agent
```

### Scraping Parameters

| Parameter | Default | Description |
|---|---|---|
| `max_depth` | `2` | Maximum crawling depth |
| `delay` | `1.0` | Delay between requests (seconds) |
| `respect_robots` | `True` | Follow robots.txt directives |
| `timeout` | `30` | Request timeout (seconds) |
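A sketch of how `app.py` might consume the environment variables above; the variable names come from the exports shown, but the exact handling in the real app may differ:

```python
import os

from flask import Flask

app = Flask(__name__)
# Fallback value is illustrative only; always set SECRET_KEY in production.
app.secret_key = os.environ.get("SECRET_KEY", "dev-only-change-me")

if __name__ == "__main__":
    app.run(
        host=os.environ.get("FLASK_HOST", "0.0.0.0"),
        port=int(os.environ.get("FLASK_PORT", 8080)),
        debug=os.environ.get("FLASK_DEBUG", "False").lower() == "true",
    )
```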
### Output Structure

```
scraped_sites/
├── job_[uuid]/
│   ├── index.html
│   ├── assets/
│   │   ├── css/
│   │   ├── js/
│   │   └── images/
│   └── scraper.log
└── job_[uuid].zip
```
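Packaging a finished job directory into the sibling `job_[uuid].zip` shown above can be done entirely with the standard library. An illustrative sketch (the helper name is hypothetical):

```python
import shutil
from pathlib import Path

def package_job(job_dir: str) -> str:
    """Zip a completed job directory next to itself, e.g. job_<uuid>.zip."""
    job_path = Path(job_dir)
    # make_archive appends ".zip" to the base name automatically
    # and returns the path of the archive it created.
    return shutil.make_archive(str(job_path), "zip", root_dir=job_path)
```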
## API Reference

### Start a Scraping Job

```http
POST /start_scrape
Content-Type: application/json

{
  "url": "https://example.com",
  "max_depth": 2,
  "delay": 1.0,
  "respect_robots": true
}
```

### Check Job Status

```http
GET /job_status/{job_id}
```

Response:

```json
{
  "job_id": "uuid",
  "status": "running",
  "progress": 45.5,
  "pages_scraped": 12,
  "files_downloaded": 89,
  "message": "Scraping: https://example.com/page1"
}
```

### Download Results

```http
GET /download/{job_id}
```

Returns the scraped site as a ZIP file download.
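The three endpoints above can be driven end to end from a script. A sketch using the `requests` library, assuming the default host/port from Quick Start and assuming `/start_scrape` echoes the job ID in its JSON response:

```python
import time

import requests

BASE = "http://localhost:8080"  # assumed default host/port

# Start a job with the parameters documented above.
job = requests.post(f"{BASE}/start_scrape", json={
    "url": "https://example.com",
    "max_depth": 2,
    "delay": 1.0,
    "respect_robots": True,
}).json()
job_id = job["job_id"]  # assumption: the response includes the new job's ID

# Poll until the job leaves the "running" state.
while True:
    status = requests.get(f"{BASE}/job_status/{job_id}").json()
    print(f'{status["progress"]:.1f}% - {status["message"]}')
    if status["status"] != "running":
        break
    time.sleep(2)  # polling interval is an arbitrary choice

# Fetch the packaged ZIP.
with open(f"{job_id}.zip", "wb") as f:
    f.write(requests.get(f"{BASE}/download/{job_id}").content)
```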
## Use Cases

- **Website Archival** - Preserve important web content
- **Competitive Analysis** - Study competitor websites offline
- **SEO Research** - Analyze website structures and content
- **Compliance Documentation** - Capture evidence of web content
- **Testing Environments** - Create offline versions for testing
- **Content Migration** - Extract content for platform migrations
- **Academic Research** - Gather web data for studies
- **Website Analysis** - Understand site architecture and dependencies
## Contributing

We welcome contributions from the community! Here's how to get started:
```bash
# Fork the repository and clone your fork
git clone https://github.com/IOCOfficial/webScraper.git
cd webScraper

# Create a development branch
git checkout -b feature/your-feature-name

# Install development dependencies
pip install -r requirements-dev.txt

# Make your changes and test
python -m pytest tests/

# Submit a pull request
```

Found a bug or have a feature request? Please open an issue with:
- Bug description and steps to reproduce
- Your environment details (OS, Python version)
- Expected vs. actual behavior
- Screenshots, if applicable
## License

This project is licensed under the MIT License - see the LICENSE file for details. Feel free to use, modify, and distribute this software.
## About Intelligence on Chain

- **Bug Reports**: GitHub Issues
- **Discussions**: Join the Bureau!
- **Documentation**: Wiki
- **Follow Us**: Social Media