πŸ•·οΈ Intelligence on Chain Web Scraper

Python Flask License IOC

πŸš€ Professional-grade web scraping tool with modern Flask interface
Built by Intelligence on Chain - Blockchain and OSInt Investigations

📋 Table of Contents

  • 🎯 Overview
  • ✨ Features
  • 🚀 Quick Start
  • 📦 Installation
  • 💻 Usage
  • ⚙️ Configuration
  • 🔌 API Reference
  • 📊 Use Cases
  • 🤝 Contributing
  • 📄 License

🎯 Overview

The IOC Web Scraper is a sophisticated, web-based application that enables users to download entire websites with all their assets through an intuitive browser interface. Built with enterprise-grade reliability and ethical scraping practices, this tool is perfect for archival, analysis, and offline content storage.

Why Choose IOC Web Scraper?

  • πŸ›‘οΈ Ethical by Design - Respects robots.txt and implements proper crawling etiquette
  • πŸ“¦ Complete Asset Capture - Downloads HTML, CSS, JavaScript, images, and media files
  • ⚑ Real-time Progress - Live updates with detailed statistics during scraping
  • 🎨 Modern Interface - Professional Bootstrap UI with responsive design
  • πŸ”§ Highly Configurable - Adjustable depth, delays, and crawling parameters
  • πŸ“ Easy Export - Packaged downloads in convenient ZIP format

✨ Features

Core Functionality

  • 🌐 Full Website Downloads - Capture complete websites for offline viewing
  • 📊 Real-time Progress Tracking - Monitor pages scraped, files downloaded, and current activity
  • ⚙️ Configurable Crawling - Set depth limits, request delays, and respect directives
  • 📦 ZIP Packaging - Automatic compression and organization of scraped content
  • 🔄 Background Processing - Non-blocking scraping with job management

Technical Highlights

  • 🐍 Python Flask Backend - Robust server-side processing
  • 🔧 BeautifulSoup Integration - Advanced HTML parsing and link extraction
  • 🎨 Bootstrap 5 Frontend - Modern, responsive user interface
  • 📝 Comprehensive Logging - Detailed logs for debugging and monitoring
  • 🛡️ Error Handling - Graceful failure management and recovery

Professional Features

  • 🤖 Robots.txt Compliance - Automatic respect for website crawling directives
  • ⏱️ Rate Limiting - Configurable delays to prevent server overload
  • 📋 Job Management - Track multiple scraping operations
  • 💾 Persistent Storage - Organized file structure for easy access
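Robots.txt compliance of the kind listed above can be handled entirely with the standard library; a minimal sketch (the helper name and user agent string are illustrative, not taken from the project's code):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="IOC-WebScraper"):
    """Return True if robots.txt rules permit `user_agent` to fetch `url`.

    `robots_txt` is the text of the site's robots.txt file; a real crawler
    would fetch it once per host from https://<host>/robots.txt and cache it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```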

🚀 Quick Start

Get up and running in under 2 minutes:

# Clone the repository
gh repo clone IOCOfficial/webScraper
cd webScraper

# Run the automated installer
chmod +x webscraper.sh
./webscraper.sh

Open your browser to http://localhost:8080 and start scraping! 🎉

📦 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Virtual environment (recommended)

Method 1: Automated Installation (Recommended)

git clone https://github.com/IOCOfficial/webScraper.git
cd webScraper
./install_and_run.sh

Method 2: Manual Installation

# Clone repository
git clone https://github.com/IOCOfficial/webScraper.git
cd webScraper

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py

💻 Usage

Basic Web Scraping

  1. Enter Target URL

    https://example.com
    
  2. Configure Options

    • Crawl Depth: 1-4 levels (how deep to follow links)
    • Request Delay: 0.1-10.0 seconds (politeness interval)
    • Respect robots.txt: Enable/disable robots.txt compliance
  3. Start Scraping

    • Click "Start Professional Scraping"
    • Monitor real-time progress
    • Download ZIP when complete

Advanced Configuration

Environment Variables

export FLASK_HOST=0.0.0.0
export FLASK_PORT=8080
export FLASK_DEBUG=False
export SECRET_KEY=your-secret-key-here
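A Flask app typically consumes such variables via os.environ; a sketch of how app.py might read them (the variable names come from the list above, but the defaults and the code itself are assumptions, not the project's actual implementation):

```python
import os

# Fall back to sensible defaults when the variables are unset.
HOST = os.environ.get("FLASK_HOST", "127.0.0.1")
PORT = int(os.environ.get("FLASK_PORT", "8080"))
DEBUG = os.environ.get("FLASK_DEBUG", "False").lower() == "true"
SECRET_KEY = os.environ.get("SECRET_KEY", "change-me")

# Typical usage inside a Flask app:
# app.config["SECRET_KEY"] = SECRET_KEY
# app.run(host=HOST, port=PORT, debug=DEBUG)
```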

Customizing Scraper Behavior

# In scraper.py, modify these defaults:
DEFAULT_DELAY = 1.0        # Seconds between requests
DEFAULT_TIMEOUT = 30       # Request timeout
DEFAULT_USER_AGENT = "..."  # Custom user agent

βš™οΈ Configuration

Scraping Parameters

| Parameter | Default | Description |
|---|---|---|
| max_depth | 2 | Maximum crawling depth |
| delay | 1.0 | Delay between requests (seconds) |
| respect_robots | True | Follow robots.txt directives |
| timeout | 30 | Request timeout (seconds) |

File Structure

scraped_sites/
├── job_[uuid]/
│   ├── index.html
│   ├── assets/
│   │   ├── css/
│   │   ├── js/
│   │   └── images/
│   └── scraper.log
└── job_[uuid].zip
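The ZIP packaging step that produces job_[uuid].zip can be sketched with the standard library, preserving the relative layout shown above (a hypothetical helper, not the project's actual implementation):

```python
import zipfile
from pathlib import Path

def package_job(job_dir, zip_path):
    """Zip every file under job_dir, keeping relative paths; return the file count."""
    root = Path(job_dir)
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(root))
                count += 1
    return count
```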

🔌 API Reference

REST Endpoints

Start Scraping Job

POST /start_scrape
Content-Type: application/json

{
    "url": "https://example.com",
    "max_depth": 2,
    "delay": 1.0,
    "respect_robots": true
}

Check Job Status

GET /job_status/{job_id}

Response:
{
    "job_id": "uuid",
    "status": "running",
    "progress": 45.5,
    "pages_scraped": 12,
    "files_downloaded": 89,
    "message": "Scraping: https://example.com/page1"
}

Download Results

GET /download/{job_id}

Returns: ZIP file download
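These endpoints can be driven from a script; a minimal client sketch using only the standard library (the base URL and helper names are illustrative, and a running server is required for the commented-out calls):

```python
import json
from urllib import request

BASE_URL = "http://localhost:8080"

def build_start_request(url, max_depth=2, delay=1.0, respect_robots=True):
    """Build the POST /start_scrape request object without sending it."""
    payload = json.dumps({
        "url": url,
        "max_depth": max_depth,
        "delay": delay,
        "respect_robots": respect_robots,
    }).encode("utf-8")
    return request.Request(
        BASE_URL + "/start_scrape",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def start_scrape(url, **options):
    """Send the request and decode the server's JSON reply."""
    with request.urlopen(build_start_request(url, **options)) as resp:
        return json.load(resp)

# Typical flow (requires the server to be running):
# job = start_scrape("https://example.com", max_depth=2)
# status = json.load(request.urlopen(f"{BASE_URL}/job_status/{job['job_id']}"))
# ZIP download URL: f"{BASE_URL}/download/{job['job_id']}"
```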

📊 Use Cases

Business Applications

  • 🏢 Website Archival - Preserve important web content
  • 📈 Competitive Analysis - Study competitor websites offline
  • 🔍 SEO Research - Analyze website structures and content
  • 📋 Compliance Documentation - Capture evidence of web content

Development & Research

  • 🧪 Testing Environments - Create offline versions for testing
  • 📚 Content Migration - Extract content for platform migrations
  • 🎓 Academic Research - Gather web data for studies
  • 🔧 Website Analysis - Understand site architecture and dependencies

🤝 Contributing

We welcome contributions from the community! Here's how to get started:

Development Setup

# Fork the repository and clone your fork
git clone https://github.com/IOCOfficial/webScraper.git
cd webScraper

# Create a development branch
git checkout -b feature/your-feature-name

# Install development dependencies
pip install -r requirements-dev.txt

# Make your changes and test
python -m pytest tests/

# Submit a pull request

Reporting Issues

Found a bug or have a feature request? Please open an issue with:

  • πŸ› Bug description and steps to reproduce
  • πŸ’» Your environment details (OS, Python version)
  • πŸ“ Expected vs actual behavior
  • πŸ“Έ Screenshots if applicable

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License - feel free to use, modify, and distribute this software.

📞 Support & Community


Built with ❤️ by the IOC Team

⭐ Star this repository if you find it useful!

Intelligence on Chain
