πŸ•·οΈ Intelligence on Chain Web Scraper

Python Flask License IOC

πŸš€ Professional-grade web scraping tool with modern Flask interface
Built by Intelligence on Chain - Blockchain and OSInt Investigations

📋 Table of Contents

  • 🎯 Overview
  • ✨ Features
  • 🚀 Quick Start
  • 📦 Installation
  • 💻 Usage
  • ⚙️ Configuration
  • 🔌 API Reference
  • 📊 Use Cases
  • 🤝 Contributing
  • 📄 License

🎯 Overview

The IOC Web Scraper is a sophisticated, web-based application that enables users to download entire websites with all their assets through an intuitive browser interface. Built with enterprise-grade reliability and ethical scraping practices, this tool is perfect for archival, analysis, and offline content storage.

Why Choose IOC Web Scraper?

  • πŸ›‘οΈ Ethical by Design - Respects robots.txt and implements proper crawling etiquette
  • πŸ“¦ Complete Asset Capture - Downloads HTML, CSS, JavaScript, images, and media files
  • ⚑ Real-time Progress - Live updates with detailed statistics during scraping
  • 🎨 Modern Interface - Professional Bootstrap UI with responsive design
  • πŸ”§ Highly Configurable - Adjustable depth, delays, and crawling parameters
  • πŸ“ Easy Export - Packaged downloads in convenient ZIP format

✨ Features

Core Functionality

  • 🌐 Full Website Downloads - Capture complete websites for offline viewing
  • 📊 Real-time Progress Tracking - Monitor pages scraped, files downloaded, and current activity
  • ⚙️ Configurable Crawling - Set depth limits, request delays, and respect directives
  • 📦 ZIP Packaging - Automatic compression and organization of scraped content
  • 🔄 Background Processing - Non-blocking scraping with job management

Technical Highlights

  • 🐍 Python Flask Backend - Robust server-side processing
  • 🔧 BeautifulSoup Integration - Advanced HTML parsing and link extraction
  • 🎨 Bootstrap 5 Frontend - Modern, responsive user interface
  • 📝 Comprehensive Logging - Detailed logs for debugging and monitoring
  • 🛡️ Error Handling - Graceful failure management and recovery

Professional Features

  • 🤖 Robots.txt Compliance - Automatic respect for website crawling directives
  • ⏱️ Rate Limiting - Configurable delays to prevent server overload
  • 📋 Job Management - Track multiple scraping operations
  • 💾 Persistent Storage - Organized file structure for easy access
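Robots.txt compliance of the kind listed above can be handled entirely with the standard library; a minimal sketch (the helper name and user agent string are illustrative, not taken from the project's code):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="IOC-WebScraper"):
    """Return True if robots.txt rules permit `user_agent` to fetch `url`.

    `robots_txt` is the text of the site's robots.txt file; a real crawler
    would fetch it once per host from https://<host>/robots.txt and cache it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```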

🚀 Quick Start

Get up and running in under 2 minutes:

# Clone the repository
gh repo clone IOCOfficial/webScraper
cd webScraper

# Run the automated installer
chmod +x webscraper.sh
./webscraper.sh

Open your browser to http://localhost:8080 and start scraping! 🎉

📦 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Virtual environment (recommended)

Method 1: Automated Installation (Recommended)

git clone https://github.com/IOCOfficial/webScraper.git
cd webScraper
./install_and_run.sh

Method 2: Manual Installation

# Clone repository
git clone https://github.com/IOCOfficial/webScraper.git
cd webScraper

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py

💻 Usage

Basic Web Scraping

  1. Enter Target URL

    https://example.com
    
  2. Configure Options

    • Crawl Depth: 1-4 levels (how deep to follow links)
    • Request Delay: 0.1-10.0 seconds (politeness interval)
    • Respect robots.txt: Enable/disable robots.txt compliance
  3. Start Scraping

    • Click "Start Professional Scraping"
    • Monitor real-time progress
    • Download ZIP when complete

Advanced Configuration

Environment Variables

export FLASK_HOST=0.0.0.0
export FLASK_PORT=8080
export FLASK_DEBUG=False
export SECRET_KEY=your-secret-key-here
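A Flask app typically consumes such variables via os.environ; a sketch of how app.py might read them (the variable names come from the list above, but the defaults and the code itself are assumptions, not the project's actual implementation):

```python
import os

# Fall back to sensible defaults when the variables are unset.
HOST = os.environ.get("FLASK_HOST", "127.0.0.1")
PORT = int(os.environ.get("FLASK_PORT", "8080"))
DEBUG = os.environ.get("FLASK_DEBUG", "False").lower() == "true"
SECRET_KEY = os.environ.get("SECRET_KEY", "change-me")

# Typical usage inside a Flask app:
# app.config["SECRET_KEY"] = SECRET_KEY
# app.run(host=HOST, port=PORT, debug=DEBUG)
```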

Customizing Scraper Behavior

# In scraper.py, modify these defaults:
DEFAULT_DELAY = 1.0        # Seconds between requests
DEFAULT_TIMEOUT = 30       # Request timeout
DEFAULT_USER_AGENT = "..."  # Custom user agent

βš™οΈ Configuration

Scraping Parameters

| Parameter | Default | Description |
|---|---|---|
| max_depth | 2 | Maximum crawling depth |
| delay | 1.0 | Delay between requests (seconds) |
| respect_robots | True | Follow robots.txt directives |
| timeout | 30 | Request timeout (seconds) |

File Structure

scraped_sites/
├── job_[uuid]/
│   ├── index.html
│   ├── assets/
│   │   ├── css/
│   │   ├── js/
│   │   └── images/
│   └── scraper.log
└── job_[uuid].zip
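The ZIP packaging step that produces job_[uuid].zip can be sketched with the standard library, preserving the relative layout shown above (a hypothetical helper, not the project's actual implementation):

```python
import zipfile
from pathlib import Path

def package_job(job_dir, zip_path):
    """Zip every file under job_dir, keeping relative paths; return the file count."""
    root = Path(job_dir)
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(root))
                count += 1
    return count
```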

🔌 API Reference

REST Endpoints

Start Scraping Job

POST /start_scrape
Content-Type: application/json

{
    "url": "https://example.com",
    "max_depth": 2,
    "delay": 1.0,
    "respect_robots": true
}

Check Job Status

GET /job_status/{job_id}

Response:
{
    "job_id": "uuid",
    "status": "running",
    "progress": 45.5,
    "pages_scraped": 12,
    "files_downloaded": 89,
    "message": "Scraping: https://example.com/page1"
}

Download Results

GET /download/{job_id}

Returns: ZIP file download
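These endpoints can be driven from a script; a minimal client sketch using only the standard library (the base URL and helper names are illustrative, and a running server is required for the commented-out calls):

```python
import json
from urllib import request

BASE_URL = "http://localhost:8080"

def build_start_request(url, max_depth=2, delay=1.0, respect_robots=True):
    """Build the POST /start_scrape request object without sending it."""
    payload = json.dumps({
        "url": url,
        "max_depth": max_depth,
        "delay": delay,
        "respect_robots": respect_robots,
    }).encode("utf-8")
    return request.Request(
        BASE_URL + "/start_scrape",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def start_scrape(url, **options):
    """Send the request and decode the server's JSON reply."""
    with request.urlopen(build_start_request(url, **options)) as resp:
        return json.load(resp)

# Typical flow (requires the server to be running):
# job = start_scrape("https://example.com", max_depth=2)
# status = json.load(request.urlopen(f"{BASE_URL}/job_status/{job['job_id']}"))
# ZIP download URL: f"{BASE_URL}/download/{job['job_id']}"
```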

📊 Use Cases

Business Applications

  • 🏢 Website Archival - Preserve important web content
  • 📈 Competitive Analysis - Study competitor websites offline
  • 🔍 SEO Research - Analyze website structures and content
  • 📋 Compliance Documentation - Capture evidence of web content

Development & Research

  • 🧪 Testing Environments - Create offline versions for testing
  • 📚 Content Migration - Extract content for platform migrations
  • 🎓 Academic Research - Gather web data for studies
  • 🔧 Website Analysis - Understand site architecture and dependencies

🤝 Contributing

We welcome contributions from the community! Here's how to get started:

Development Setup

# Fork the repository and clone your fork
git clone https://github.com/IOCOfficial/webScraper.git
cd webScraper

# Create a development branch
git checkout -b feature/your-feature-name

# Install development dependencies
pip install -r requirements-dev.txt

# Make your changes and test
python -m pytest tests/

# Submit a pull request

Reporting Issues

Found a bug or have a feature request? Please open an issue with:

  • πŸ› Bug description and steps to reproduce
  • πŸ’» Your environment details (OS, Python version)
  • πŸ“ Expected vs actual behavior
  • πŸ“Έ Screenshots if applicable

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License - feel free to use, modify, and distribute this software.

📞 Support & Community


Built with ❤️ by the IOC Team

⭐ Star this repository if you find it useful!

Intelligence on Chain
