AutoScraper is a web application designed to streamline the process of searching for vehicles on AutoTrader.ca. It allows users to define detailed search criteria (payloads), execute searches concurrently, filter results based on inclusion/exclusion keywords, save search parameters and results, and leverage AI for in-depth analysis of specific listings.
Finding specific vehicles, especially niche models or configurations, on large platforms like AutoTrader can be time-consuming. Users often need to manually sift through numerous listings, apply filters repeatedly, and cross-reference information externally (e.g., reliability data, market pricing). AutoScraper aims to solve this by:
- Automating Searches: Running searches based on saved, detailed criteria.
- Batch Processing: Fetching and processing large numbers of listings efficiently.
- Advanced Filtering: Allowing users to specify exclusion keywords (e.g., "salvage", "rebuilt") and required inclusion keywords (e.g., "leather", "navigation") to quickly narrow down relevant listings.
- Persistence: Saving search payloads and results for later use and review.
- AI-Powered Insights: Providing automated analysis of a listing's reliability, price competitiveness (including currency conversion awareness), and negotiation points directly within the application.
- Web-Based UI: User-friendly interface built with Flask and Bootstrap.
- User Authentication: Secure login/registration using Firebase Authentication.
- Detailed Search Criteria: Specify Make, Model, Year Range, Price Range, Mileage Range, Location, Distance, New/Used status, and Photo requirements.
- Keyword Filtering:
- Exclusions: Automatically filter out listings containing specified keywords in their details.
- Inclusions: Filter results to only show listings containing a specific required keyword (see the filtering sketch after this feature list).
- Concurrent Scraping: Utilizes `concurrent.futures` for faster fetching of search result pages and individual vehicle details from AutoTrader.ca (see the concurrency sketch after this feature list).
- Payload Management:
- Create and save complex search parameter sets (payloads) to Firebase Firestore.
- Load, rename, and delete saved payloads.
- Results Management:
- Save search results (including detailed vehicle info) to Firebase Firestore.
- View saved results in a sortable table within a modal.
- Rename and delete saved result sets.
- CSV Export: Download search results as a CSV file.
- Batch Link Opening: Open multiple selected listing links in browser tabs.
- AI Analysis (Google Gemini & Custom Search):
- For any listing in the results, trigger an AI analysis.
- Performs targeted web searches (using Google Custom Search) for reliability info, common problems, reviews, TSBs, etc.
- Fetches current exchange rates (CAD/USD/EUR/GBP).
- Sends car details, web search context, and exchange rates to Google Gemini AI.
- Displays a formatted summary including:
- Reliability assessment (known issues, positives, things to check).
- Price analysis and deal rating (considering mileage and currency conversion).
- Actionable negotiation tips.
- Includes retry logic with exponential backoff for web search API calls.
- Proxy Support: Configurable proxy usage for scraping requests (`proxyconfig.json`).
- Logging: Records scraper activity and potential errors to `autoscraper.log`.
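As a rough illustration of the exclusion/inclusion filtering described above, here is a minimal sketch (field and function names are hypothetical, not the actual `AutoScraper.py` implementation):

```python
def keyword_filter(listings, exclusions=(), inclusion=None):
    """Drop listings whose details mention any exclusion keyword;
    optionally require one inclusion keyword. 'listings' is assumed
    to be a list of dicts of scraped fields (an assumption)."""
    kept = []
    for car in listings:
        text = " ".join(str(v) for v in car.values()).lower()
        if any(word.lower() in text for word in exclusions):
            continue  # e.g., skip "salvage" or "rebuilt" listings
        if inclusion and inclusion.lower() not in text:
            continue  # require e.g. "leather" somewhere in the details
        kept.append(car)
    return kept
```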
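The concurrent fetching follows the standard `concurrent.futures` pattern; a minimal sketch under that assumption (helper names are illustrative, not the project's actual code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch_page(url):
    """Fetch a single page, raising on HTTP errors."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

def fetch_all(urls, max_workers=10):
    """Fetch many pages concurrently, collecting successes and reporting failures."""
    pages = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_page, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                pages[url] = fut.result()
            except requests.RequestException as exc:
                print(f"Failed to fetch {url}: {exc}")
    return pages
```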
- Backend: Python, Flask
- Frontend: HTML, CSS, JavaScript, Bootstrap 5
- Scraping: `requests`, `BeautifulSoup4`
- Concurrency: `concurrent.futures`
- Database/Auth: Firebase (Authentication, Firestore)
- AI: Google Generative AI (Gemini), Google Custom Search API
- Exchange Rates: ExchangeRate-API.com
- Markdown Parsing: Showdown.js (via CDN)
```
.
├── .gitattributes
├── .gitignore
├── app.py                     # Flask application logic, API routes
├── auth_decorator.py          # Decorator for requiring login on routes
├── AutoScraper.code-workspace # VS Code workspace file
├── AutoScraper.py             # Core AutoTrader scraping logic
├── AutoScraperBackground.jpg  # Background image
├── AutoScraperUtil.py         # Utility functions for scraping, parsing, file handling
├── config.json                # Stores API keys (GEMINI, SEARCH, EXCHANGE_RATE) - DO NOT COMMIT
├── firebase_config.py         # Firebase initialization and interaction functions
├── GetUserSelection.py        # Console input functions
├── proxyconfig.json           # Proxy configuration (optional) - DO NOT COMMIT if contains credentials
├── Queries/                   # Directory for saving console payloads
├── Results/                   # Directory for saving CSV results
├── static/
│   └── favicon.ico            # Favicon
├── templates/
│   ├── index.html             # Main application page template
│   ├── login.html             # Login page template
│   ├── logout.html            # Logout intermediate page template
│   └── register.html          # Registration page template
└── README.md                  # This file
```
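For orientation, `auth_decorator.py` provides the login guard for protected routes. A minimal sketch of what such a Flask decorator typically looks like (the session key name is an assumption; the actual file may differ):

```python
from functools import wraps

from flask import redirect, session, url_for

def login_required(view):
    """Redirect to the login page unless the session carries a user."""
    @wraps(view)
    def wrapped(*args, **kwargs):
        if "user" not in session:  # "user" session key is an assumption
            return redirect(url_for("login"))
        return view(*args, **kwargs)
    return wrapped
```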
A rewrite of this application in Rust is currently underway, located in the `autoscraper_rust/` directory.
Motivation:
- Explore the performance benefits and type safety offered by Rust.
- Leverage modern asynchronous programming with Tokio and Axum.
- Build a more robust and potentially scalable backend.
Technology Stack (Rust Version):
- Backend Framework: Axum
- Asynchronous Runtime: Tokio
- HTTP Client: Reqwest (for AutoTrader API, VDP scraping, Firebase REST API)
- HTML Parsing: scraper crate
- Serialization/Deserialization: Serde (Serde JSON)
- Templating: Askama
- Authentication: Firebase ID Token verification (using `jsonwebtoken`, Google public keys via `reqwest`), Axum extractor for route protection.
- Database: Firebase Firestore (via REST API using `reqwest`)
- Configuration: `config` crate, dotenv
- Logging: `tracing`
- CSV Handling: `csv` crate
- Concurrency: Tokio tasks, `futures::stream::FuturesUnordered`
- Frontend: Similar HTML, CSS, and JavaScript structure to the Python version, served by Axum with `tower-http`.
Current Implemented Features:
- Web Server: Axum server setup with Tokio runtime.
- Static Files & Pages: Serving static assets (CSS, JS) and HTML templates (landing, login, main app) using Askama.
- Configuration: Loading settings from `.env` and `config.toml`.
- Proxy Support: Reads `proxyconfig.json` and sets `HTTP_PROXY`/`HTTPS_PROXY` environment variables for `reqwest`.
- Logging: Uses `tracing` for application logs.
- AutoTrader Metadata API: Endpoints to fetch Makes, Models, Trims, Colors.
- Core Scraping:
  - `/api/search` endpoint initiates a search based on provided parameters.
  - Fetches listing summaries from AutoTrader's search endpoint with retries.
  - Concurrently fetches detailed Vehicle Detail Pages (VDPs) for each listing using `reqwest` and `FuturesUnordered`.
  - Includes robust retry logic and concurrency limits for VDP fetching to handle rate limiting (HTTP 429).
  - Parses VDP HTML (specifically embedded JSON data) using the `scraper` crate to extract detailed vehicle information.
  - Saves the detailed results to a CSV file in `Results/Make_Model/`.
- Authentication:
  - `/login` endpoint verifies a Firebase ID token sent from the frontend.
  - Fetches Google's public keys to validate the token signature and claims (audience, issuer, expiry).
  - Provides an `AuthenticatedUser` extractor to protect specific API routes (requires an `Authorization: Bearer <token>` header).
- Firestore Integration (REST API):
  - Authenticated endpoints (`/api/payloads`, `/api/settings`) to save and retrieve search payloads and user settings from Firestore.
Features NOT Yet Implemented (Compared to Python Version):
- Persistent User Sessions: While tokens are verified on login/API calls, there's no session cookie mechanism to keep users logged in across browser sessions.
- AI Analysis: No integration with Google Gemini or Custom Search API.
- Results Management UI: Saved results are only exported to CSV; no UI for viewing, managing, or analyzing saved results within the app.
- Detailed Keyword Filtering: Filtering based on VDP details (engine, transmission, etc.) is not implemented (only initial title exclusion/inclusion).
- User Registration Flow: Only login token verification exists.
- Other Static Pages: Routes/handlers for About, Pricing, Terms, Logout pages are not yet implemented.
(Setup and usage instructions for the Rust version will be added to this README or within the `autoscraper_rust/` directory as development progresses.)
- Clone the Repository:
  ```bash
  git clone <repository-url>
  cd AutoScraper
  ```
- Create a Virtual Environment (Recommended):
  ```bash
  python -m venv venv
  # On Windows
  venv\Scripts\activate
  # On macOS/Linux
  source venv/bin/activate
  ```
- Install Dependencies:
  ```bash
  pip install Flask requests beautifulsoup4 google-api-python-client google-generativeai firebase-admin
  # If using lxml for potentially faster parsing (optional):
  # pip install lxml
  ```
  (Note: Ensure `lxml` is installed if you modify `AutoScraperUtil.py` to use it as the BeautifulSoup parser.)
This application requires several API keys to function fully, especially for the AI analysis features.
- Create `config.json`: In the project's root directory, create a file named `config.json`.
- Add API Keys: Populate `config.json` with your keys. IMPORTANT: Keep this file secure and do not commit it to version control (it should be listed in `.gitignore`).
  ```json
  {
    "GEMINI_API_KEY": "YOUR_GOOGLE_AI_GEMINI_API_KEY",
    "SEARCH_API_KEY": "YOUR_GOOGLE_CUSTOM_SEARCH_API_KEY",
    "SEARCH_ENGINE_ID": "YOUR_PROGRAMMABLE_SEARCH_ENGINE_ID_CX",
    "EXCHANGE_RATE_API_KEY": "YOUR_EXCHANGERATE-API.COM_API_KEY"
  }
  ```
  - `GEMINI_API_KEY`: Obtain from Google AI Studio.
  - `SEARCH_API_KEY`: Obtain from Google Cloud Console (enable "Custom Search API").
  - `SEARCH_ENGINE_ID`: Obtain from the Google Programmable Search Engine control panel after creating a search engine instance. Configure this instance to search relevant sites (e.g., reddit.com/r/cars, specific car forums) or the entire web (this project currently searches the entire web).
  - `EXCHANGE_RATE_API_KEY`: Obtain a free key from ExchangeRate-API.com.
- Firebase Configuration:
- Set up a Firebase project at https://firebase.google.com/.
- Enable Authentication (Email/Password method).
- Enable Firestore database.
- Download your Firebase Admin SDK service account key (a JSON file).
- Crucially: Update `firebase_config.py` with the correct path to your downloaded service account key file and potentially your Firebase project details if they differ from the placeholders.
- Update the `firebaseConfig` object in `templates/login.html`, `templates/register.html`, and `templates/index.html` with your Firebase project's web configuration details (apiKey, authDomain, projectId, etc.).
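For reference, Firebase Admin SDK initialization of the kind `firebase_config.py` performs typically looks like the sketch below (the key path is a placeholder; the actual module may differ):

```python
import firebase_admin
from firebase_admin import credentials, firestore

# Path to the downloaded service account key (placeholder).
cred = credentials.Certificate("path/to/serviceAccountKey.json")
firebase_admin.initialize_app(cred)

# Firestore client used for saving payloads and results.
db = firestore.client()
```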
- Proxy Configuration (Optional):
  - If you need to use proxies for scraping, create a `proxyconfig.json` file in the root directory. Format it according to how `AutoScraper.py` expects to read it, likely:
    ```json
    {
      "http": "http://user:pass@host:port",
      "https": "https://user:pass@host:port"
    }
    ```
- Run the Flask Application:
  ```bash
  python app.py
  ```
- Access the Web Interface: Open your browser and navigate to `http://localhost:5000`.
- Register/Login: Use the web interface to register a new account or log in. Authentication is handled by Firebase.
- Define Search: Use the form to select Make, Model, Year, Price, Mileage, Location, and other filters (including exclusion/inclusion keywords).
- Create Payload: Click "Create Payload" to generate the search parameters based on the form.
- Save Payload (Optional): Click "Save Current" to save the current payload to Firebase for later use. You'll be prompted to name it.
- Load Payload (Optional): Select a previously saved payload from the dropdown and click "Load Selected" to populate the form.
- Fetch Data: Click "Fetch Data". The backend will start scraping AutoTrader based on the current payload. Progress may be logged in the console where `app.py` is running.
- View Results: Once fetching is complete, the "Current Search Results" card will update. Click "View Selected" under "Saved Results Management" (after selecting the newly created result set) to open a modal displaying the listings in a table.
- Analyze Results:
- In the results modal table, click the "Analyze" button next to any listing.
- A new modal will appear, showing a loading indicator while the backend performs web searches and calls the Gemini AI.
- The AI's analysis (reliability, price comparison, negotiation tips) will be displayed in the modal.
- Other Actions:
- Use "Open Links" / "Open Selected Links" to open listings in your browser.
- Use "Download CSV" / "Download Selected Result" to get the data in CSV format.
- Use the "Rename" and "Delete" buttons in the Payload and Results management sections as needed.
- Input: Takes Make, Model, Year, Trim, Price (CAD), and Kilometers of the selected car.
- Web Search: Uses Google Custom Search API to query for reliability, common problems, reviews, TSBs, complaints, and specific forums (e.g., Reddit). Uses retry logic for robustness (a backoff sketch follows this list).
- Exchange Rates: Fetches current CAD vs. USD/EUR/GBP rates via ExchangeRate-API.com.
- AI Model: Uses Google Gemini (configured in `app.py`).
- Prompting: Provides the AI with structured car details, exchange rates, and web search snippets, and asks for specific analysis points: Reliability Summary, Price Analysis (with currency awareness and deal rating), and Negotiation Tips (see the pipeline sketch after this list).
- Output: Displays the AI's formatted response (Markdown rendered as HTML) in a dedicated modal.
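The retry behaviour mentioned above follows the usual exponential-backoff pattern; a minimal sketch (the `run_search` wrapper is hypothetical):

```python
import random
import time

def search_with_backoff(query, max_retries=4, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return run_search(query)  # hypothetical wrapper around the Custom Search call
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```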
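End to end, the flow could look roughly like the following. This is a hedged sketch, not the actual `app.py` code; the Gemini model name and the car field names are assumptions:

```python
import requests
import google.generativeai as genai

def get_rates(api_key):
    """Fetch CAD-based USD/EUR/GBP rates from ExchangeRate-API.com."""
    url = f"https://v6.exchangerate-api.com/v6/{api_key}/latest/CAD"
    rates = requests.get(url, timeout=15).json()["conversion_rates"]
    return {c: rates[c] for c in ("USD", "EUR", "GBP")}

def analyze_listing(car, snippets, rates, gemini_key):
    """Assemble the structured prompt and ask Gemini for the three analysis points."""
    genai.configure(api_key=gemini_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is illustrative
    prompt = (
        f"Car: {car['year']} {car['make']} {car['model']} {car['trim']}, "
        f"{car['kilometres']} km, asking ${car['price']} CAD.\n"
        f"Exchange rates (CAD base): {rates}\n"
        f"Web search context:\n{snippets}\n\n"
        "Provide: 1) Reliability Summary, 2) Price Analysis with a deal rating "
        "(consider mileage and currency conversion), 3) Negotiation Tips."
    )
    return model.generate_content(prompt).text
```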
- More sophisticated web scraping error handling.
- User configuration for target forums in Programmable Search Engine.
- Caching exchange rates to reduce API calls.
- Background task queue (e.g., Celery) for long-running scrape jobs instead of blocking web requests.
- More detailed progress reporting during scraping.
- UI improvements for displaying AI analysis and search context.
- Allowing users to customize the AI prompt.
- Integration with other data sources (e.g., CarFax reports if APIs exist).
- TODO: Implement robust security measures (input validation, rate limiting, dependency scanning).
- TODO: Refactor backend to handle concurrent data fetch requests from multiple users efficiently and without blocking (e.g., using asynchronous tasks/workers like Celery or RQ).
- TODO: Improve code robustness and scalability (e.g., comprehensive testing, better error handling, modular design).
- TODO: Enhance mobile-friendliness (Responsive Design):
- Review and adjust Bootstrap grid layouts (`col-*`, `col-sm-*`, `col-md-*`, `col-lg-*`) for optimal display on various screen sizes.
- Use CSS media queries to apply specific styles for different viewport widths (e.g., adjusting font sizes, margins, padding).
- Ensure interactive elements (buttons, forms, modals) are easily usable on touchscreens (sufficient size and spacing).
- Test the application thoroughly on different mobile devices and browsers (or using browser developer tools).