Grawler

Web crawler that discovers and visits relative links on a website.

Grawler is a web crawler written in Go. It scrapes the website at the given URL, finds all relative links, and visits those URLs. The application was originally developed to build up a page cache and to check the availability of existing pages.

Install

Binary

Download and use a binary suitable for your system from the prebuilt releases.

Go

If you have Go installed (version 1.22.3 or later), you can use go install to install the application on your system.

go install github.com/robole-dev/grawler@latest

Usage

grawler grawl <url>

Example

grawler grawl https://www.google.de

All features are listed under the help flag:

grawler -h

More examples below.

Examples

Crawl website

grawler grawl https://books.toscrape.com

Crawls the given URL.
Searches for anchor tags with href attributes (<a href="...">) and crawls those URLs too.
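The two steps above can be sketched in Go using only the standard library. This is an illustration, not Grawler's actual implementation: the function name extractLinks and the regular expression are assumptions, and a production crawler would normally use a real HTML parser rather than a regex.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// hrefPattern is a deliberately simple matcher for <a ... href="..."> attributes.
// It is an illustrative assumption and ignores single-quoted or unquoted hrefs.
var hrefPattern = regexp.MustCompile(`(?i)<a\s[^>]*?href\s*=\s*"([^"]*)"`)

// extractLinks returns every anchor href found in body, resolved against baseURL.
func extractLinks(baseURL, body string) ([]string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	var links []string
	for _, m := range hrefPattern.FindAllStringSubmatch(body, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue // skip hrefs that are not valid URLs
		}
		// ResolveReference turns relative links into absolute ones.
		links = append(links, base.ResolveReference(ref).String())
	}
	return links, nil
}

func main() {
	page := `<a href="catalogue/page-2.html">next</a> <a class="x" href="/about">about</a>`
	links, err := extractLinks("https://books.toscrape.com/index.html", page)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

Each resolved link would then be fetched and scanned in the same way.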

Save result to a CSV-file

grawler grawl https://books.toscrape.com -o out.csv

Allow parallel requests

Set to 8 requests in parallel

grawler grawl https://books.toscrape.com -l 8
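To illustrate what a parallelism limit means in practice, here is a minimal Go sketch that caps the number of concurrent visits with a buffered channel used as a semaphore. The name visitAll and the callback are hypothetical; how Grawler enforces the limit internally is not described here.

```go
package main

import (
	"fmt"
	"sync"
)

// visitAll calls visit for every URL, running at most limit calls at once.
func visitAll(urls []string, limit int, visit func(string)) {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	sem := make(chan struct{}, limit) // buffered channel as a counting semaphore
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit visits are already running
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot once the visit returns
			visit(u)
		}(u)
	}
	wg.Wait()
}

func main() {
	urls := []string{"/a", "/b", "/c", "/d"}
	visitAll(urls, 2, func(u string) { fmt.Println("visiting", u) })
}
```

The order of the printed lines is not deterministic, because the visits run concurrently.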

Limit the search depth

Limit the search recursion depth to 2

grawler grawl https://books.toscrape.com --max-depth 2
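A depth limit can be pictured as a breadth-first walk that stops following links once a page sits at the maximum depth. The sketch below uses an in-memory link map instead of real HTTP requests. It assumes the start page counts as depth 1, which may differ from how Grawler counts; the function name crawl is hypothetical.

```go
package main

import "fmt"

// crawl walks links breadth-first from start and returns the pages in visit
// order. links maps each page to the pages it links to. The start page is
// treated as depth 1 (an assumption), and pages deeper than maxDepth are
// never visited.
func crawl(links map[string][]string, start string, maxDepth int) []string {
	type item struct {
		url   string
		depth int
	}
	seen := map[string]bool{start: true}
	queue := []item{{start, 1}}
	var visited []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		visited = append(visited, cur.url)
		if cur.depth >= maxDepth {
			continue // do not follow links from pages at the depth limit
		}
		for _, next := range links[cur.url] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return visited
}

func main() {
	links := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/c"},
	}
	fmt.Println(crawl(links, "/", 2)) // prints [/ /a /b]
}
```

With a limit of 2 the page /c is never reached, because it is only linked from /a, which already sits at depth 2.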

Set a delay for each request

Set a delay of 500 milliseconds

grawler grawl https://books.toscrape.com --delay 500
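As a rough picture of what a per-request delay does, the following Go sketch pauses between consecutive visits. It assumes the delay applies between sequential requests; whether Grawler waits before the first request, or how the delay interacts with parallel requests, is not stated here. The name visitWithDelay is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// visitWithDelay calls visit for each URL in order, pausing for delay
// between consecutive calls. No pause is added before the first call.
func visitWithDelay(urls []string, delay time.Duration, visit func(string)) {
	for i, u := range urls {
		if i > 0 {
			time.Sleep(delay)
		}
		visit(u)
	}
}

func main() {
	urls := []string{"/", "/page-2", "/page-3"} // placeholder paths
	start := time.Now()
	visitWithDelay(urls, 500*time.Millisecond, func(u string) {
		fmt.Printf("%5dms  %s\n", time.Since(start).Milliseconds(), u)
	})
}
```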

Request a page with http basic auth

To request a website that requires HTTP basic auth, you can set the username and password as flags

grawler grawl https://books.toscrape.com --username user_xy --password mypassword

Optionally, you can omit the password. You will then be asked to enter it when you start grawling

grawler grawl https://books.toscrape.com --username user_xy
No config file found.
Grawling https://books.toscrape.com
✔ Password: █
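For readers unfamiliar with HTTP basic auth, this minimal Go sketch shows how a username and password end up on a request using the standard library's SetBasicAuth. The helper name newAuthRequest is hypothetical and the code is not taken from Grawler.

```go
package main

import (
	"fmt"
	"net/http"
)

// newAuthRequest builds a GET request that carries HTTP basic auth credentials.
func newAuthRequest(target, username, password string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	// SetBasicAuth adds an "Authorization: Basic <base64(username:password)>" header.
	req.SetBasicAuth(username, password)
	return req, nil
}

func main() {
	req, err := newAuthRequest("https://books.toscrape.com", "user_xy", "mypassword")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// No request is sent here; the header is only printed for inspection.
	fmt.Println(req.Header.Get("Authorization"))
}
```

Basic auth credentials are only base64-encoded, not encrypted, so they should be sent over HTTPS.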

Add allowed domains

By default, only the domain of the start URL is crawled; URLs from all other domains are skipped. You can allow more domains with the -a flag

grawler grawl https://quotes.toscrape.com -a example.com

You can also add multiple domains

grawler grawl https://quotes.toscrape.com -a example.com -a google.de
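A domain allow-list check can be sketched in Go as follows. It assumes an exact, case-insensitive host match, so www.example.com would not match example.com; Grawler's real matching rules may differ, and the name isAllowed is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isAllowed reports whether the host of rawURL equals one of the allowed
// domains. The comparison is exact and case-insensitive (an assumption).
func isAllowed(rawURL string, allowed []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := u.Hostname() // host without the port
	for _, d := range allowed {
		if strings.EqualFold(host, d) {
			return true
		}
	}
	return false
}

func main() {
	// The start domain plus the two domains added with -a in the example above.
	allowed := []string{"quotes.toscrape.com", "example.com", "google.de"}
	for _, u := range []string{
		"https://quotes.toscrape.com/page/2/",
		"https://example.com/",
		"https://books.toscrape.com/",
	} {
		fmt.Println(isAllowed(u, allowed), u)
	}
}
```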

Skip/Disallow urls

You can define one or more regular expressions to skip URLs that match them.

Here we skip all urls starting with https://books.toscrape.com/catalogue/category/books/ with a max depth of 2:

grawler grawl https://books.toscrape.com --disallowed-url-filters "^https://books.toscrape.com/catalogue/category/books/.*" --max-depth 2

Here we skip all URLs that contain the word category or the word art:

grawler grawl https://books.toscrape.com --disallowed-url-filters "category" --disallowed-url-filters "art"
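To show how such filters behave, here is a small Go sketch that skips a URL when any of the given regular expressions matches it. The names compileFilters and isDisallowed are hypothetical, and the sketch assumes unanchored matching as provided by Go's regexp package; it is not Grawler's code.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileFilters compiles each pattern and fails on the first invalid one.
func compileFilters(patterns []string) ([]*regexp.Regexp, error) {
	filters := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid filter %q: %w", p, err)
		}
		filters = append(filters, re)
	}
	return filters, nil
}

// isDisallowed reports whether rawURL matches at least one filter.
func isDisallowed(filters []*regexp.Regexp, rawURL string) bool {
	for _, re := range filters {
		if re.MatchString(rawURL) { // unanchored: matches anywhere in the URL
			return true
		}
	}
	return false
}

func main() {
	filters, err := compileFilters([]string{"category", "art"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, u := range []string{
		"https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
		"https://books.toscrape.com/index.html",
	} {
		fmt.Println(isDisallowed(filters, u), u)
	}
}
```

Because the match is unanchored, a short pattern such as art also matches URLs containing cart or start. Anchoring with ^ or adding path separators narrows the filter.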

Configuration

Precedence for configuration is first given to the flags set on the command-line, then to what's set in your configuration file.

Grawler first checks the command-line flag --config (path to the config file), then the file grawler.yaml in the current working directory, and finally the path $HOME/.config/grawler/conf.yaml.
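The lookup order can be sketched in Go like this. The function name resolveConfigPath and the injected exists callback are hypothetical, and the sketch assumes that a path given via --config is used without checking that the file exists, which may not match Grawler's behaviour.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveConfigPath returns the config file to use, or "" if none is found.
// exists is injected so the lookup can be tried without touching the disk.
func resolveConfigPath(flagPath, workDir, homeDir string, exists func(string) bool) string {
	if flagPath != "" {
		return flagPath // an explicit --config value always wins
	}
	candidates := []string{
		filepath.Join(workDir, "grawler.yaml"),
		filepath.Join(homeDir, ".config", "grawler", "conf.yaml"),
	}
	for _, c := range candidates {
		if exists(c) {
			return c
		}
	}
	return ""
}

func main() {
	fileExists := func(p string) bool {
		info, err := os.Stat(p)
		return err == nil && !info.IsDir()
	}
	wd, _ := os.Getwd()
	home, _ := os.UserHomeDir()
	fmt.Printf("config file: %q\n", resolveConfigPath("", wd, home, fileExists))
}
```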

You can generate a config file with default values with the init command.

A sample config file can be found here: sample-conf.yaml.

Need to know

Currently, Grawler has some trouble tracking HTTP redirect status codes.

More information about that:
