Grawler is a web crawler written in Go. It scrapes the website at the given URL, finds all relative links, and visits those URLs. The application was originally developed to build up the cache of a page and to check the availability of existing pages.
Download and use a binary suitable for your system from the prebuilt releases.
If you have Go installed (version 1.22.3 or later), you can use go install to install the application on your system:
go install github.com/robole-dev/grawler@latest
grawler grawl <url>
Example:
grawler grawl https://www.google.de
All features can be read via the help flag:
grawler -h
More examples below.
grawler grawl https://books.toscrape.com
- crawls the given URL
- searches for the href attributes of anchor tags (<a href="...">) and crawls these URLs too
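The two steps above can be pictured with a small Go sketch. It is not Grawler's actual code: the function name extractLinks and the regular expression are assumptions made for illustration, and a real crawler would normally use a proper HTML parser instead of a pattern. The sketch only shows how a relative href can be resolved against the page URL with the standard net/url package.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// hrefPattern is a deliberately simple pattern for double-quoted href
// attributes on <a> tags. It is an illustrative shortcut, not a full
// HTML parser.
var hrefPattern = regexp.MustCompile(`<a\s[^>]*href="([^"]*)"`)

// extractLinks (hypothetical helper) returns the absolute URLs of all
// anchors found in body, resolving relative hrefs against pageURL.
func extractLinks(pageURL, body string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	var links []string
	for _, m := range hrefPattern.FindAllStringSubmatch(body, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue // skip hrefs that are not valid URLs
		}
		links = append(links, base.ResolveReference(ref).String())
	}
	return links, nil
}

func main() {
	body := `<a href="catalogue/page-2.html">next</a> <a class="x" href="/index.html">home</a>`
	links, err := extractLinks("https://books.toscrape.com/index.html", body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

Each resolved URL would then be queued for a visit of its own, which is what makes the crawl recursive.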
grawler grawl https://books.toscrape.com -o out.csv
Set to 8 requests in parallel
grawler grawl https://books.toscrape.com -l 8
Limit the search recursion depth to 2
grawler grawl https://books.toscrape.com --max-depth 2
Set a delay of 500 milliseconds
grawler grawl https://books.toscrape.com --delay 500
To request a website that requires HTTP basic auth, you can set the username and password as flags
grawler grawl https://books.toscrape.com --username user_xy --password mypassword
Optionally, you can omit the password. You will then be asked to enter it when you start grawling
grawler grawl https://books.toscrape.com --username user_xy
No config file found.
Grawling https://books.toscrape.com
✔ Password: █
By default, only the domain of the start URL is allowed to be crawled. URLs from other domains are skipped.
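The domain restriction can be thought of as a host comparison, sketched below in Go. This is an illustration under assumptions, not Grawler's implementation: the function isAllowed and its extra parameter (a list of additionally allowed domains) are hypothetical, and the sketch assumes exact host matching, so subdomains are treated as different domains.

```go
package main

import (
	"fmt"
	"net/url"
)

// isAllowed (hypothetical helper) reports whether rawURL's host equals the
// start URL's host or one of the extra allowed domains. Exact host
// comparison is an assumption of this sketch.
func isAllowed(rawURL, startURL string, extra []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	start, err := url.Parse(startURL)
	if err != nil {
		return false
	}
	if u.Hostname() == start.Hostname() {
		return true
	}
	for _, d := range extra {
		if u.Hostname() == d {
			return true
		}
	}
	return false
}

func main() {
	start := "https://quotes.toscrape.com"
	for _, candidate := range []string{
		"https://quotes.toscrape.com/page/2/",
		"https://example.com/",
	} {
		fmt.Println(candidate, isAllowed(candidate, start, nil))
	}
}
```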
You can allow more domains with the -a flag
grawler grawl https://quotes.toscrape.com -a example.com
You can also add multiple domains
grawler grawl https://quotes.toscrape.com -a example.com -a google.de
You can define one or more regular expressions to skip URLs that match them.
Here we skip all URLs starting with https://books.toscrape.com/catalogue/category/books/ with a max depth of 2:
grawler grawl https://books.toscrape.com --disallowed-url-filters "^https://books.toscrape.com/catalogue/category/books/.*" --max-depth 2
Here we skip all URLs which contain the word category and the word art:
grawler grawl https://books.toscrape.com --disallowed-url-filters "category" --disallowed-url-filters "art"
Precedence for configuration is first given to the flags set on the command line, then to what's set in your configuration file.
Grawler looks first for the command-line flag --config (path to the config file), then for the file grawler.yaml in the current working directory, and lastly at the path $HOME/.config/grawler/conf.yaml.
You can generate a config file with default values with the init command.
A sample config file can be found here: sample-conf.yaml.
Currently, we have some trouble tracking HTTP redirect status codes.
More information about that: