The Quant crawler is a handy tool that lets you easily copy a website. This feature has been available via the Quant CLI since last year, but we recently made quite a few updates (see the April 2021 changelog for more details). For a quick-and-dirty usage example, check out How to freeze and take static archive of your old site. In this post, we’ll go into more detail about using the Quant CLI crawler.
Why use the crawler?
Web crawlers are useful for a number of reasons. Our What is a web crawler? blog post breaks down many of them, but we generally find that people are most interested in the following:
Archival: If you no longer want to support a CMS or other backend web technology and don’t need further content updates, you can archive your site, host the static archived version, and decommission the old tech.
Failover: If you want to be able to switch to a static copy of your site if your website goes down, you can regularly crawl your site and keep a version on hand for this. For example, you might crawl daily. The static content will be slightly stale but it’s better than having no site!
Backups: This is pretty much the same as having failover copies. One nice thing about taking regular snapshots is that you can see the content revisions in Quant.
Using Quant CLI
If you are a developer, then command-line tools are your best friends. For non-developers, we will be adding a way to run the crawler in the dashboard at some point. Let us know if this is something you are interested in.
Quant CLI can be a very useful tool when working with Quant. Here are some resources to learn more:
- Automate static deployments with Quant CLI
- Deploying to Quant with GitHub Actions
- Quant CLI GitHub repo
- Get started with the Quant CLI documentation
Quant CLI crawler options
The Quant crawler leverages simplecrawler, which is a great web crawler! Once you have the CLI set up (check out the resources above), it’s easy to use Quant’s crawler. The simplest case is outlined in the "Use the quant-cli tool to crawl your website" section of How to freeze and take static archive of your old site. Note that you do need to include http/https in your domain.
quant crawl https://yourdomain.com
Simplecrawler has a ton of configuration options, but we’ve simplified things a lot to support the most common use cases. Let’s touch upon the crawler parameters you can use:
concurrency (n) - Crawl concurrency
Maps to simplecrawler’s maxConcurrency and defaults to 4. If you want more requests to run at the same time, you can increase this. This can be useful if you have a large site and want it crawled faster. But higher concurrency puts more load on your site, so be careful with this setting. Example:
quant crawl --concurrency 10 https://yourdomain.com
cookies (c) - Accept cookies during the crawl
Cookies are a normal part of web browsing, but you often don’t need them for crawling. Cookies are frequently used for tracking and analytics and can slow things down, so saving cookies defaults to false. If your website requires cookies to display its content correctly, you can enable this option.
quant crawl --cookies true https://yourdomain.com
extra-domains (e) - CSV of additional host names to fan out to
Added to simplecrawler’s domainWhitelist along with the main domain. If you want to include additional domains when crawling, you can with this option. This might be helpful if you have one or more associated websites, for example, a main site (www.example.com) along with a blog site (blog.example.com) that you want to combine. You provide a comma-delimited list of domains without http/https.
quant crawl --extra-domains "blog.yourdomain.com,shop.yourdomain.com" https://yourdomain.com
interval (i) - Crawl interval (in milliseconds)
You want to make sure that you aren’t causing undue stress on the website you are crawling. The default is 200ms, but you can tweak this to your needs. Knowing your website’s size, traffic, and load profile will help you decide whether to change this. For example, to increase the interval to 1 second, you could use the following.
quant crawl --interval 1000 https://yourdomain.com
no-interaction - No user interaction
This one is pretty self-explanatory and defaults to false. If you don’t want to be prompted while the crawler is running, you can enable this option. This is helpful when running the crawler in a cron or other automated fashion.
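For instance, if you wanted the daily crawls mentioned in the failover scenario above, a crontab entry might look like the following sketch. The schedule, working directory, and log path here are illustrative assumptions, not something the CLI requires.

```shell
# Hypothetical crontab entry: crawl the site every day at 2am with no prompts.
# Assumes quant is on the PATH and your project config lives in /opt/mysite.
0 2 * * * cd /opt/mysite && quant crawl --no-interaction true https://yourdomain.com >> /var/log/quant-crawl.log 2>&1
```

Because --no-interaction suppresses prompts, the job can run unattended and you can review the log afterwards.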
quant crawl --no-interaction true https://yourdomain.com
rewrite (r) - Rewrite host patterns
If you want the host domains stripped out of the URLs in the content, you can set this option. The paths will be relative, e.g. https://yourdomain.com/some/content/link => /some/content/link
quant crawl --rewrite true https://yourdomain.com
robots - Respect robots
Maps to simplecrawler’s respectRobotsTxt. Sometimes the robots.txt file or a page’s <meta> tags prohibit crawling, which can prevent content or assets from being copied. This is why the crawler’s default for this setting is false. If you want to honor the robots.txt file and <meta> configuration, you can change this to true.
quant crawl --robots true https://yourdomain.com
seed-notfound - Send the content of unique not found pages to Quant
If you want 404 pages to be saved when resources aren’t found during the crawl, you can use this option. Then you can check the Quant dashboard for these to see what content you need to fix.
quant crawl --seed-notfound true https://yourdomain.com
size (s) - Crawl resource buffer size in bytes
Maps to simplecrawler’s maxResourceSize and defaults to 256MB. If you want to increase the buffer size, you can override this. For example, for 512MB, you could use:
quant crawl --size 536870912 https://yourdomain.com
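The byte value above is just 512 × 1024 × 1024; you can compute it in your shell rather than working it out by hand:

```shell
# 512 MB expressed in bytes: 512 * 1024 * 1024 = 536870912
echo $((512 * 1024 * 1024))
# → 536870912
```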
skip-resume - Start a fresh crawl ignoring resume state
Sometimes you just want to start over! It’s much like restarting your computer or browser when there’s a glitch. You can restart your crawl from scratch using this option.
quant crawl --skip-resume true https://yourdomain.com
urls-file - JSON file containing array of URLs to add to the queue
If you want to crawl just a subset of URLs from your website, you can create a JSON file with your list and feed that into the crawler. This is nice for testing out the crawler, as well as for situations where you don’t need the whole website synced. Note that you have to use skip-resume for this option to take effect. For example:
quant crawl --skip-resume true --urls-file myurls.json https://yourdomain.com
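As a sketch of what that file can look like, here is a hypothetical myurls.json containing a flat JSON array of full URLs (the file name and the specific URLs are assumptions for illustration):

```shell
# Create a urls file: a JSON array of full URLs to seed the crawl queue.
cat > myurls.json <<'EOF'
[
  "https://yourdomain.com/",
  "https://yourdomain.com/about",
  "https://yourdomain.com/blog/my-first-post"
]
EOF
```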