The technology world is full of jargon and acronyms and funny words. In this Tech Speak series, we try to explain these in simple terms as well as provide additional information if you want to go deeper.
Term: Web Crawler
AKA: Web Spider
One-liner: A web crawler or spider is a software tool that visits websites and gathers information from web pages.
Short description: A web crawler or web spider is software that visits websites and gathers data from their web pages. The crawler goes from one page to another by following links, just like you would when using a web browser. Search engines like Google and Bing use crawlers, but crawlers are useful for many other things too: checking a website for broken links, detecting plagiarism on web pages, creating a backup copy of a website, and aggregating content like news and events.
Example use case: I want to go on a vacation in Australia and want to research different flights. While I could visit the website of each major airline I know about and search flight info and prices on each one individually, Google has made it simpler for me. I can just search for “direct flight from San Francisco to Sydney” and it will show an aggregated list of flights to choose from. Google has this data because it has used a web crawler to gather the information from the airline websites. This makes my research much easier. Once I’ve narrowed down a flight, I can go to that airline’s website and book it.
Keep in mind: Web crawlers can be used for a variety of things. They are a key technology used for search engines but you can also use them for something as simple as making a copy of a website.
What are web crawlers used for?
Web crawlers have been around since 1994, when WebCrawler, the first search engine to index the full text of web pages, was launched. Since then, big names in search like Google and Bing have been built with sophisticated crawlers as part of their search technology. But crawlers/spiders can be used for more than just search engines. Here are some use cases:
Aggregating: A web crawler can find specific types of content such as news articles, job postings, and events so they can be aggregated and listed on another website or used for internal analysis.
Analyzing: You can analyze a website to check for broken links, plagiarism, readability, and search engine optimization (SEO). This type of analysis software uses a crawler to navigate through all the web pages.
Archiving: Crawlers are a great tool for creating a copy of a website for safekeeping. Whether you can fully use that copy depends on whether the website is completely static or relies on dynamic features.
Monitoring: Web crawlers can be used to track changes or certain data on a website like prices, new content, and announcements. People can then get alerts so they can keep on top of the latest information.
Scraping: A crawler can be extended to “scrape” a website for certain information. For example, if you wanted to find competitive product information from many websites, you could use a crawler to gather the information for you for analysis.
Searching: A core part of a search engine is a web crawler or spider. The data gathered from crawling is organized and put into a “search index” and then used for showing search results.
Training: Web crawlers are used to gather data from the internet for training artificial intelligence (AI) software.
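To make the “Analyzing” use case above a little more concrete, here is a minimal sketch of the first step of a broken-link checker: collecting every link on a page. It uses only the Python standard library; the names `LinkCollector` and `extract_links` are our own, chosen for illustration.

```python
# Sketch: collect every <a href="..."> on a page, the first step of a
# broken-link checker. Uses only the Python standard library.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Records the href value of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list[str]:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# To actually test each link you would then issue a request, e.g. with
# urllib.request.urlopen(url), and flag any that return a 404 status.
```

A real checker would repeat this for every page the crawler visits and report the links that fail.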
What’s the difference between a web crawler and web scraper?
Usually when someone says “web crawler” or “web spider” they mean the same thing. It’s software that is used to navigate through a website from one page to another and gather content or data. Often they are associated with search engines.
A “web scraper” is software that pulls specific information from web pages. It might grab an entire page or just specific details like prices or contact information. A web scraper can also be paired with a crawler to pull information from more than one web page.
Since a web scraper often grabs data from more than one page and a web crawler can be configured to grab information as it’s crawling a site, it’s easy to see how the definitions of these software tools are blurry.
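To show how narrow a scraper's job can be compared with a crawler's, here is a tiny sketch that pulls just one kind of data (dollar prices) out of a page's HTML. Real scrapers usually use a proper HTML parser rather than a regular expression, and the function name `scrape_prices` is ours, purely for illustration.

```python
# Sketch: a "scraper" that targets one specific kind of data -- dollar
# amounts -- in a page's HTML. A crawler would supply it with pages.
import re

def scrape_prices(html: str) -> list[str]:
    """Return every dollar amount found in the page, in order."""
    return re.findall(r"\$\d+(?:\.\d{2})?", html)
```

Point the crawler at many product pages, run something like this on each one, and you have the competitive-pricing scenario described under “Scraping” above.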
How does a web crawler work?
We won’t go into the deep technical inner workings of web crawler technology, but we will explain some of the important aspects. Generally speaking, a web crawler (or spider) methodically navigates through a website, going through all the links, in order to gather some sort of information from each web page. Here are some key concepts of a web crawler:
Seeding: The term seeding sounds fancier than it is. You simply let the crawler know what websites it needs to crawl. It might be one site or a thousand or any other number you need.
Navigating: The crawler starts with the initial link provided during “seeding” and then follows the links on each page. The crawler can be set up to follow only links to pages on the same website and ignore links to “external” websites. It can also be configured to navigate only a certain number of “jumps” away from the starting page. For example, if it takes 6 “clicks” to go from the home page to some deeply buried article on a website, a crawler configured to go only 5 “clicks” deep would ignore that page.
Parsing: The metadata of each web page needs to be understood such as the title of the page and other information used for search engines but not shown directly on the page. The crawler must parse out this metadata and associate it with the page’s information.
Downloading: Depending on the purpose of the crawler, it might download each web page as it navigates or it might be paired with a scraper to extract specific data within the page and just grab that.
Structuring data: The data the crawler collects will be stored in some structured way and handed off to another system, like a database or file system, for storage. This data can be used later for searching, research, or other uses.
Validating: Web crawlers have some built-in validation, like checking for duplicate pages so they don’t store redundant data. Some website analysis software will report these duplicate pages so you can remove them.
Rules following: Crawlers are supposed to respect certain conventions. For example, a website can provide a “robots.txt” file that specifies which crawlers are allowed to visit the site and which pages they should ignore. Badly behaved crawlers will obviously ignore such rules, but popular search engine crawlers follow them.
Performance handling: Good web crawlers are polite to the websites they crawl and limit their request rate so they don’t overwhelm a site and cause it to crash.
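The concepts above can be tied together in a compact sketch. This is not how any particular search engine does it, just a minimal illustration: the network access is passed in as a `fetch` function (an assumption of ours, so the logic can be shown without real requests), and in real use `fetch` would wrap something like `urllib.request`. A polite crawler would also consult robots.txt first, for which Python's standard library offers `urllib.robotparser`.

```python
# Sketch of a breadth-first crawler: seeding, navigating (same-site
# check and depth limit), dedup, downloading, and a politeness delay.
# `fetch(url)` is injected and must return (html, links_on_that_page).
import time
from urllib.parse import urljoin, urlparse

def crawl(seeds, fetch, max_depth=2, delay=0.0):
    visited = set()                          # validating: skip duplicates
    frontier = [(url, 0) for url in seeds]   # seeding: the starting links
    pages = {}                               # structuring data: url -> html
    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        html, links = fetch(url)             # downloading the page
        pages[url] = html
        time.sleep(delay)                    # performance handling: be polite
        for link in links:                   # navigating: follow each link
            absolute = urljoin(url, link)
            # stay on the same website; ignore "external" links
            if urlparse(absolute).netloc == urlparse(url).netloc:
                frontier.append((absolute, depth + 1))
    return pages
```

For example, seeded with a site's home page, `crawl` visits each same-site page it can reach within `max_depth` jumps and returns their contents, ignoring links that lead to other websites.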
Web crawler resources
Learn more about web crawlers and related concepts by checking out these resources:
- Web crawler (Wikipedia)
- Web scraping (Wikipedia)
- What is a robots.txt file (Moz)
- The Wayback Machine (Internet Archive)
- How Internet Search Engines Work (HowStuffWorks)
- What is Web Scraping and How to Use It? (GeeksforGeeks)
- Top 11 open source web crawlers - and one powerful web scraper (Apify)
- Top 20 Web Crawling Tools to Scrape the Websites Quickly (Octoparse)
- A collection of awesome web crawler, spider and resources in different languages (BruceDone on GitHub)