What is data scraping?
Data scraping, in its most general form, refers to a technique in which a computer program extracts data from output generated by another program. Data scraping most commonly takes the form of web scraping, the process of using an application to extract valuable information from a website.
What are different types of web scraping? Why scrape website data?
Scraper bots can be designed for many purposes, such as:
- Content scraping - a website’s content is pulled in order to replicate the unique advantage of a particular product or service that relies on content. Take a restaurant review site, for instance; a competitor could scrape all the reviews, then reproduce the content on their own website, pretending the content is original (and reaping the benefits).
- Price scraping - by scraping pricing data, competitors are able to aggregate information about their competition. This can allow them to formulate a unique advantage, namely by undercutting their competitors, thus taking their business.
- Contact scraping - many websites contain email addresses and phone numbers in plaintext. By scraping pages such as online employee directories, a scraper can aggregate contact details to be used in bulk mailing lists, robocalls, or malicious social engineering attempts. This is one of the primary methods used by both spammers and scammers to find new targets.
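What is the difference between data scraping and data crawling?
Crawling refers to the process that large search engines like Google undertake when they send their robot crawlers, such as Googlebot, out into the network to index Internet content. Scraping, on the other hand, is typically structured specifically to extract data from a particular website.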
Here are 3 differences in behavioral practice between scraper bots and web crawler bots:
| | Honesty/transparency | Advanced maneuvers | Respecting robots.txt |
| --- | --- | --- | --- |
| Scraper bot | Will pretend to be web browsers to get past any efforts to block scrapers. | Can take advanced actions such as filling out forms in order to access gated information. | Typically has no regard for robots.txt, meaning they can pull content explicitly against the website owner’s wishes. |
| Crawler bot | Will indicate its purpose, wouldn’t attempt to trick a website into thinking the crawler is something it’s not. | Will not try to access gated parts of a website. | Respects robots.txt, meaning they abide by the website owner’s wishes around what data to parse vs. what areas of the website to avoid. |
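To make the robots.txt column concrete, here is a minimal sketch of the check a well-behaved crawler might run before requesting a page. It uses Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleCrawler` user agent, the `example.com` URLs, and the helper names `build_parser` and `may_fetch` are all assumptions made for illustration, not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would first download this
# file from the site's /robots.txt path instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
# Public page: no Disallow rule matches, so the crawler may proceed.
print(may_fetch(parser, "ExampleCrawler", "https://example.com/reviews/"))
# Gated area: the Disallow rule matches, so a crawler that respects it skips the page.
print(may_fetch(parser, "ExampleCrawler", "https://example.com/members/list"))
```

A scraper bot as described in the table would simply skip this check, or ignore its result, and request the disallowed path anyway.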
How are websites scraped?
The process of web scraping is fairly simple, though the implementation can be complex. We can summarize the process in 3 steps:
- First, the piece of code used to pull the information (the scraper bot) sends an HTTP GET request to a specific website.
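- Once the website responds, the scraper parses the HTML document for a specific pattern of data.
- Once the data is extracted, it is converted into whatever specific format the scraper bot’s author designed.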
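The three steps above can be sketched with Python's standard library alone. This is an illustrative outline under stated assumptions rather than a description of any particular scraper: the `class="price"` pattern, the sample markup, and the names `fetch_html`, `PriceParser`, `extract_prices`, and `to_json` are all hypothetical. The demonstration parses a hard-coded HTML string, so no network request is made when the block runs.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Step 1: send an HTTP GET request and return the response body as text."""
    request = Request(url, method="GET")
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class PriceParser(HTMLParser):
    """Step 2: collect the text of every element whose class list contains "price".

    Simplification: any closing tag ends the capture, so nested markup inside
    a price element is not handled.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html: str) -> list:
    """Run the parser over an HTML document and return the matched strings."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


def to_json(prices: list) -> str:
    """Step 3: convert the extracted data into the format the author wants (JSON here)."""
    return json.dumps({"prices": prices})


# Hypothetical page fragment standing in for the response from fetch_html(url).
SAMPLE_HTML = (
    '<ul><li class="price">$10.00</li>'
    '<li class="name">Soup</li>'
    '<li class="price">$12.50</li></ul>'
)
print(to_json(extract_prices(SAMPLE_HTML)))  # {"prices": ["$10.00", "$12.50"]}
```

Because the parser depends on the `price` class appearing exactly where it expects, even a small change to the page's markup would make it return an empty list.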
Typically, companies do not want their unique content to be downloaded and reused for unauthorized purposes, so they might try not to expose all data via a consumable API or other easily accessible resource. Scraper bots, on the other hand, are interested in getting website data regardless of any attempt at limiting access. As a result, a cat-and-mouse game exists between web scraping bots and various content protection strategies, with each trying to outmaneuver the other.
How is web scraping mitigated?
Smart scraping strategies require smart mitigation strategies. Methods of limiting exposure to data scraping efforts include the following:
- Rate limit requests - for a human visitor clicking through a series of webpages on a website, the speed of interaction with the website is fairly predictable; you’ll never have a human browsing 100 webpages a second, for example. Computers, on the other hand, can make requests that are orders of magnitude faster than a human, and novice data scrapers may use unthrottled scraping techniques to attempt to scrape an entire website very quickly. By rate limiting the maximum number of requests a particular IP address can make over a given window of time, websites are able to protect themselves from exploitative requests and limit the amount of data scraping that can occur within that window.
- Modify HTML markup at regular intervals - data scraping bots rely on consistent formatting in order to effectively traverse website content and parse out data. One method of interrupting this workflow is to regularly change elements of the HTML markup. By nesting HTML elements, or changing other aspects of the markup, simple data scraping efforts will be hindered or thwarted. For instance, some websites will randomize some form of content protection modification every single time a webpage is rendered; others may update their front-end every few weeks to prevent longer-term data scraping efforts.
- Use challenges for high-volume requesters - another useful step in slowing content scrapers is requiring website visitors to answer a challenge that’s difficult for a computer to surmount. While a human can reasonably answer the challenge, a headless browser* most likely can’t, certainly not across many instances of the challenge.
- Another less common mitigation method calls for embedding content inside media objects like images. Because the content does not exist in a string of characters, copying the content is far more complex, requiring optical character recognition (OCR) to pull the data from an image file.
*A headless browser is a type of web browser, much like Chrome or Firefox, but it doesn’t have a visual user interface by default, allowing it to move much faster than a typical web browser. By essentially running at the level of a command line, a headless browser is able to avoid rendering entire web applications. Data scrapers write bots that use headless browsers to request data more quickly, as there is no human viewing each page being scraped.
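As an illustration of the rate limiting idea in the list above, the following is a minimal in-memory sketch of a per-IP sliding-window limiter. The class name `SlidingWindowRateLimiter`, the limit of 3 requests per second, and the IP address `203.0.113.7` (from a range reserved for documentation) are assumptions chosen for the example. A production deployment would more likely enforce limits at a proxy or in a shared data store.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowRateLimiter:
    """Allow at most max_requests per IP address within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP address to the timestamps of its recently allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True and record the request if the IP is under its limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject, delay, or challenge the client
        hits.append(now)
        return True


# Five requests from one address, 0.1 seconds apart, against a limit of
# 3 requests per second. Timestamps are passed in to keep the demo deterministic.
limiter = SlidingWindowRateLimiter(max_requests=3, window_seconds=1.0)
results = [limiter.allow("203.0.113.7", now=0.1 * i) for i in range(5)]
print(results)  # [True, True, True, False, False]
```

Rejected requests are not recorded here, so a client that keeps hammering the site does not extend its own lockout. Whether to count them is a policy choice. A limiter like this could also be used to decide when to present the challenge described above instead of refusing the request outright.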
How is scraping stopped completely?
The only way to guarantee a full stop to web scraping is to stop putting content on a website entirely. However, using an advanced bot management solution can help websites block access by scraper bots.
Protect against scraping attacks with Cloudflare
Cloudflare Bot Management uses machine learning and behavioral analysis to identify malicious scraping activity, protecting unique content and preventing bots from abusing a web property. Similarly, Super Bot Fight Mode is designed to help smaller organizations defend against scrapers and other malicious bot activity, while giving them more visibility into their bot traffic.