The world of frontend testing and web scraping is constantly evolving. The more web technologies there are, the more complex websites become. And when things become complex, we need to test them.
Puppeteer is an open-source library developed by Google, built for the purpose of automating and simplifying frontend tests and development. It is based on Chromium, the open-source browser project behind Google Chrome, and can be 100% remotely controlled, allowing web developers to write and maintain simple, fully automated tests.
Testing? Well, yes. Testing and web scraping are closely connected. When most frontend developers think about web testing, the word “Selenium” comes to mind. Selenium was the market leader in these fields until recently, but as with every empire that has ruled the world, there comes a time for a new king to rise.
Both are open-source web-testing libraries, and both are really good options for web scraping.
The main difference is that Puppeteer is a Node.js library and supports only Chrome automation. It does not let you drive other types of browsers or write your automation in a language other than JavaScript running on Node.js. Some open-source libraries are being written to overcome this limitation, but I wouldn’t recommend using them, since it’s like going to a business meeting with a translator. You might get the deal, but it will take more time and the chances of misunderstandings are higher.
Selenium was the world leader in the testing and automation industries for many years. It has many advantages, including the ability to test websites in multiple browser types. It relies on external drivers and basically “drives” the browser according to the developer’s commands.
Puppeteer is based on Chromium, the open-source project that Google Chrome is built on, and therefore works only with Chrome. So, if you are trying to check your website’s compatibility with different browsers, Selenium is still your best bet.
With Puppeteer, just as its name implies, you have amazing and simple control over the world’s most popular browser. Since Puppeteer talks to Chromium directly through the DevTools Protocol, the possibilities for controlling it are almost endless. Any operation you have in mind is probably supported: taking a screenshot, saving a webpage as a PDF, working with async operations, changing your location, and many more.
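To make two of those operations concrete, here is a minimal sketch that saves a page as both a screenshot and a PDF. The `fileBaseFor` helper and the output naming scheme are my own assumptions for illustration, not something Puppeteer provides; the Puppeteer calls themselves (`launch`, `newPage`, `goto`, `screenshot`, `pdf`) are part of its public API.

```javascript
// Hypothetical helper (not part of Puppeteer): derive a safe file name from a URL.
function fileBaseFor(url) {
  const { hostname, pathname } = new URL(url);
  return (hostname + pathname)
    .replace(/[^a-z0-9]+/gi, "-") // collapse anything non-alphanumeric into "-"
    .replace(/^-+|-+$/g, "");     // trim leading and trailing dashes
}

// Capture one page as a full-page PNG and as an A4 PDF.
async function capture(url) {
  // Loaded lazily, so the helper above works even without a browser installed.
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch(); // headless by default
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    const base = fileBaseFor(url);
    await page.screenshot({ path: `${base}.png`, fullPage: true });
    await page.pdf({ path: `${base}.pdf`, format: "A4" });
    return base;
  } finally {
    await browser.close(); // always release the browser process
  }
}

// Usage (requires `npm install puppeteer`):
// capture("https://example.com/docs/intro").then(console.log);
```

Note that PDF generation is tied to headless mode, so this sketch keeps the default launch settings.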
But the three main reasons that I really like it are page interceptions, full browser automation, and headless browsing.
Page interception in Puppeteer is the ability to intercept each and every network request initiated by the webpage and then log it, change it, or, most importantly, ignore it. This is key to reducing your proxy costs, since every request you skip is traffic that never goes through the proxy.
If you are only interested in certain ‘div’ or ‘span’ elements, if you want to keep script files from loading, or if you want only text items to be loaded, interceptions will be your best friend.
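Used efficiently, interceptions can reduce your data usage by up to 90%, depending on how much of a page’s weight comes from the images, scripts, and other assets you skip.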
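As a rough sketch of what such an interception could look like, the snippet below aborts requests for resource types that a text-only scrape usually does not need. The particular set of blocked types is an assumption you should tune per site (add "script" only if the page does not need JavaScript to render its content), and `shouldBlock` and `fetchLight` are names I made up for this example.

```javascript
// Resource types to skip. This set is an assumption; adjust it for your target site.
const BLOCKED_TYPES = new Set(["image", "media", "font", "stylesheet"]);

// Pure decision function, kept separate so it is easy to test and tweak.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Load a page while ignoring the blocked resource types, then return its HTML.
async function fetchLight(url) {
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Once interception is enabled, every request must be continued or aborted.
    await page.setRequestInterception(true);
    page.on("request", (request) => {
      if (shouldBlock(request.resourceType())) {
        request.abort();    // ignore it: nothing is downloaded
      } else {
        request.continue(); // let it through unchanged
      }
    });
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return await page.content();
  } finally {
    await browser.close();
  }
}

// Usage (requires `npm install puppeteer`):
// fetchLight("https://example.com").then((html) => console.log(html.length));
```

The same `request` handler is also the natural place to log each URL or to modify a request before continuing it.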
The second thing I love about Puppeteer is that it actually uses a real Chromium browser. It is not faking anything or relying on external driver libraries. Just pure browser operations, now available through this wonderful library. Are you concerned about the extra resources needed to run a real browser? Headless mode is the solution here.
Browsing in headless mode is the way machines use the internet. It runs a full browser but never opens a visible window to display the page. As you would expect, it is faster, saves a lot of CPU, GPU, and memory resources, and is necessary if you want to run your scraping or tests in environments without a display, such as Docker containers.
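Here is a small sketch of launching in headless mode, with an optional set of Chromium flags that are commonly used inside Docker containers. The `inContainer` option and the `launchOptions` and `pageTitle` names are hypothetical, invented for this example; whether you actually need the sandbox flags depends on how your container is set up.

```javascript
// Build Puppeteer launch options. `inContainer` is a hypothetical switch for this sketch.
function launchOptions({ headless = true, inContainer = false } = {}) {
  const args = [];
  if (inContainer) {
    // Often required when Chromium runs as root in a container.
    // Disabling the sandbox weakens isolation, so only do it for trusted pages.
    args.push("--no-sandbox", "--disable-setuid-sandbox");
    // Docker's default /dev/shm is small; this makes Chromium use /tmp instead.
    args.push("--disable-dev-shm-usage");
  }
  return { headless, args };
}

// Open a page with the chosen options and return its title.
async function pageTitle(url, options = {}) {
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch(launchOptions(options));
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    await browser.close();
  }
}

// Usage (requires `npm install puppeteer`):
// pageTitle("https://example.com", { inContainer: true }).then(console.log);
```

Passing `{ headless: false }` instead opens a visible window, which is handy while debugging a script locally.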
In both testing and web scraping, you get data from a website using automation. The main difference is, of course, the purpose of the operation. Web testing is usually done in-house to make sure a website or the frontend part of an app is working well. Web scraping, on the other hand, extracts data from a specific website, and the person running the operation is usually not the owner of that website. Therefore, web scraping might need some more elegance in order not to be detected, and probably a good proxy.
When scraping websites, it is usually very easy for the site to detect the operation. Every time we scrape data from a website, our browser leaves a “fingerprint” with details about the operating system, time zone, language, and more. The most important detail, the one on which most websites base their blocking, is your IP address.
This user-specific detail is very easy to detect and cannot be faked from within the browser. The only way to hide it is by using a proxy service or a VPN.
When you configure Puppeteer to use a proxy service, the target website sees the proxy server’s IP address on each request instead of yours. Therefore, even if that IP address gets blocked, you can always request a new one.
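One common way to wire this up is Chromium’s `--proxy-server` flag, plus `page.authenticate` when the proxy asks for credentials. The sketch below assumes a generic HTTP proxy; the host, port, and credential fields are placeholders for whatever your provider gives you, and `proxyArg` and `fetchViaProxy` are names made up for this example.

```javascript
// Build the Chromium flag for a proxy. The shape of `proxy` is an assumption of this sketch.
function proxyArg({ host, port, protocol = "http" }) {
  return `--proxy-server=${protocol}://${host}:${port}`;
}

// Load a page through the given proxy and return its HTML.
async function fetchViaProxy(url, proxy) {
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch({ args: [proxyArg(proxy)] });
  try {
    const page = await browser.newPage();
    if (proxy.username) {
      // Answers the proxy's authentication challenge for this page.
      await page.authenticate({
        username: proxy.username,
        password: proxy.password,
      });
    }
    await page.goto(url);
    return await page.content();
  } finally {
    await browser.close();
  }
}

// Usage with placeholder values (requires `npm install puppeteer`):
// fetchViaProxy("https://example.com", {
//   host: "proxy.example.com",
//   port: 8080,
//   username: "user",
//   password: "secret",
// }).then((html) => console.log(html.length));
```

Because the proxy is set at launch time, switching to a fresh IP address in this sketch means launching a new browser with a different `proxy` object.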
Puppeteer is a rising force in the world of testing and web scraping. It was recently upgraded to version 2.0, which adds more features such as changing the time zone.
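As a quick illustration of the time zone feature, the sketch below uses `page.emulateTimezone` and also sets an `Accept-Language` header, since time zone and language are both part of the fingerprint mentioned earlier. The `isValidTimezone` check and the function names are my own additions for this example, and the values passed in are only samples.

```javascript
// Check a time zone ID with the built-in Intl API before handing it to the browser.
function isValidTimezone(timezoneId) {
  try {
    new Intl.DateTimeFormat("en-US", { timeZone: timezoneId });
    return true;
  } catch (err) {
    return false; // Intl throws a RangeError for unknown time zones
  }
}

// Open a page with an emulated time zone and language, and report what the page sees.
async function timezoneSeenByPage(url, timezoneId, language = "en-US") {
  if (!isValidTimezone(timezoneId)) {
    throw new Error(`Unknown time zone: ${timezoneId}`);
  }
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.emulateTimezone(timezoneId); // available since Puppeteer 2.0
    await page.setExtraHTTPHeaders({ "Accept-Language": language });
    await page.goto(url);
    // Runs inside the page, so it reflects the emulated time zone.
    return await page.evaluate(
      () => Intl.DateTimeFormat().resolvedOptions().timeZone
    );
  } finally {
    await browser.close();
  }
}

// Usage (requires `npm install puppeteer`):
// timezoneSeenByPage("https://example.com", "Europe/Berlin", "de-DE").then(console.log);
```

For scraping, it usually makes sense to pick a time zone and language that match the location of the proxy you are using.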
If you are looking for a decent tool to learn and use for your scraping or testing needs, I highly recommend it.