Puppeteer.JS – Using Headless Chrome for Site Crawling

Puppeteer is a Node.js library that lets you automate Chrome.
Headless Chrome runs the browser without a visible window: pages are still loaded and their scripts executed, but nothing is drawn on screen. That may sound silly, but it has many useful applications. For example, you could write a test script that checks that your website still works correctly.

Installation

npm i puppeteer
# or
yarn add puppeteer

Usage

We are going to look at a quick example of how to log in to a site and then perform some operation on it.

Initialize Puppeteer

You need to run Puppeteer inside an async function, because its API is promise-based: you do not know in advance how long it will take until Chrome has started.

Calling puppeteer.launch() starts the browser. The headless option defaults to true; for debugging purposes, set it to false so you can watch what the browser is doing.
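
As a minimal sketch of this step, the snippet below launches Chrome and opens a tab. The helper names launchOptions and startBrowser are made up for this example; only puppeteer.launch() and browser.newPage() come from Puppeteer itself.

```javascript
// Build the launch options: headless by default, visible window when debugging.
// (launchOptions and startBrowser are hypothetical helper names.)
function launchOptions(debug = false) {
  return { headless: !debug };
}

async function startBrowser(debug = false) {
  // Loaded here so the helper above can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(launchOptions(debug));
  const page = await browser.newPage(); // a fresh tab to work with
  return { browser, page };
}

// Usage (inside an async function):
//   const { browser, page } = await startBrowser(true); // true = show the window
//   ...
//   await browser.close();
```

Remember to call browser.close() when you are done, otherwise the Chrome process keeps running in the background.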

Login

To log in to the site we need three things:
* The URL for the Login Page
* CSS Selector for the Username Field
* CSS Selector for the Password Field

To obtain the selectors, you can use Chrome DevTools (F12). Right-click the field's HTML element in the Elements panel and choose Copy > Copy selector.
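
With those three pieces in hand, the login could look roughly like the sketch below. The URL and both selectors are placeholders, not values from a real site, and submitting the form by pressing Enter is an assumption that holds for most login forms but not all of them.

```javascript
// Placeholder values: replace them with your login URL and the selectors
// you copied from DevTools.
const LOGIN = {
  url: 'https://example.com/login',
  usernameSelector: '#username',
  passwordSelector: '#password',
};

// `page` is a Puppeteer page, e.g. the result of browser.newPage().
async function login(page, username, password, config = LOGIN) {
  await page.goto(config.url);
  await page.type(config.usernameSelector, username);
  await page.type(config.passwordSelector, password);
  // Assumption: Enter submits the form. Start waiting for the navigation
  // before pressing the key so the page load is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.keyboard.press('Enter'),
  ]);
}
```

If your form only submits through a button, you would need a fourth selector and could click it with page.click() instead of pressing Enter.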

Fetch all Links

Now that you are logged in, you can navigate to any page of the site and fetch all of its links.
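
One way to collect the links is page.$$eval(), which runs a callback inside the page against every element matching a selector. The sketch below is one possible approach; fetchLinks and uniqueLinks are names invented for this example.

```javascript
// Remove duplicate URLs while keeping the original order.
function uniqueLinks(hrefs) {
  return [...new Set(hrefs)];
}

// `page` is a Puppeteer page that is already logged in.
async function fetchLinks(page, url) {
  await page.goto(url);
  // The callback runs in the browser; a.href is the resolved absolute URL.
  const hrefs = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => a.href)
  );
  return uniqueLinks(hrefs);
}

// Usage (inside an async function):
//   const links = await fetchLinks(page, 'https://example.com/dashboard');
//   console.log(links);
```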

Final Code
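
Putting the steps together, a complete script might look like the following sketch. Every value in settings is a placeholder, the function name crawl is made up for this example, and submitting the login form with Enter is an assumption about the target site.

```javascript
// Launch Chrome, log in, collect the links of one page, then shut down.
// The second parameter exists only so a stand-in can be passed for testing;
// by default the real Puppeteer package is loaded.
async function crawl(settings, puppeteer = require('puppeteer')) {
  // headless: true is the default; set it to false while debugging.
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();

    // Login
    await page.goto(settings.loginUrl);
    await page.type(settings.usernameSelector, settings.username);
    await page.type(settings.passwordSelector, settings.password);
    await Promise.all([
      page.waitForNavigation(),
      page.keyboard.press('Enter'), // assumption: Enter submits the form
    ]);

    // Fetch all links
    await page.goto(settings.targetUrl);
    return await page.$$eval('a[href]', (anchors) =>
      anchors.map((a) => a.href)
    );
  } finally {
    // Always close the browser, even if a step above throws.
    await browser.close();
  }
}

// Usage with placeholder values:
//   crawl({
//     loginUrl: 'https://example.com/login',
//     targetUrl: 'https://example.com/dashboard',
//     usernameSelector: '#username',
//     passwordSelector: '#password',
//     username: 'me',
//     password: 'secret',
//   }).then(console.log);
```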