Puppeteer.JS – Using Headless Chrome for Site Crawling

Puppeteer essentially allows you to automate Chrome.
Headless Chrome lets you run Chrome without displaying a browser window. That may sound silly, but it has a lot of useful applications. For example, you could write a simple test script that checks that your website still works correctly.

Installation

npm i puppeteer
# or
yarn add puppeteer

Usage

We are going to look at a quick example of how to log in to a site and then perform some operation on it.

Initialize Puppeteer

You need to run it in an async function, simply because you do not know how long it will take until Chrome has started, so the code has to await the result.

Calling puppeteer.launch() starts our browser. The headless flag is set to ‘true’ by default; for debugging purposes, however, you should set it to ‘false’.
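The launch step could look roughly like the sketch below. It is not the original listing: the helper name withBrowser is made up for illustration, and the Puppeteer module is passed in as an argument (an assumption of this sketch) so the helper can also be exercised with a stub. It relies only on puppeteer.launch(), browser.newPage() and browser.close().

```javascript
// Launches a browser, hands a fresh page to `task`, and always closes the
// browser afterwards, even if `task` throws.
// `puppeteer` is the module returned by require('puppeteer').
async function withBrowser(puppeteer, task, options = { headless: false }) {
  // headless: false opens a visible window, which helps while debugging.
  const browser = await puppeteer.launch(options);
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    await browser.close();
  }
}

// Usage (assumes the puppeteer package is installed):
//   const puppeteer = require('puppeteer');
//   withBrowser(puppeteer, async (page) => {
//     await page.goto('https://example.com');
//   });
```

Closing the browser in a finally block avoids leaving stray Chrome processes behind when a step fails.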

Login

To log in to the site we need three things:
* The URL for the Login Page
* CSS Selector for the Username Field
* CSS Selector for the Password Field

To obtain the selectors you can use the Chrome DevTools (F12). Simply right-click the HTML field in the Elements panel and choose Copy > Copy selector.
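With those three values in hand, the login step might be sketched like this. The URL, selectors and the login helper are hypothetical placeholders, and the sketch assumes that the form submits when Enter is pressed in the password field and that a successful login triggers a page navigation. The calls used are page.goto(), page.type(), page.keyboard.press() and page.waitForNavigation().

```javascript
// Hypothetical values: replace them with the real login URL and the
// selectors copied from DevTools.
const loginConfig = {
  url: 'https://example.com/login',
  usernameSelector: '#username',
  passwordSelector: '#password',
};

// `page` is a Puppeteer Page, e.g. obtained from `await browser.newPage()`.
async function login(page, config, username, password) {
  await page.goto(config.url);
  await page.type(config.usernameSelector, username);
  await page.type(config.passwordSelector, password);
  // Start waiting for the navigation before pressing Enter, so a fast
  // redirect cannot complete before we are listening for it.
  await Promise.all([
    page.waitForNavigation(),
    page.keyboard.press('Enter'),
  ]);
}
```

If the site needs an explicit click on a submit button, page.click() with a fourth selector would take the place of the Enter key press.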

Fetch all Links

Now that you are logged in to the site, you can navigate to any of its pages and fetch all the links.
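One way to collect the links is page.$$eval(), which runs a callback inside the page against every element matching a selector. The sketch below is an illustration under that assumption; the fetchLinks name is hypothetical, and removing duplicates is an extra convenience rather than something the original requires.

```javascript
// Navigates to `url` and returns the unique href values of all anchors.
// `page` is a Puppeteer Page that is already logged in.
async function fetchLinks(page, url) {
  await page.goto(url);
  // The callback executes in the browser context, so it may only use
  // values that exist inside the page.
  const hrefs = await page.$$eval('a[href]', (anchors) =>
    anchors.map((anchor) => anchor.href)
  );
  // A Set keeps the first occurrence of each link and preserves order.
  return [...new Set(hrefs)];
}
```

Reading anchor.href (the property) instead of the raw attribute returns absolute URLs, which is usually what a crawler wants.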

Final Code

Udacity – Web Tooling and Automation

I recently took a look at the course materials for Udacity's Web Tooling and Automation course.

Overall the course is very well structured and introduces Gulp and a couple of common packages used in web development. Besides the main topic, it covers good engineering practices, like linting and testing to ensure code quality.

While working on the project I ran into several small things that were quite annoying. Thankfully the gulp community is quite big, so somebody had already solved some of the issues I was facing.

Passing a “--production” flag

When developing, you will probably create two versions of your software: a development build that makes it easy to find bugs and errors, and a production build that is minified and optimized for the end user.

You could define two different tasks in gulp, one “default” and one “production” task. This, however, would force you to duplicate your code: once with optimization and once without.

I found the package “gulp-if”, which lets you control whether a step such as compression is active during a task.
The remaining issue was to actually set that parameter before the tasks run, since tasks in gulp run in parallel.

To get a flag from the command line, you can use the process.argv array. However, you must add “--” before your flag name. If you do not, gulp will assume it is the name of another task that should run.

In the end, the flag is read once from process.argv and handed to gulp-if wherever a step should be optional.
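A minimal sketch of the flag check is shown below. The hasFlag helper is a hypothetical name, and the commented gulp-if line assumes gulp-uglify as the compression step, which the original does not specify.

```javascript
// Returns true when `flag` (e.g. "--production") was passed on the command
// line. argv[0] is the node binary and argv[1] the script being run, so the
// user's arguments start at index 2.
function hasFlag(flag, argv = process.argv) {
  return argv.slice(2).includes(flag);
}

// Evaluated once when the gulpfile is loaded, before any task runs.
const isProduction = hasFlag('--production');

// Inside a task, gulp-if then switches a step on or off, for example:
//   .pipe(gulpif(isProduction, uglify()))
// with gulpif = require('gulp-if') and uglify = require('gulp-uglify').
```

Because the flag is read while the gulpfile loads, every task sees the same value regardless of the order in which the tasks start.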

Note: In Gulp 4, you can sequence tasks with gulp.series() and would not need to pass in the flag on the command line. Instead, you would define a task that sets the flag and runs before all the other tasks.

Dealing with Asset sources and destinations

When using gulp.src() and gulp.dest(), people typically pass string literals to define the locations. However, this is quite annoying if you want a quick overview of which locations are used. For better maintainability you should create a small variable block that defines these strings. In the long run it also makes it easier to change where your files live.
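Such a variable block could look like the sketch below. All directory names and globs are hypothetical and only illustrate the layout; the real values depend on the project.

```javascript
// Single place that lists every source glob and output directory.
// The concrete paths are placeholders, not the ones from the course project.
const paths = {
  styles: { src: 'src/styles/**/*.css', dest: 'dist/css' },
  scripts: { src: 'src/scripts/**/*.ts', dest: 'dist/js' },
  views: { src: 'src/views/**/*.pug', dest: 'dist' },
};

// Inside a task the strings are then looked up instead of repeated:
//   gulp.src(paths.scripts.src)
//     .pipe(/* compile, minify, ... */)
//     .pipe(gulp.dest(paths.scripts.dest));
```

Moving the output folder then means changing one line instead of hunting through every task.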

End Result

At the end of the course I ended up with this gulpfile.js. It adds support for TypeScript, Pug (Jade), and google-closure-compiler.

The common gulp tasks to run are:
* gulp serve: Uses browser-sync with CSS injection for live editing
* gulp --production: Creates an optimized build

Next steps:
Depending on your web server, you may want to add a gulp deploy task.

gulpfile.js

package.json