How to Scrape a Website Using Puppeteer in Node.js?

Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can be used to automate, test and scrape web pages in either a headless or a headful browser.

Installing Puppeteer: To use Puppeteer, you must have Node.js installed. Puppeteer can then be installed from the command line using the npm package manager:

npm install puppeteer

Using Puppeteer: The Puppeteer library can be imported in your script using:

const puppeteer = require('puppeteer');

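It is important to remember that Puppeteer is a promise-based library which performs asynchronous calls to the headless Chrome instance. Therefore, we wrap the code in an async function so that await can be used on each call. The function is written as an immediately invoked function expression, which means it is executed as soon as it is defined.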
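The async wrapper can be tried without a browser. The following is a minimal sketch, assuming a hypothetical fakeBrowserCall function that stands in for an asynchronous Puppeteer call; it is not part of the Puppeteer API. It shows that the wrapper starts running at once, pauses at await, and lets the rest of the script continue in the meantime.

```javascript
// Hypothetical stand-in for an asynchronous Puppeteer call;
// it resolves after a short delay instead of talking to a browser.
function fakeBrowserCall(value) {
    return new Promise(resolve => setTimeout(() => resolve(value), 10));
}

// Records the order in which the steps run
const order = [];

// The async function is defined and invoked in a single expression
const done = (async () => {
    order.push('wrapper started');
    const title = await fakeBrowserCall('Example title');
    order.push('got: ' + title);
    return title;
})();

// This line runs while the wrapper is still waiting at await
order.push('script continued');

done.then(() => console.log(order));
```

The wrapper returns a promise, so any code that depends on its result should go inside the wrapper or be chained with then.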

Here is a simple example to take a screenshot of a page:



// Import Puppeteer
const puppeteer = require('puppeteer');

(async () => {
    // Launch a headless browser instance
    const browser = await puppeteer.launch();

    // Open a new page in the headless browser
    const page = await browser.newPage();

    // Visit the page in the browser
    await page.goto('https://scrapethissite.com');

    // Save a screenshot at the given path
    await page.screenshot({path: 'screenshot.png'});

    // Close the browser instance
    await browser.close();
})();

Running your Code: Save your code as a JavaScript file and run it from the command line using the following command:

node filename.js

Example: The following code scrapes the NHL hockey team table at https://scrapethissite.com/pages/forms/ and prints an object that maps each team name to its number of wins:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://scrapethissite.com/pages/forms/');

    // Collect the text of every team-name cell in the table
    const textsArray = await page.evaluate(
        () => [...document.querySelectorAll(
            '#hockey > div > table > tbody > tr > td.name')]
            .map(elem => elem.innerText)
    );

    // Collect the text of every wins cell, in the same row order
    const winsArray = await page.evaluate(
        () => [...document.querySelectorAll(
            '#hockey > div > table > tbody > tr > td.wins')]
            .map(elem => elem.innerText)
    );

    // Map each team name to the wins value from the same row
    const result = {};
    textsArray.forEach((name, i) => {
        result[name] = winsArray[i];
    });
    console.log(result);

    await browser.close();
})();

Output:

{ 'Boston Bruins': '36', 'Buffalo Sabres': '31', 'Calgary Flames': '31', 'Chicago Blackhawks': '36',
'Detroit Red Wings': '34', 'Edmonton Oilers': '37', 'Hartford Whalers': '31', 'Los Angeles Kings': '46',
'Minnesota North Stars': '27', 'Montreal Canadiens': '39', 'New Jersey Devils': '32', 'New York Islanders': '25',
'New York Rangers': '36', 'Philadelphia Flyers': '33', 'Pittsburgh Penguins': '41', 'Quebec Nordiques': '16',
'St. Louis Blues': '47', 'Toronto Maple Leafs': '23', 'Vancouver Canucks': '28', 'Washington Capitals': '37',
'Winnipeg Jets': '26' }
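Because the result object is keyed by team name, a name that occurs in more than one table row keeps only the value from the last such row. One possible alternative is to keep one record per row. The following is a minimal sketch, assuming a hypothetical helper called toRecords and made-up sample input; neither comes from Puppeteer or from the scraped page.

```javascript
// Hypothetical helper, not part of Puppeteer: pairs two parallel arrays
// of cell texts into one record per table row.
function toRecords(names, wins) {
    return names.map((name, i) => ({
        name: name.trim(),
        wins: Number(wins[i])   // convert the scraped text to a number
    }));
}

// Illustrative input only; two rows share the same team name
const records = toRecords(['Team A', 'Team B', 'Team A'], ['10', '20', '30']);
console.log(records);
```

In the scraping script, the arrays returned by the two page.evaluate calls could be passed to such a helper in place of the forEach loop.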



