How to Scrape a Website Using Puppeteer in Node.js?
  • Last Updated : 29 Oct, 2020

Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can be used to automate, test and scrape web pages in either a headless or a headful browser.

Installing Puppeteer: To use Puppeteer, you must have Node.js installed. Puppeteer can then be installed from the command line using the npm package manager.

npm install puppeteer

Using Puppeteer: The Puppeteer library can be imported in your script using:

const puppeteer = require('puppeteer');

It is important to remember that Puppeteer is a promise-based library that makes asynchronous calls to the headless Chrome instance. We therefore wrap the code in an async function, so that await can be used inside it, and invoke that function immediately, so that the code runs as soon as the script starts.
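To make the behaviour of the async wrapper concrete, here is a minimal sketch that does not use Puppeteer at all. The fakeCall helper is a hypothetical stand-in for an asynchronous Puppeteer call, and the delays are arbitrary. It shows that the wrapper starts running immediately, pauses at each await, and lets the rest of the script continue in the meantime.

```javascript
// Hypothetical stand-in for an asynchronous Puppeteer call:
// resolves with `value` after `ms` milliseconds.
const fakeCall = (value, ms) =>
    new Promise(resolve => setTimeout(() => resolve(value), ms));

const events = [];

// The async function is defined and invoked in a single expression.
const done = (async () => {
    events.push('wrapper started');
    const first = await fakeCall('launched', 20);    // pauses here until resolved
    events.push(first);
    const second = await fakeCall('page opened', 10);
    events.push(second);
    return events;
})();

// This line runs before either awaited call has resolved.
events.push('script continues');

done.then(result => console.log(result));
// → [ 'wrapper started', 'script continues', 'launched', 'page opened' ]
```

The awaited calls complete in the order they are written, which is why Puppeteer scripts written this way read like sequential code.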

Here is a simple example to take a screenshot of a page:




// Import Puppeteer
const puppeteer = require('puppeteer');

(async () => {
    // Launch a headless browser instance
    const browser = await puppeteer.launch();

    // Open a new page in the headless browser
    const page = await browser.newPage();

    // Visit the page in the browser
    await page.goto('https://scrapethissite.com');

    // Save a screenshot at the given path
    await page.screenshot({path: 'screenshot.png'});

    // Close the browser instance
    await browser.close();
})();



Running your Code: Save your code as a JavaScript file and run it from the command line using the following command:

node filename.js
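The screenshot example closes the browser only if every earlier step succeeds. If page.goto() rejects, for instance, the Chrome process may be left running. One possible way to guard against that is a small wrapper with try/finally. This is a sketch only: withBrowser is a hypothetical helper, not part of Puppeteer, and it assumes nothing more than that launch returns a promise of an object with a close() method.

```javascript
// Hypothetical helper: runs `task` with a browser and always closes it,
// whether the task succeeds or throws.
// `launch` is any function that returns a promise of a browser-like object,
// for example () => puppeteer.launch().
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        return await task(browser);
    } finally {
        // Runs on success and on failure alike.
        await browser.close();
    }
}

// Possible usage with Puppeteer (left as a comment so the sketch has no dependencies):
//
// const puppeteer = require('puppeteer');
// withBrowser(() => puppeteer.launch(), async browser => {
//     const page = await browser.newPage();
//     await page.goto('https://scrapethissite.com');
//     await page.screenshot({path: 'screenshot.png'});
// });
```

Passing launch in as a parameter keeps the helper independent of how the browser is started, which also makes it easy to exercise with a stub.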

Example: The following code logs an object that maps each NHL hockey team name listed on the page to its number of wins:


const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://scrapethissite.com/pages/forms/');

    // Collect the text of every team-name cell in the table
    const textsArray = await page.evaluate(
        () => [...document.querySelectorAll(
            '#hockey > div > table > tbody > tr > td.name')]
            .map(elem => elem.innerText)
    );

    // Collect the text of every wins cell, in the same row order
    const winArray = await page.evaluate(
        () => [...document.querySelectorAll(
            '#hockey > div > table > tbody > tr > td.wins')]
            .map(elem => elem.innerText)
    );

    // Map each team name to the wins value at the same index
    const result = {};
    textsArray.forEach((teamName, i) => {
        result[teamName] = winArray[i];
    });
    console.log(result);
    await browser.close();
})();
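The example above queries the two columns separately and relies on both arrays having the same length and order. An alternative worth considering is to read each table row once, so a name and its wins always come from the same row. The sketch below uses Puppeteer's page.$$eval for this. scrapeTeamWins is a hypothetical helper, the shortened selector is an assumption based on the selectors used above, and page is assumed to be a Puppeteer page that has already navigated to the forms URL.

```javascript
// Hypothetical helper: `page` is assumed to be a Puppeteer Page already on
// https://scrapethissite.com/pages/forms/.
async function scrapeTeamWins(page) {
    // $$eval runs the callback in the browser with an array of every
    // element that matches the selector.
    return page.$$eval('#hockey table tbody tr', trs =>
        trs
            .map(tr => ({
                name: tr.querySelector('td.name'),
                wins: tr.querySelector('td.wins'),
            }))
            // Skip rows (such as a header row) that lack either cell.
            .filter(cells => cells.name && cells.wins)
            .map(cells => ({
                name: cells.name.innerText.trim(),
                wins: Number(cells.wins.innerText),
            }))
    );
}
```

This variant returns an array of { name, wins } objects, with wins converted to a number, instead of a single object keyed by team name.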



Output:

{ 'Boston Bruins': '36', 'Buffalo Sabres': '31', 'Calgary Flames': '31', 'Chicago Blackhawks': '36',
'Detroit Red Wings': '34', 'Edmonton Oilers': '37', 'Hartford Whalers': '31', 'Los Angeles Kings': '46',
'Minnesota North Stars': '27', 'Montreal Canadiens': '39', 'New Jersey Devils': '32', 'New York Islanders': '25',
'New York Rangers': '36', 'Philadelphia Flyers': '33', 'Pittsburgh Penguins': '41', 'Quebec Nordiques': '16',
'St. Louis Blues': '47', 'Toronto Maple Leafs': '23', 'Vancouver Canucks': '28', 'Washington Capitals': '37',
'Winnipeg Jets': '26'}
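The pairing step that produces this object can be tried in isolation, without a browser. The following is a small sketch with a hypothetical pairUp helper and made-up sample data (the repeated team and its first value are illustrative only). It shows two properties of the result: the values are strings, because innerText returns text, and a key that occurs more than once keeps only the last value paired with it.

```javascript
// Hypothetical helper: pairs two parallel arrays into one object.
function pairUp(keys, values) {
    const result = {};
    keys.forEach((key, i) => {
        result[key] = values[i];
    });
    return result;
}

// Illustrative sample data shaped like the two scraped columns.
const names = ['Boston Bruins', 'Buffalo Sabres', 'Boston Bruins'];
const wins = ['44', '31', '36'];

console.log(pairUp(names, wins));
// → { 'Boston Bruins': '36', 'Buffalo Sabres': '31' }
```

If numeric values are needed, each string can be converted with Number() before it is stored.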



