Create a Newsletter Sourcing Data using MongoDB

Last Updated : 03 Apr, 2023

There are many news delivery websites available like ndtv.com. In this article, let us see the very useful and interesting feature of how to get the data from ndtv.com via scraping feature i.e. extracting the contents from ndtv.com and storing them into MongoDB. MongoDB is a NoSQL Documentum model database.

Using Mongoose, Node JS, and Cheerio, the NDTV news website is scraped and data is loaded into the Mongo DB database. This is a full-stack JavaScript app built using MongoDB, Mongoose, Node.js, Express.js, Handlebars.js, HTML, and CSS. It scrapes the [NDTV](https://ndtv.com/) homepage and stores article titles and links.

Module Installation: Install the required modules using the following command.

npm install body-parser
npm install cheerio
npm install express
npm install express-handlebars
npm install mongoose
npm install request

Project Structure: It will look like this.

Implementation:

Filename: server.js: This is the important file required to start the app running. To call the ndtv site, scrape the data, and store it in MongoDB database.

Javascript

// First specifying the required dependencies
// Express is a minimal and flexible Node.js
// web application framework that provides a
// robust set of features for web and mobile
// applications
const express = require("express");
 
// To communicate with mongodb, we require "mongoose"
const mongoose = require("mongoose");
 
// As we need to call ndtv website and access
// the urls, we require "request"
const request = require("request");
 
// Cheerio parses markup and provides an
// API for traversing/manipulating the
// resulting data structure
const cheerio = require("cheerio");
 
// Node.js body parsing middleware.
// Parse incoming request bodies in a
// middleware before your handlers,
// available under the req.body property.
const bodyParser = require("body-parser");
const exphbs = require("express-handlebars");
 
// We can explicitly set the port number
// provided no other instances running
// on that port
const PORT = process.env.PORT || 3000;
 
// Initialize Express
const app = express();
 
// Use body-parser for handling form submissions
app.use(bodyParser.urlencoded({
    extended: false
}));
 
// We are getting the output in the
// form of application/json
app.use(bodyParser.json({
    type: "application/json"
}));
 
// Serve the public directory
app.use(express.static("public"));
 
// Use promises with Mongo and connect to
// the database
// Let us have our mongodb database name
// to be ndtvnews By using Promise,
// Mongoose async operations, like .save()
// and queries, return thenables.
mongoose.Promise = Promise;
const MONGODB_URI = process.env.MONGODB_URI
    || "mongodb://localhost/ndtvnews";
 
mongoose.connect(MONGODB_URI);
 
// Use handlebars
app.engine("handlebars", exphbs({
    defaultLayout: "main"
}));
 
app.set("view engine", "handlebars");
 
// Hook mongojs configuration to the db variable
const db = require("./models");
 
// We need to filter out NdtvArticles from
// the database that are not saved
// It will be called on startup of url
app.get("/", function (req, res) {
 
    db.Article.find({
        saved: false
    },
        function (error, dbArticle) {
            if (error) {
                console.log(error);
            } else {
 
                // We are passing the contents
                // to index.handlebars
                res.render("index", {
                    articles: dbArticle
                });
            }
        })
})
 
// Use cheerio to scrape stories from NDTV
// and store them
// We need to do this on one time basis each day
app.get("/scrape", function (req, res) {
    request("https://ndtv.com/", function (error, response, html) {
 
        // Load the html body from request into cheerio
        const $ = cheerio.load(html);
 
        // By inspecting the web page we know how to get the
        // title i.e. headlines of news.
        // From view page source also we can able to get it.
        // It differs in each web page
        $("h2").each(function (i, element) {
 
            // The trim() removes whitespace because the
            // items return \n and \t before and after the text
            const title = $(element).find("a").text().trim();
            console.log("title", title);
            const link = $(element).find("a").attr("href");
            console.log("link", link);
 
            // If these are present in the scraped data,
            // create an article in the database collection
            if (title && link) {
                db.Article.create({
                    title: title,
                    link: link
                },
                    function (err, inserted) {
                        if (err) {
 
                            // Log the error if one is
                            // encountered during the query
                            console.log(err);
                        } else {
 
                            // Otherwise, log the inserted data
                            console.log(inserted);
                        }
                    });
 
                // If there are 10 articles, then
                // return callback to the frontend
                console.log(i);
                if (i === 10) {
                    return res.sendStatus(200);
                }
            }
        });
    });
});
 
// Route for retrieving all the saved articles.
// User has the option to save the article.
// Once it is saved, "saved" column in the
// collection is set to true.
// Below routine helps to find the articles
// that are saved
app.get("/saved", function (req, res) {
    db.Article.find({
        saved: true
    })
        .then(function (dbArticle) {
 
            // If successful, then render with
            // the handlebars saved page
            // this time saved.handlebars is
            // called and that page is rendered
            res.render("saved", {
                articles: dbArticle
            })
        })
        .catch(function (err) {
 
            // If an error occurs, send the
            // error back to the client
            res.json(err);
        })
});
 
// Route for setting an article to saved
// In order to save an article, this routine is used.
// _id column in collection is unique and it will
// determine the uniqueness of the news
app.put("/saved/:id", function (req, res) {
    db.Article.findByIdAndUpdate(
        req.params.id, {
        $set: req.body
    }, {
        new: true
    })
        .then(function (dbArticle) {
 
            // This time saved.handlebars is
            // called and that page is rendered
            res.render("saved", {
                articles: dbArticle
            })
        })
        .catch(function (err) {
            res.json(err);
        });
});
 
// Route for saving a new note to the db and
// associating it with an article
app.post("/submit/:id", function (req, res) {
    db.Note.create(req.body)
        .then(function (dbNote) {
            let articleIdFromString =
                mongoose.Types.ObjectId(req.params.id)
 
            return db.Article.findByIdAndUpdate(
                articleIdFromString, {
                $push: {
                    notes: dbNote._id
                }
            })
        })
        .then(function (dbArticle) {
            res.json(dbNote);
        })
        .catch(function (err) {
 
            // If an error occurs, send it
            // back to the client
            res.json(err);
        });
});
 
// Route to find a note by ID
app.get("/notes/article/:id", function (req, res) {
    db.Article.findOne({ "_id": req.params.id })
        .populate("notes")
        .exec(function (error, data) {
            if (error) {
                console.log(error);
            } else {
                res.json(data);
            }
        });
});
 
 
app.get("/notes/:id", function (req, res) {
    db.Note.findOneAndRemove({ _id: req.params.id },
        function (error, data) {
            if (error) {
                console.log(error);
            }
            res.json(data);
        });
});
 
// Listen for the routes
app.listen(PORT, function () {
    console.log("App is running");
});

Steps to run the application: Run the server.js file using the following command.

node server.js

Output: We will see the following output on the terminal screen.

App is running

Now open any browser and go to http://localhost:3000/, we will get a similar page like below.

To get the news from ndtv.com, we need to click on Get New Articles. This will internally call our /scrape path. Once this call is done, in MongoDB, under ndtvnews database, articles named collection got filled with the data as shown below:

articles collection

Here, the initially saved attribute will be false, id is automatically got created in MongoDB and this is the unique identification of a document in a collection. This attribute only helps to view a document, save a document, etc.

Extracted articles are displayed in this format

On clicking on View article on NDTV, it will navigate to the respective article. This is getting possible only because of id attribute which is present in the articles collection. So, when we click on View article on NDTV, as it is a hyperlink, directly that document _id value is internally picked up and the link is displayed. When the Save article is clicked, the _Id value will be the identification part for that article.

Working: Entire working model of the project is explained in the video:

Conclusion: It is easier and simpler to scrape any news website and display the title contents alone along with a link that follows to proceed, and we can save the article and check out the saved articles easily.

Reference: https://github.com/raj123raj/NdtvNewsScraperUsingMongoDB

Suggest improvement

How to highlight syntax in files using Node.js ?

How to extract the user name from the email ID using PHP ?

Share your thoughts in the comments