Proof of Concept on News Aggregator using Big Data Technologies
Big Data refers to datasets characterized by high volume, high velocity, and wide variety. For example, the searches that a very large number of users run on Google at the same time produce a very large dataset. This article discusses a proof of concept (POC) on a news aggregator using Big Data technologies (Hadoop, Hive, and Pig). The operations run as MapReduce jobs. To express them, we will use HiveQL (Hive Query Language), a SQL-like query language for processing structured data with Hive. Hive is a data warehouse tool built on top of Hadoop that makes querying and analysis easier.
This article walks through the implementation approach for the POC. Using Big Data technologies such as Hadoop, Hive, and Pig, we will answer queries such as: the number of news items in each category, the count of distinct titles in the table, the publisher name and title of news in the business category, the most recently published news item, titles from a given publisher, and the alphanumeric ID of each cluster of news about the same story. Let's discuss them one by one.
Proof of Concept on News Aggregator:
- This POC is based on news aggregator data.
- The public dataset is available at the website link below.
Industry: Social Media
The publicly available dataset has the following attributes:
- ID - Numeric ID of type integer.
- TITLE - News title of type String.
- URL - URL of type String.
- PUBLISHER - Publisher name of type String.
- CATEGORY - News category of type String.
- STORY - Alphanumeric ID of the cluster that includes news about the same story.
- HOSTNAME - URL hostname of type String.
- TIME - Approximate time the news was published.
The POC answers the following queries:
- Find the number of news items in each category.
- Count the number of distinct titles in the table.
- Find the publisher name and title of news in the business category.
- Find the most recently published news item, based on the approximate publication time.
- Find 5 titles from the table that were published by the Los Angeles Times.
- Find the alphanumeric ID of each story cluster and the number of news items about that story.
The purpose of the following Hive shell commands is to create a table, load the data, and execute the queries that produce the results.
Creating Table: Create the table using the following query.
hive> create table news ( id bigint, title String, url String, publishername String, category String, story String, hostname String, time bigint )
    > row format delimited
    > fields terminated by '\t'
    > lines terminated by '\n'
    > stored as textfile;
Loading Tables: Load the data file into the table using the following query.
hive> load data local inpath '/home/training/Desktop/news.txt'
    > overwrite into table news;
Output: To display the loaded data, use the following query.
hive>select * from news;
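Before running the analysis queries, it can help to confirm that the load worked as expected. The following is a minimal sketch, assuming the table is named news as in the load step; the limit of 10 rows is an arbitrary choice for illustration.

```sql
-- Show the column names and types of the table.
DESCRIBE news;

-- Count the rows that were loaded.
SELECT COUNT(*) FROM news;

-- Preview a few rows instead of printing the whole table.
SELECT * FROM news LIMIT 10;
```

If columns come back as NULL, a likely cause is that the delimiter in the file does not match the one declared when the table was created.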
1. Find the number of news items in each category.
hive> SELECT category, COUNT(*) FROM news GROUP BY category;
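The result of a query can also be written to files instead of being printed to the console. The sketch below assumes a hypothetical output directory, /tmp/news_by_category, which is not part of the original setup. It also assumes a Hive version that accepts ROW FORMAT in this statement (0.11 or later); the alias news_count is illustrative.

```sql
-- Hypothetical output directory; Hive overwrites whatever it already contains.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/news_by_category'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT category, COUNT(*) AS news_count
FROM news
GROUP BY category;
```

The same pattern could be applied to any of the other queries by swapping the SELECT statement.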
2. Count the number of distinct titles in the table.
hive> SELECT COUNT(DISTINCT title) FROM news;
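COUNT(DISTINCT title) returns a single number: how many different titles exist. If the goal is instead to see how often each title occurs, a grouped count is one way to sketch it. The alias occurrences and the limit of 10 are illustrative choices, not part of the original queries.

```sql
-- Count how many rows share each title, most repeated first.
SELECT title, COUNT(*) AS occurrences
FROM news
GROUP BY title
ORDER BY occurrences DESC
LIMIT 10;
```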
3. Find the publisher name and title of news in the business category (category code 'b').
hive> SELECT title, publishername FROM news WHERE category = 'b';
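4. Find the most recently published news item, based on the approximate publication time. ORDER BY is used because it guarantees a total ordering of the output, whereas SORT BY only orders rows within each reducer.
hive> SELECT * FROM news ORDER BY time DESC LIMIT 1;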
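Because time is stored as a bigint, the raw value is hard to read. The sketch below converts it to a readable timestamp under one loudly stated assumption: that the column holds milliseconds since the Unix epoch, which the original does not confirm. If the values are already in seconds, the division should be dropped. The backticks around time are a precaution, since some newer Hive versions treat TIME as a reserved keyword; the alias published_at is illustrative.

```sql
-- ASSUMPTION: `time` holds milliseconds since the Unix epoch.
-- from_unixtime expects seconds, so divide by 1000 and cast back to BIGINT.
SELECT title,
       publishername,
       from_unixtime(CAST(`time` / 1000 AS BIGINT)) AS published_at
FROM news
ORDER BY `time` DESC
LIMIT 1;
```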
5. Find 5 titles from the table that were published by the Los Angeles Times.
hive> SELECT title FROM news where publishername='Los Angeles Times' LIMIT 5;
6. Find the alphanumeric ID of each story cluster and the number of news items about that story.
hive> SELECT story, COUNT(*) FROM news GROUP BY story;
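The grouped result can contain many story clusters, so ranking them is often more useful than listing them all. A minimal sketch follows; the alias news_count and the limit of 10 are arbitrary choices for illustration.

```sql
-- Rank story clusters by the number of news items they contain.
SELECT story, COUNT(*) AS news_count
FROM news
GROUP BY story
ORDER BY news_count DESC
LIMIT 10;
```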