Integration of Hadoop and R Programming Language
  • Last Updated : 17 Jan, 2021

Hadoop is an open-source framework introduced by the Apache Software Foundation (ASF), and it is one of the most widely used frameworks for coping with Big Data. Hadoop is written in Java, and it is not based on OLAP (Online Analytical Processing). It is scalable and can be deployed for structured, semi-structured, and unstructured data alike. Hadoop acts as middleware: it provides a platform for managing a large and complex cluster of computers. Although Java is the main programming language for Hadoop, other languages such as R, Python, or Ruby can be used too.

The Hadoop framework includes:

  • Hadoop Distributed File System (HDFS): a robust distributed file system. Hadoop also includes YARN, a framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: a system for parallel processing of large data sets that implements the MapReduce model of distributed programming.
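
To make the MapReduce model concrete, here is a minimal single-machine sketch in base R. It does not use Hadoop at all; the record format, the data, and the function names map_record and reduce_values are assumptions made up for illustration. It only shows the three phases (map, shuffle, reduce) that Hadoop runs in parallel across a cluster.

```r
# Local, single-machine sketch of the MapReduce model in base R.
# No Hadoop is involved; all names and data here are made up for illustration.

# Input records: one "region,amount" string per line.
records <- c("north,10", "south,5", "north,7", "east,3", "south,8")

# Map: turn one record into a key-value pair.
map_record <- function(line) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  list(key = parts[1], value = as.numeric(parts[2]))
}

# Reduce: collapse all values that share a key into one result.
reduce_values <- function(key, values) {
  sum(values)
}

# Map phase: apply the mapper to every record independently.
pairs <- lapply(records, map_record)

# Shuffle phase: group the emitted values by key.
keys    <- vapply(pairs, function(p) p$key, character(1))
values  <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)

# Reduce phase: apply the reducer once per key.
totals <- mapply(reduce_values, names(grouped), grouped)

print(totals)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what Hadoop does with the same two functions.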

Hadoop provides distributed storage through HDFS and an analysis system through MapReduce. Its architecture lets a cluster scale up or down as the user requires, from one server to hundreds or thousands of computers, with a high degree of fault tolerance. Hadoop has proved itself in big data processing and efficient storage management, scales to very large clusters, and is supported by major vendors in the software industry.

Data is arguably the most valuable asset an organization has. To deal with huge volumes of structured and unstructured data, we need a tool that can analyze it effectively. We get such a tool by combining the features of the R language with the Hadoop framework for big data analysis, and the combination scales far better than R alone. Integrating the two is what lets us draw better insights and results from the data. The sections below go through the various methods that help to integrate them.

R is an open-source programming language that is used extensively for statistical and graphical analysis. R offers a large variety of statistical and mathematical libraries (linear and nonlinear modeling, classical statistical tests, time-series analysis, data classification, data clustering, etc.) as well as graphical techniques for processing data efficiently.



One major strength of R is that it produces well-designed, high-quality plots with ease, including mathematical symbols and formulae where needed. If you need strong data-analytics and visualization features, combining R with Hadoop is a good way to reduce the complexity of the task. R is a highly extensible, object-oriented programming language with strong graphical capabilities.

Some reasons why R is considered a good fit for data analytics:

  • A robust collection of packages
  • Powerful data visualization techniques
  • Commendable statistical and graphical programming features
  • Object-oriented programming language
  • A wide collection of operators for calculations on arrays, in particular matrices
  • Graphical representation capability on display or on hard copy.

The Main Motive behind R and Hadoop Integration:

R is without doubt one of the most popular programming languages for statistical computing, graphical analysis of data, data analytics, and data visualization. Hadoop, on the other hand, is a powerful Big Data framework capable of dealing with large amounts of data. The distributed file system (HDFS) of Hadoop plays a vital role in all processing and analysis of that data, and Hadoop applies the MapReduce approach during processing (made available in R by the rmr package of RHadoop), which makes data analysis more efficient and easier.

What happens if the two work together? The efficiency of data management and analysis can increase many times over. So, to make data analytics and visualization efficient, we combine R with Hadoop.

Once these two technologies are joined, R's statistical computing power increases, and we are able to:

  • Use Hadoop to execute R code.
  • Use R to access the data stored in Hadoop.

Several ways to integrate R and Hadoop:

The most popular and frequently chosen methods are described below. Others, such as RODBC/RJDBC, can also be used but are less popular. The general architecture of analytics tools integrated with Hadoop has the following layered structure.



The first layer: the hardware layer, which consists of a cluster of computer systems.

The second layer: the middleware layer of Hadoop. This layer takes care of distributing files through HDFS and of running MapReduce jobs.

The third layer: the interface layer, which provides the interface for data analysis. At this level we can use a tool like Pig, which provides a high-level platform for creating MapReduce programs in a language called Pig Latin. We can also use Hive, a data warehouse infrastructure developed by Apache and built on top of Hadoop. Hive lets us run complex queries and analyze data using an SQL-like language called HiveQL, and it also supports implementing MapReduce tasks.

Besides Hive and Pig, we can use the RHIPE or RHadoop libraries, which provide an interface between Hadoop and R. They enable users to access data in the Hadoop file system and to write their own scripts that implement the Map and Reduce jobs. We can also use Hadoop streaming, a technology for integrating Hadoop with programs written in other languages.

a) RHadoop: The RHadoop method includes four packages, which are as follows:

  • The rmr package – provides Hadoop MapReduce functionality in R. The R programmer only has to divide the logic of the application into map and reduce phases and submit it with the rmr methods. The rmr package then calls Hadoop streaming and the MapReduce API with job parameters such as input directory, output directory, mapper, and reducer to run the R MapReduce job on the Hadoop cluster (most of the components are similar to those of Hadoop streaming).
  • The rhbase package – allows R developers to connect to Hadoop HBase from R through a Thrift server. It offers functions to read, write, and modify tables stored in HBase from R.

A script that uses the RHadoop functionality looks like the following.

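library(rmr)  # RHadoop MapReduce package
# Mapper: receives a key k and a value v and emits intermediate key-value pairs.
map <- function(k, v) {...}
# Reducer: receives a key k and vv, all the values emitted for that key.
reduce <- function(k, vv) {...}
# Submit the job; rmr passes these parameters on to Hadoop.
mapreduce(
  input = "data.txt",
  output = "output",
  textinputformat = rawtextinputformat,
  map = map,
  reduce = reduce
)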
  • The rhdfs package – provides HDFS file management in R, since the data itself is stored in the Hadoop file system. Its functions include file manipulation (hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get, etc.), file read/write (hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, etc.), directory handling (hdfs.dircreate, hdfs.mkdir), and initialization (hdfs.init, hdfs.defaults).
  • The plyrmr package – provides data manipulation, summaries of output results, and set operations (union, intersection, subtraction, merge, unique).

b) RHIPE: RHIPE is used in R to perform intricate analysis of large collections of data sets via Hadoop. It is an integrated programming environment that grew out of the Divide and Recombine (D&R) approach to analyzing huge amounts of data.

RHIPE = R and Hadoop Integrated Programming Environment

RHIPE is an R package that enables the use of the Hadoop API from R. This way we can read and save the complete data that is created using RHIPE MapReduce. RHIPE comes with many features that help us interact effectively with HDFS. Other languages such as Perl, Java, or Python can also be used to read data sets in RHIPE. The general structure of an R script that uses RHIPE is as follows.

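library(Rhipe)
rhinit(TRUE, TRUE)  # initialize RHIPE before any other call
# Map expression: applied to the input values available as map.values.
map <- expression({lapply(map.values, function(mapper) ...)})
# Reduce expression: pre runs before, reduce during, and post after the values of a key.
reduce <- expression(
  pre = {...},
  reduce = {...},
  post = {...}
)
# Describe the job: input and output folders, text in and text out.
x <- rhmr(
  map = map, reduce = reduce,
  ifolder = inputPath,
  ofolder = outputPath,
  inout = c('text', 'text'),
  jobname = 'a job name')
rhex(x)  # submit the job object created above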
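
The three-part reduce expression in the script above can be hard to picture, so here is a rough local emulation in base R. It does not call RHIPE or Hadoop: run_reducer and emit are hypothetical helpers written only for this sketch, and the chunk size is an arbitrary assumption. The variable names reduce.key and reduce.values mirror the ones RHIPE makes available inside a reduce expression.

```r
# Local emulation of a three-part reducer (pre / reduce / post).
# Nothing here calls RHIPE or Hadoop; run_reducer and emit are hypothetical helpers.

reducer <- expression(
  pre    = { total <- 0 },                                  # once per key, before any values
  reduce = { total <- total + sum(unlist(reduce.values)) }, # once per chunk of values
  post   = { emit(reduce.key, total) }                      # once per key, after all values
)

run_reducer <- function(reducer, key, values, chunk_size = 2) {
  out <- list()
  env <- new.env()
  env$reduce.key <- key
  env$emit <- function(k, v) out[[k]] <<- v   # stand-in for emitting a result pair
  eval(reducer[["pre"]], env)
  # Values for one key may arrive in several chunks, so reduce can run many times.
  chunks <- split(values, ceiling(seq_along(values) / chunk_size))
  for (chunk in chunks) {
    env$reduce.values <- as.list(chunk)
    eval(reducer[["reduce"]], env)
  }
  eval(reducer[["post"]], env)
  out
}

result <- run_reducer(reducer, "north", c(10, 7, 4))
print(result)
```

The point of the split into pre, reduce, and post is that the reducer never needs all values of a key in memory at once: it keeps a running state and updates it chunk by chunk.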

RHIPE allows the R user to create MapReduce jobs (as the rmr package also does) that work entirely within the R environment using R expressions. This MapReduce functionality allows an analyst to quickly specify Maps and Reduces using the full power, flexibility, and expressiveness of the interpreted R language.
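
The Divide and Recombine idea behind RHIPE can be sketched on one machine with base R. This is only an illustration under assumptions: it uses the built-in mtcars data set, divides by number of cylinders, and recombines by averaging coefficients, which is one simple recombination choice among several. On a cluster, each subset would be analyzed by a separate task.

```r
# Divide and Recombine (D&R) on one machine, using the built-in 'mtcars' data.

# Divide: split the data into subsets by number of cylinders.
subsets <- split(mtcars, mtcars$cyl)

# Apply: run the same analysis on every subset independently.
fit_subset <- function(d) coef(lm(mpg ~ wt, data = d))
per_subset <- lapply(subsets, fit_subset)

# Recombine: stack the per-subset coefficients and average them.
coef_table <- do.call(rbind, per_subset)
recombined <- colMeans(coef_table)

print(coef_table)
print(recombined)
```

In MapReduce terms, the divide and apply steps correspond to the map phase and the recombine step to the reduce phase.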

c) Oracle R Connector for Hadoop (ORCH):

ORCH is a collection of R packages that provide the following features.

  1. Interfaces to work with data maintained in Hive tables, the Apache Hadoop based computing infrastructure, the local R environment, and Oracle database tables.
  2. Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, that can be applied to data stored in HDFS files.

After installing this package in R you will be able to:

  • Access and transform HDFS data easily using a Hive-enabled transparency layer,
  • Write mappers and reducers in the R language,
  • Copy data between R memory, the local file system, HDFS, Hive, and Oracle databases,
  • Schedule R programs to execute as Hadoop MapReduce jobs and return the results to any of those locations.

Oracle R Connector for Hadoop enables access from a local R client to Apache Hadoop using the following function prefixes:

  1. hadoop – Identifies functions that provide an interface to Hadoop MapReduce.
  2. hdfs – Identifies functions that provide an interface to HDFS.
  3. orch – Identifies a variety of functions; orch is a general prefix for ORCH functions.
  4. ore – Identifies functions that provide an interface to a Hive data store.

d) Hadoop Streaming: Hadoop streaming is a Hadoop utility for running MapReduce jobs with executable scripts as the Mapper and Reducer. Supporting scripts are available as part of an R package on CRAN, whose aim is to make R more accessible to Hadoop streaming based applications.

This is analogous to the pipe operation in Linux. The text input file is written to the Mapper's standard input (stdin), the standard output (stdout) of the Mapper is provided as input to the Reducer, and finally the Reducer writes its output to the HDFS directory.

A command line with map and reduce tasks implemented as R scripts would look like the following.

$ ${HADOOP_HOME}/bin/hadoop jar \
${HADOOP_HOME}/contrib/streaming/*.jar \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-input input_data.txt \
-output output \
-mapper /home/tst/src/map.R \
-reducer /home/tst/src/reduce.R \
-file /home/tst/src/map.R \
-file /home/tst/src/reduce.R
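
The command above refers to map.R and reduce.R without showing them. The following is a minimal sketch of what their logic could look like for a word count, written so that it can be tried locally without Hadoop. The function names map_lines and reduce_lines and the sample input are assumptions for illustration. Unlike a production reducer, this one reads all of its input into memory at once.

```r
# Word count in the style Hadoop streaming expects: text lines in,
# "key<TAB>value" lines out. Function names and sample data are hypothetical.

# Mapper logic (would live in map.R): emit "word\t1" for every word.
map_lines <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Reducer logic (would live in reduce.R): sum the counts for each word.
reduce_lines <- function(lines) {
  if (length(lines) == 0) return(character(0))
  fields <- strsplit(lines, "\t", fixed = TRUE)
  keys   <- vapply(fields, `[`, character(1), 1)
  counts <- as.numeric(vapply(fields, `[`, character(1), 2))
  totals <- tapply(counts, keys, sum)
  paste(names(totals), totals, sep = "\t")
}

# Local stand-in for the streaming pipeline: map, sort by key, reduce.
input   <- c("big data with R", "R and Hadoop", "big clusters")
mapped  <- map_lines(input)
reduced <- reduce_lines(sort(mapped))
print(reduced)

# In a real script the functions would be wired to the standard streams, e.g.:
#   con <- file("stdin"); writeLines(map_lines(readLines(con))); close(con)
```

For use with Hadoop streaming, each script would typically also need an interpreter line such as #!/usr/bin/env Rscript at the top and execute permission, so that Hadoop can launch it as an executable.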

The main benefit of Hadoop streaming is that it allows both Java and non-Java MapReduce jobs to be executed on Hadoop clusters. Hadoop streaming supports Perl, Python, PHP, R, C++, and other programming languages.
