Data Analysis with Unix – Part 1
To see how data can be analyzed with Unix tools, we use a weather dataset.
Weather sensors collect data continuously at many locations across the globe and produce a large volume of log data. This makes it a good candidate for analysis with MapReduce, because all of the data needs to be processed and the data is record-oriented and semi-structured.
The data used is from the National Climatic Data Center (NCDC). It is stored in a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths. For simplicity, we focus on the basic elements, such as temperature, which are always present and of fixed width.
Structure of NCDC record
0057
332130   # USAF weather station identifier
99999    # WBAN weather station identifier
19500101 # observation date
0300     # observation time
4
+51317   # latitude (degrees x 1000)
+028783  # longitude (degrees x 1000)
FM-12
+0171    # elevation (meters)
99999
V020
320      # wind direction (degrees)
1        # quality code
N
0072
1
00450    # sky ceiling height (meters)
1        # quality code
C
N
010000   # visibility distance (meters)
1        # quality code
N
9
-0128    # air temperature (degrees Celsius x 10)
1        # quality code
-0139    # dew point temperature (degrees Celsius x 10)
1        # quality code
10268    # atmospheric pressure (hectopascals x 10)
1        # quality code
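Because every field in this record has a fixed width, a value can be recovered purely by its character position once the fields are joined into a single line. The sketch below is only an illustration: it rebuilds the sample record by concatenating the field values listed above and then slices out the year, air temperature, and temperature quality code with `cut`. The column offsets (16-19, 88-92, 93) are an assumption worked out by adding up the widths of the sample fields, not taken from the NCDC format documentation, and the variable names are hypothetical.

```shell
# Join the sample field values into one line with no delimiters.
record=$(printf '%s' 0057 332130 99999 19500101 0300 4 +51317 +028783 \
  FM-12 +0171 99999 V020 320 1 N 0072 1 00450 1 C N 010000 1 N 9 \
  -0128 1 -0139 1 10268 1)

# Slice fields by character position (offsets assumed from the field widths above).
year=$(printf '%s\n' "$record" | cut -c16-19)     # first 4 characters of the observation date
temp=$(printf '%s\n' "$record" | cut -c88-92)     # air temperature, degrees Celsius x 10
quality=$(printf '%s\n' "$record" | cut -c93)     # quality code for the temperature

echo "year=$year temp=$temp quality=$quality"
```

Here the temperature field `-0128` stands for -12.8 degrees Celsius, since the value is stored multiplied by 10.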
Note: In the actual file we'll be working on, the fields are packed into one line with no delimiters. Data files are organized by date and weather station. There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year.
The first entries for 1995:
% ls raw/1995 | head
010010-99999-1995.gz
010014-99999-1995.gz
010015-99999-1995.gz
010016-99999-1995.gz
010017-99999-1995.gz
010030-99999-1995.gz
010040-99999-1995.gz
010080-99999-1995.gz
010100-99999-1995.gz
010150-99999-1995.gz
There are so many weather stations that the whole dataset is made up of a large number of relatively small files. It is generally easier and more efficient to process a smaller number of relatively large files, so the data was preprocessed so that each year's readings were concatenated into a single file.
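As a rough illustration of that kind of concatenation, the sketch below decompresses every station file under a year directory and writes the result to one file per year. It is not necessarily how the original preprocessing was carried out. The `raw/<year>` input layout follows the listing above, while the `all/<year>` output location, the temporary working directory, and the two tiny stand-in station files are assumptions made so that the example runs without the real dataset.

```shell
# Hypothetical layout: raw/<year>/<station>-<year>.gz as input, all/<year> as output.
workdir=$(mktemp -d)
mkdir -p "$workdir/raw/1995" "$workdir/all"

# Two stand-in station files with placeholder records (not real NCDC data).
printf 'record-a1\nrecord-a2\n' | gzip > "$workdir/raw/1995/010010-99999-1995.gz"
printf 'record-b1\n' | gzip > "$workdir/raw/1995/010014-99999-1995.gz"

# For each year directory, decompress all station files into a single yearly file.
for yeardir in "$workdir"/raw/*/; do
  year=$(basename "$yeardir")
  gunzip -c "$yeardir"*.gz > "$workdir/all/$year"
done

# Count the records that ended up in the combined file.
lines=$(wc -l < "$workdir/all/1995" | tr -d ' ')
echo "all/1995 holds $lines records"
```

Since each line is one record, the line count of the combined file equals the total number of readings across all stations for that year.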