Data Analysis with Unix – Part 1

Last Updated : 14 Jul, 2019

To illustrate data analysis with Unix tools, this article uses a weather dataset.
Weather sensors collect readings at regular intervals at many locations around the globe and produce a large volume of log data. This makes the dataset a good candidate for analysis with MapReduce, because all of the data needs to be processed and the data is record-oriented and semi-structured.

The data comes from the National Climatic Data Center (NCDC). It is stored in a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths. For simplicity, we focus on the basic elements, such as temperature, which are always present and are of fixed width.
Structure of NCDC record

0057
332130       # USAF weather station identifier
99999        # WBAN weather station identifier
19500101     # observation date
0300         # observation time
4 
+51317       # latitude (degrees x 1000)
+028783      # longitude (degrees x 1000)
FM-12
+0171        # elevation (meters)
99999
V020
320          # wind direction (degrees)
1            # quality code
N
0072
1
00450        # sky ceiling height (meters)
1            # quality code
C
N 
010000       # visibility distance (meters)
1            # quality code
N
9 
-0128        # air temperature (degrees Celsius x 10)
1            # quality code
-0139        # dew point temperature (degrees Celsius x 10)
1            # quality code
10268        # atmospheric pressure (hectopascals x 10)
1            # quality code

Note – In the actual files, the fields are packed into one line with no delimiters; they are shown on separate lines here only for readability. Data files are organized by date and weather station. There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year.
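
Because the fields have no delimiters, each one has to be located by character position. The following is a minimal sketch, assuming the fields appear in exactly the order and widths listed in the sample record above. The character positions used with `cut` are derived by adding up those widths. They are an assumption for illustration and should be checked against the real files before use.

```shell
# Rebuild the sample record by concatenating the listed fields in order.
# printf reuses the '%s' format for every argument, so nothing is inserted between them.
record=$(printf '%s' 0057 332130 99999 19500101 0300 4 +51317 +028783 FM-12 \
  +0171 99999 V020 320 1 N 0072 1 00450 1 C N 010000 1 N 9 \
  -0128 1 -0139 1 10268 1)

# 1-based character positions, assumed from the field widths of the sample record.
year=$(printf '%s\n' "$record" | cut -c16-19)     # first four digits of the observation date
temp=$(printf '%s\n' "$record" | cut -c88-92)     # air temperature, degrees Celsius x 10
quality=$(printf '%s\n' "$record" | cut -c93)     # quality code for the temperature

echo "year=$year temp_tenths=$temp quality=$quality"
```

For the sample record this prints `year=1950 temp_tenths=-0128 quality=1`, that is, an air temperature of -12.8 degrees Celsius.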

The first entries for 1995:

% ls raw/1995 | head
010010-99999-1995.gz
010014-99999-1995.gz
010015-99999-1995.gz
010016-99999-1995.gz
010017-99999-1995.gz
010030-99999-1995.gz
010040-99999-1995.gz
010080-99999-1995.gz
010100-99999-1995.gz
010150-99999-1995.gz
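
The file names in the listing appear to follow the pattern USAF identifier, WBAN identifier, and year, separated by hyphens. As a small sketch under that assumption, the parts can be separated with plain shell parameter expansion. The function name `parse_station_file` is hypothetical and used only for illustration.

```shell
# Split a station file name of the assumed form USAF-WBAN-YEAR.gz into its parts.
parse_station_file() {
  name=${1##*/}        # drop any leading directory
  name=${name%.gz}     # drop the .gz suffix
  usaf=${name%%-*}     # text before the first hyphen
  year=${name##*-}     # text after the last hyphen
  wban=${name#*-}      # drop the leading "USAF-"
  wban=${wban%-*}      # drop the trailing "-YEAR"
  printf '%s %s %s\n' "$usaf" "$wban" "$year"
}

parse_station_file raw/1995/010010-99999-1995.gz
```

For the first file in the listing this prints `010010 99999 1995`.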

There are a great many weather stations, so the whole dataset is made up of a large number of relatively small files. It is generally easier and more efficient to process a smaller number of relatively large files, so the data was preprocessed so that each year's readings were concatenated into a single file.
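
One way to do this kind of concatenation with standard Unix tools is sketched below. This is not necessarily how the dataset was actually preprocessed. The function name `combine_year` and the output location are hypothetical, and the sketch assumes every station file in the year directory is a valid gzip file.

```shell
# combine_year YEAR_DIR OUT_FILE
# Decompress every station file in YEAR_DIR and recompress the
# combined stream of records into the single file OUT_FILE.
combine_year() {
  year_dir=$1
  out_file=$2
  for f in "$year_dir"/*.gz; do
    [ -e "$f" ] || continue      # skip the unexpanded pattern if the directory is empty
    gunzip -c "$f"
  done | gzip -c > "$out_file"
}

# Example call (paths are illustrative): combine_year raw/1995 all/1995.gz
```

The station files are read in the shell's sorted glob order, so the records in the combined file stay grouped by station.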
