Data Analysis with Unix – Part 1

To understand how to analyze data with Unix tools, we will use a weather dataset. Weather sensors collect readings continuously at many locations across the globe, producing a large volume of log data. This is a good candidate for analysis with MapReduce because we want to process all of the data, and the data is record-oriented and semi-structured.

The data used here is from the National Climatic Data Center (NCDC). It is stored in a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or of variable length. For simplicity, we focus on the basic elements, such as temperature, which are always present and of fixed width.
Structure of an NCDC record

332130       # USAF weather station identifier
99999        # WBAN weather station identifier
19500101     # observation date
0300         # observation time
+51317       # latitude (degrees x 1000)
+028783      # longitude (degrees x 1000)
+0171        # elevation (meters)
320          # wind direction (degrees)
1            # quality code
N
0072
1
00450        # sky ceiling height (meters)
1            # quality code
010000       # visibility distance (meters)
1            # quality code
-0128        # air temperature (degrees Celsius x 10)
1            # quality code
-0139        # dew point temperature (degrees Celsius x 10)
1            # quality code
10268        # atmospheric pressure (hectopascals x 10)
1            # quality code

Note – In the actual file, the fields are packed into one line with no delimiters. Data files are organized by date and weather station: there is a directory for each year from 1901 to 2001, each containing a gzipped file for every weather station with its readings for that year.
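Because the fields are packed with no delimiters, each one must be extracted by character position. The sketch below pulls out the year and air temperature with awk. The sample line is assembled from the annotated record above (the leading length prefix and a few filler fields are assumptions), and the offsets (columns 16–19 for the year, 88–92 for the signed temperature, 93 for its quality code) are the commonly used fixed-width positions; verify them against your copy of the data.

```shell
# A packed record assembled from the annotated fields above
# (the "0057" length prefix and some filler fields are assumptions).
sample="0057332130999991950010103004+51317+028783FM-12+017199999V0203201N00721004501CN0100001N9-01281-01391102681"

echo "$sample" | awk '{
  year = substr($0, 16, 4);      # observation year
  temp = substr($0, 88, 5) + 0;  # air temperature x 10; "+ 0" forces numeric
  q    = substr($0, 93, 1);      # quality code
  if (temp != 9999 && q ~ /[01459]/)   # skip missing or suspect readings
      print year, temp / 10 " C"
}'
# prints: 1950 -12.8 C
```

The `9999` sentinel and the quality-code filter mirror the checks typically applied to NCDC temperature fields.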

The first entries for 1990:

% ls raw/1990 | head

There are tens of thousands of weather stations, so the whole dataset is made up of a large number of relatively small files. It is generally easier and more efficient to process a smaller number of relatively large files, so the data was preprocessed so that each year's readings were concatenated into a single file.
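This preprocessing step can be sketched as a small shell loop, assuming the per-station files live under raw/<year>/ (as in the ls command above) and the concatenated output goes to a hypothetical all/ directory. It relies on the fact that concatenated gzip members decompress as a single continuous stream:

```shell
#!/usr/bin/env bash
# Concatenate every station's gzipped readings into one file per year.
# raw/<year>/ holds the per-station .gz files; all/ is an assumed
# output directory, not part of the original dataset layout.
mkdir -p all
for yeardir in raw/*; do
  year=$(basename "$yeardir")
  # gzip files may be concatenated directly: gunzip reads the
  # members back as one stream
  cat "$yeardir"/*.gz > "all/$year.gz"
done
```

The result is one readable archive per year, which is a much better fit for tools that scan whole files sequentially.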
