To see how data analysis with Unix tools works, this article uses a weather dataset.
Weather sensors collect data continuously at many locations across the globe and gather a large volume of log data. This data is a good candidate for analysis with MapReduce because all of it needs to be processed, and it is record-oriented and semi-structured.
The data comes from the National Climatic Data Center (NCDC). It is stored in a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or of variable length. For simplicity, we focus on the basic elements, such as temperature, which are always present and of fixed width.
Structure of NCDC record
0057
332130   # USAF weather station identifier
99999    # WBAN weather station identifier
19500101 # observation date
0300     # observation time
4
+51317   # latitude (degrees x 1000)
+028783  # longitude (degrees x 1000)
FM-12
+0171    # elevation (meters)
99999
V020
320      # wind direction (degrees)
1        # quality code
N
0072
1
00450    # sky ceiling height (meters)
1        # quality code
C
N
010000   # visibility distance (meters)
1        # quality code
N
9
-0128    # air temperature (degrees Celsius x 10)
1        # quality code
-0139    # dew point temperature (degrees Celsius x 10)
1        # quality code
10268    # atmospheric pressure (hectopascals x 10)
1        # quality code
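Because the fields are fixed width, a value can be pulled out of a packed record by character position alone. The following is a minimal sketch, not a definitive parser: it rebuilds the sample record by concatenating the fields listed above, and the offsets (station at 5, date at 16, temperature at 88, quality code at 93) are assumptions obtained by summing those field widths. `parse_record` is a hypothetical helper name.

```shell
#!/bin/sh
# Rebuild the packed sample record by concatenating the fields listed above.
record=$(printf '%s' 0057 332130 99999 19500101 0300 4 +51317 +028783 FM-12 \
  +0171 99999 V020 320 1 N 0072 1 00450 1 C N 010000 1 N 9 -0128 1 -0139 1 10268 1)

# parse_record (hypothetical helper): reads packed records on stdin and prints
# station, date, air temperature in degrees Celsius, and its quality code.
# Offsets are 1-based and assumed from the field widths in the listing.
parse_record() {
  awk '{
    station = substr($0, 5, 6)        # USAF weather station identifier
    date    = substr($0, 16, 8)       # observation date
    temp    = substr($0, 88, 5) + 0   # air temperature, tenths of a degree
    quality = substr($0, 93, 1)       # quality code for the temperature
    printf "%s %s %.1f %s\n", station, date, temp / 10, quality
  }'
}

printf '%s\n' "$record" | parse_record   # prints: 332130 19500101 -12.8 1
```

Adding 0 to the extracted substring forces awk to treat the signed, zero-padded text as a number, so `-0128` becomes -128 before it is scaled to degrees.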
Note: In the actual files, the fields are packed into one line with no delimiters. Data files are organized by date and weather station. There is a directory for every year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year.
First entries for 1995:
% ls raw/1995 | head
010010-99999-1995.gz
010014-99999-1995.gz
010015-99999-1995.gz
010016-99999-1995.gz
010017-99999-1995.gz
010030-99999-1995.gz
010040-99999-1995.gz
010080-99999-1995.gz
010100-99999-1995.gz
010150-99999-1995.gz
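Since each station file is gzipped, it can be inspected by decompressing to a pipe rather than unpacking it on disk. The sketch below assumes the `raw/<year>/` layout described above; it builds a tiny stand-in tree in a temporary directory with made-up records, and `count_records` is a hypothetical helper name.

```shell
#!/bin/sh
# Build a tiny stand-in for the raw/<year>/ layout in a temporary directory.
# The file names come from the listing above; the records are dummies.
demo=$(mktemp -d)
mkdir -p "$demo/raw/1995"
printf 'record-a\nrecord-b\n' | gzip > "$demo/raw/1995/010010-99999-1995.gz"
printf 'record-c\n'           | gzip > "$demo/raw/1995/010014-99999-1995.gz"

# count_records (hypothetical helper): print "<file> <record count>" for each
# gzipped station file in the given year directory.
count_records() {
  for f in "$1"/*.gz; do
    # gunzip -c writes the decompressed records to stdout, one per line.
    printf '%s %s\n' "$(basename "$f")" "$(gunzip -c "$f" | wc -l | tr -d ' ')"
  done
}

count_records "$demo/raw/1995"
```

On the real dataset the same loop would be pointed at an actual year directory, for example `count_records raw/1995`.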
There are so many weather stations that the whole dataset is made up of a large number of relatively small files. It is generally easier and more efficient to process a smaller number of relatively large files, so the data was preprocessed so that each year's readings were concatenated into a single file.
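One way to do this kind of per-year concatenation with ordinary Unix tools is sketched below. It is an illustration under stated assumptions, not necessarily how the dataset was actually preprocessed: the `all/` output directory and the `merge_years` helper are hypothetical names, and the input is a small made-up tree in a temporary directory.

```shell
#!/bin/sh
# Build a small stand-in tree: raw/<year>/<station>.gz with dummy records.
work=$(mktemp -d)
mkdir -p "$work/raw/1994" "$work/raw/1995" "$work/all"
printf 'a1\na2\n' | gzip > "$work/raw/1994/010010-99999-1994.gz"
printf 'b1\n'     | gzip > "$work/raw/1995/010010-99999-1995.gz"
printf 'b2\nb3\n' | gzip > "$work/raw/1995/010014-99999-1995.gz"

# merge_years (hypothetical helper): for every year directory under $1/raw,
# decompress all station files and recompress them as one file $1/all/<year>.gz.
merge_years() {
  for dir in "$1"/raw/*/; do
    year=$(basename "$dir")
    gunzip -c "$dir"*.gz | gzip > "$1/all/$year.gz"
  done
}

merge_years "$work"
gunzip -c "$work/all/1995.gz"   # all 1995 records, station by station
```

Each year then becomes a single gzipped file that can be streamed through tools such as awk in one pass.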