Data Analysis with Unix – Part 2

Last Updated : 30 Jul, 2019

Use of UNIX
Now let’s find the highest recorded global temperature in the dataset for each year using Unix tools.
The classic tool for processing line-oriented data is awk.

Small script to find the maximum temperature for each year in NCDC data

#!/usr/bin/env bash
for year in all/*
do
    # Print the year (the file name without .gz), followed by a tab
    echo -ne "$(basename "$year" .gz)\t"
    # Decompress the year's file and scan every record for its temperature
    gunzip -c "$year" | \
        awk '{ temp = substr($0, 88, 5) + 0;
               q = substr($0, 93, 1);
               if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
             END { print max }'
done

The script loops through the compressed year files, first printing the year and then processing each file with awk. The awk script extracts two fields from each record: the air temperature and the quality code. The air temperature value is turned into an integer by adding 0. Next, a test is applied to see whether the temperature is valid (the value 9999 signifies a missing value in the NCDC dataset) and whether the quality code indicates that the reading is not suspect or erroneous. If the reading is OK, the value is compared with the maximum value seen so far, which is updated if a new maximum is found. The END block is executed after all the lines in the file have been processed, and it prints the maximum value.
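The field extraction can be tried on a made-up record without downloading any data. This is a minimal sketch: `make_record` and `extract_fields` are hypothetical helper names, and the record is synthetic filler that only mimics the layout the script assumes (temperature in columns 88 to 92, quality code in column 93).

```shell
# Build a synthetic 93-character record: 87 filler characters, then a
# 5-character signed temperature and a 1-character quality code.
# (Hypothetical helper; real NCDC records carry many more fields.)
make_record() {
    # $1 = temperature field (e.g. +0317), $2 = quality code (e.g. 1)
    printf '%087d%s%s\n' 0 "$1" "$2"
}

# Extract the same two fields as the script above and print them.
extract_fields() {
    awk '{ temp = substr($0, 88, 5) + 0   # adding 0 converts "+0317" to 317
           q = substr($0, 93, 1)
           print temp, q }'
}

make_record +0317 1 | extract_fields
make_record -0011 4 | extract_fields
make_record +9999 9 | extract_fields   # 9999 marks a missing reading
```

The third record is the kind that the `temp != 9999` test in the script filters out.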

Beginning of the output of a run:

% ./max_temperature.sh
1901    317
1902    244
1903    289
1904    256

The temperature values in the source file are scaled by a factor of 10, so this works out as a maximum temperature of 31.7°C for 1901 (there were very few readings at the beginning of the century, so this is plausible). The complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large instance.
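If degrees Celsius are preferred in the output, the scaling can be undone with one more awk step. The following is only a sketch; `to_celsius` is a hypothetical helper name, and it assumes tab-separated "year, value" lines like the ones shown above.

```shell
# Convert "year<TAB>tenths of a degree" lines to "year<TAB>degrees Celsius".
to_celsius() {
    awk -F'\t' '{ printf "%s\t%.1f\n", $1, $2 / 10 }'
}

printf '1901\t317\n1902\t244\n' | to_celsius
```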
To speed up the processing, we need to run parts of the program in parallel. In theory, this is straightforward: we could process different years in different processes, using all the available hardware threads on a machine.
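One possible way to process different years in different processes is to let `xargs` start several workers at once. This is a sketch under stated assumptions, not the method used for the timing above: `MAX_AWK` and `max_all_years` are hypothetical names, and the `-print0`, `-0` and `-P` options are GNU/BSD extensions that may not exist on every system.

```shell
# The same awk program as in the script above, kept in an exported variable
# so that the child shells started by xargs can reuse it.
MAX_AWK='{ temp = substr($0, 88, 5) + 0
           q = substr($0, 93, 1)
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
export MAX_AWK

# max_all_years DIR JOBS (hypothetical helper)
# Runs up to JOBS year files from DIR at the same time. Each worker prints
# one complete "year<TAB>max" line, and the lines are sorted by year at the end.
max_all_years() {
    find "$1" -name '*.gz' -print0 |
        xargs -0 -n 1 -P "$2" sh -c '
            year=$(basename "$1" .gz)
            max=$(gunzip -c "$1" | awk "$MAX_AWK")
            printf "%s\t%s\n" "$year" "$max"
        ' _ |
        sort -n
}

# Usage, assuming the year files live in ./all:
#   max_all_years all 4
```

Each worker builds its whole output line before printing it, so lines from different years are unlikely to be interleaved; the final `sort -n` restores the year order that parallel execution does not guarantee.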

Problems –

First, dividing the work into equal-size pieces isn’t always easy or obvious. In this case, the file size for different years varies widely, so some processes will finish much earlier than others. Even if they pick up further work, the whole run is dominated by the longest file. A better approach, although one that requires more work, is to split the input into fixed-size chunks and assign each chunk to a process.
Second, combining the results from independent processes may require further processing. In this case, the result for each year is independent of other years, so the results can be combined by concatenating them and sorting by year. With the fixed-size chunk approach, the combination is more delicate. For this example, data for a particular year will typically be split into several chunks, each processed independently. We’ll end up with the maximum temperature for each chunk, so the final step is to look for the highest of these maximums for each year.
Third, you are still limited by the processing capacity of a single machine. If the best time you can achieve is 20 minutes with the number of processors you have, then that’s it. You can’t make it go any faster. Also, some datasets grow beyond the capacity of a single machine. When we start using multiple machines, a whole host of other factors come into play, mainly falling into the categories of coordination and reliability.
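The combining step from the second point, finding the highest of the per-chunk maximums for each year, can be sketched with awk as well. `combine_maxima` is a hypothetical helper name, and the input lines are invented values that stand in for the output of independently processed chunks.

```shell
# combine_maxima: read "year<TAB>max" lines (one per chunk) on stdin and
# print the highest value seen for each year, sorted by year.
combine_maxima() {
    awk -F'\t' '!($1 in max) || $2 + 0 > max[$1] { max[$1] = $2 + 0 }
                END { for (y in max) printf "%s\t%d\n", y, max[y] }' |
        sort -n
}

# Four chunk results covering two years (made-up numbers):
printf '1901\t289\n1902\t244\n1901\t317\n1902\t150\n' | combine_maxima
```

The first value seen for a year is always stored, so the result stays correct even when every reading for that year is negative.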
