
Data Analysis with Unix – Part 2

Use of UNIX
Now let's find the highest recorded global temperature in the dataset for each year, using only Unix tools.
The classic tool for processing line-oriented data is awk.

A small script to find the maximum temperature for each year in the NCDC data:




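#!/usr/bin/env bash
# Print the maximum valid temperature for each year file under all/
for year in all/*
do
    # print the year (the file name without .gz), followed by a tab
    printf '%s\t' "$(basename "$year" .gz)"
    # decompress the year file and scan every record with awk
    gunzip -c "$year" | \
        awk '{ temp = substr($0, 88, 5) + 0;   # air temperature, as a number
               q = substr($0, 93, 1);          # quality code
               if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
             END { print max }'
done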

The script loops through the compressed year files, first printing the year and then processing each file with awk. The awk script extracts two fields from each record: the air temperature and the quality code. The air temperature value is converted to an integer by adding 0. Next, a test checks whether the temperature is valid (the value 9999 signifies a missing value in the NCDC dataset) and whether the quality code indicates that the reading is not suspect or erroneous. If the reading is OK, the value is compared with the maximum seen so far, which is updated if a new maximum is found. The END block is executed after all the lines in the file have been processed, and it prints the maximum value.
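To make the field extraction and the validity test concrete, here is a minimal sketch that runs the same awk logic on a few synthetic records. The helper names `make_record` and `max_temp` and the filler layout (87 zeros before the temperature field) are assumptions made for illustration; real NCDC records carry other data in those columns.

```shell
# Build a synthetic fixed-width record (hypothetical layout for this demo):
# 87 filler characters, then a 5-character signed temperature (columns 88-92),
# then a 1-character quality code (column 93).
make_record() {
    printf '%087d%s%s\n' 0 "$1" "$2"
}

# Same awk logic as the script above: keep the largest valid reading.
max_temp() {
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
}

result=$(
    {
        make_record "+0317" 1   # valid reading: 31.7 degrees C
        make_record "+9999" 1   # missing value, skipped
        make_record "+0400" 2   # quality code 2 is not in [01459], skipped
        make_record "-0011" 1   # valid, but lower than the current maximum
    } | max_temp
)
echo "$result"
```

Only the first and last records pass both tests, so the sketch should print 317. One thing to keep in mind: `max` starts out unset, which awk treats as 0 in the comparison, so a file containing only sub-zero readings would print an empty value.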



Output of a run (beginning of the output):

% ./max_temperature.sh
1901    317
1902    244
1903    289
1904    256

The temperature values in the source file are scaled by a factor of 10, so this works out as a maximum temperature of 31.7°C for 1901 (there were very few readings at the beginning of the century, so this is plausible). The complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large instance.
To speed up the processing, we need to run parts of the program in parallel. In theory, this is straightforward: we could process different years in different processes, using all the available hardware threads on a machine.
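As a rough illustration of the idea, the sketch below hands each year file to its own worker process with `xargs -P`. This is one possible approach, not necessarily what the article has in mind: the names `AWK_MAX` and `parallel_max` are hypothetical, the `-P` and `-0` options are assumed to be available (they are in GNU and BSD xargs, but not required by POSIX), and the demo data is synthetic.

```shell
# The awk program from the serial script, exported so worker shells can read it.
AWK_MAX='{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
export AWK_MAX

# Process every .gz file in directory $1, one worker process per file,
# with as many workers at once as there are online processors.
parallel_max() {
    # Assumption: fall back to 2 workers if getconf cannot report a count.
    nprocs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)
    find "$1" -type f -name '*.gz' -print0 |
        xargs -0 -n 1 -P "$nprocs" sh -c '
            # Compute first, then emit one short line per file so that
            # output from different workers does not interleave.
            max=$(gunzip -c "$1" | awk "$AWK_MAX")
            printf "%s\t%s\n" "$(basename "$1" .gz)" "$max"
        ' _ |
        sort    # workers finish in any order, so restore year order
}

# Demo on two tiny synthetic year files (87 filler characters, then
# temperature and quality code); the layout is invented for this demo.
demo=$(mktemp -d)
printf '%087d+0317%s\n%087d+9999%s\n' 0 1 0 1 | gzip > "$demo/1901.gz"
printf '%087d+0244%s\n%087d-0050%s\n' 0 1 0 1 | gzip > "$demo/1902.gz"
parallel_max "$demo"
rm -r "$demo"
```

On the synthetic data this should report 317 for 1901 and 244 for 1902. Each worker handles a whole file, so the total run time is still bounded by the largest year file.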



Problems –
