External sorting is a class of sorting algorithms that can handle massive amounts of data. It is required when the data being sorted does not fit into the main memory of a computing device (usually RAM) and must instead reside in slower external memory (usually a hard drive). External sorting typically uses a hybrid sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to temporary files. In the merge phase, the sorted sub-files are combined into a single larger file.
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM and then merges the sorted chunks together. First, the file is divided into runs, each small enough to fit into main memory. Each run is then sorted in main memory using merge sort. Finally, the resulting runs are merged into successively bigger runs until the whole file is sorted.
Below are the parameters and steps used in the C++ implementation.
- input_file: name of the input file (input.txt)
- output_file: name of the output file (output.txt)
- run_size: size of a run (must fit in RAM)
- num_ways: number of runs to be merged
The idea is simple: the elements cannot all be sorted at once because the data is too large to hold in memory. So the data is divided into chunks, each chunk is sorted using merge sort, and each sorted chunk is written to its own file. Once the individual chunks are sorted, the whole data set is sorted by combining them, using the idea of merging k sorted arrays.
- Read input_file so that at most run_size elements are read at a time. Do the following for every run read into an array:
- Sort the run using merge sort.
- Store the sorted array in a file, say file i for the i-th run.
- Merge the sorted files using the approach for merging k sorted arrays.
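The last step relies on merging k sorted sequences. As a minimal in-memory sketch of that idea (not the article's own listing), the function below uses a min-heap that always holds the smallest unread element of each input. The names `HeapEntry` and `mergeKSorted` are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// One heap entry: a value plus where it came from, so the next
// element of the same source array can be pushed after it is popped.
struct HeapEntry {
    int value;
    std::size_t src;  // index of the source array
    std::size_t pos;  // position of the value inside that array
    bool operator>(const HeapEntry& other) const { return value > other.value; }
};

// Merge k already-sorted vectors into one sorted vector.
std::vector<int> mergeKSorted(const std::vector<std::vector<int>>& arrays) {
    // std::greater turns the default max-heap into a min-heap.
    std::priority_queue<HeapEntry, std::vector<HeapEntry>,
                        std::greater<HeapEntry>> heap;

    // Seed the heap with the first element of every non-empty array.
    for (std::size_t i = 0; i < arrays.size(); ++i)
        if (!arrays[i].empty())
            heap.push({arrays[i][0], i, 0});

    std::vector<int> merged;
    while (!heap.empty()) {
        HeapEntry top = heap.top();
        heap.pop();
        merged.push_back(top.value);
        // Replace the popped element with its successor, if there is one.
        if (top.pos + 1 < arrays[top.src].size())
            heap.push({arrays[top.src][top.pos + 1], top.src, top.pos + 1});
    }
    return merged;
}
```

The heap never holds more than k entries, so each element costs O(log k) to place. In external sorting the same pattern applies, with each array replaced by a sorted run file that is read one element at a time.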
Following is a C++ implementation of the above steps.
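A minimal sketch of these steps is given here under several assumptions: the input is a text file of whitespace-separated integers, `std::sort` stands in for a hand-written merge sort, all runs are merged in a single pass (so num_ways equals the number of runs), and run files use the hypothetical names `run_0.tmp`, `run_1.tmp`, and so on. The function names are illustrative, and the driver that generates a sample input is omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Phase 1: read at most run_size integers at a time, sort them in memory,
// and write each sorted run to its own temporary file.
// run_size is assumed to be positive. Returns the run file names.
std::vector<std::string> createInitialRuns(const std::string& input_file,
                                           std::size_t run_size) {
    std::ifstream in(input_file);
    std::vector<std::string> run_files;
    std::vector<int> buffer;
    buffer.reserve(run_size);

    // Sort the buffered run and write it out (hypothetical naming scheme).
    auto flush = [&]() {
        if (buffer.empty()) return;
        std::sort(buffer.begin(), buffer.end());
        std::string name = "run_" + std::to_string(run_files.size()) + ".tmp";
        std::ofstream out(name);
        for (int v : buffer) out << v << '\n';
        run_files.push_back(name);
        buffer.clear();
    };

    int value;
    while (in >> value) {
        buffer.push_back(value);
        if (buffer.size() == run_size) flush();
    }
    flush();  // last, possibly shorter, run
    return run_files;
}

// Phase 2: k-way merge of the run files using a min-heap that holds
// one (value, run index) pair per run that still has unread data.
void mergeRuns(const std::vector<std::string>& run_files,
               const std::string& output_file) {
    std::vector<std::ifstream> runs;
    for (const auto& name : run_files) runs.emplace_back(name);

    using Entry = std::pair<int, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    int value;
    for (std::size_t i = 0; i < runs.size(); ++i)
        if (runs[i] >> value) heap.emplace(value, i);

    std::ofstream out(output_file);
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        out << top.first << '\n';
        // Refill the heap from the run the smallest value came from.
        if (runs[top.second] >> value) heap.emplace(value, top.second);
    }

    runs.clear();  // close the run files before deleting them
    for (const auto& name : run_files) std::remove(name.c_str());
}

// Sort input_file into output_file while holding at most run_size
// elements (plus one element per run) in memory.
void externalSort(const std::string& input_file,
                  const std::string& output_file,
                  std::size_t run_size) {
    mergeRuns(createInitialRuns(input_file, run_size), output_file);
}
```

A call such as `externalSort("input.txt", "output.txt", 1000)` would sort the file in runs of 1000 elements. Because this sketch opens every run file at once, a very large input with a small run_size could exceed the operating system's limit on open files; merging in several passes of num_ways runs each avoids that.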
- Time Complexity: O(n log n).
Each run holds at most run_size elements, so sorting one run with merge sort takes O(run_size log run_size). There are about n / run_size runs, so the sorting phase takes O(n log run_size) in total. Merging num_ways sorted runs with a min-heap costs O(log num_ways) per element, or O(n log num_ways) overall. When all runs are merged together, num_ways = n / run_size and the two terms add up to O(n log n).
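- Auxiliary space: O(run_size + num_ways).
An array of run_size elements is needed to hold one run in memory while it is sorted. The merge phase additionally keeps one element per run in a heap, which takes O(num_ways) space.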
Note: The complete program won’t work on an online compiler, as it requires file creation permissions. When run on a local machine, it produces a sample input file “input.txt” with 10000 random numbers. It sorts the numbers and writes the sorted numbers to a file “output.txt”. It also generates files named 1, 2, ... to store the sorted runs.
This article is contributed by Aditya Goel.
Improved By : andrew1234