The Multistage Algorithm in Data Analytics
• Last Updated : 29 Oct, 2020

In this article, we discuss the multistage algorithm in data analytics in detail, including how each of its passes works.

The Multistage Algorithm :
The Multistage Algorithm is an improved version of the PCY algorithm that uses several successive hash tables to further reduce the number of candidate pairs. The trade-off between the two algorithms is that Multistage takes more than two passes to find the frequent pairs.

Working of multistage algorithm :

• First Pass :
The first pass of Multistage is identical to the first pass of PCY. After that pass, the frequent buckets are identified and summarized by a bitmap, again the same as in PCY. Unlike PCY, however, the second pass of Multistage does not count the candidate pairs. Instead, it uses the available main memory for another hash table, built with a different hash function. Since the bitmap obtained from the first hash table takes up only 1/32 of the available main memory, the second hash table has almost as many buckets as the first.
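
The first pass described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes items are integer IDs, and the hash function `hash1`, the function name `first_pass`, and the sample baskets are all made up for this example.

```python
from collections import Counter
from itertools import combinations


def hash1(i, j, num_buckets):
    """Hypothetical pair hash for integer item IDs (an assumption, not from the article)."""
    return (31 * i + j) % num_buckets


def first_pass(baskets, support, num_buckets):
    """Pass 1 (same as PCY): count single items and hash every pair to a bucket."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))  # drop duplicates and fix the pair order i < j
        item_counts.update(items)
        for i, j in combinations(items, 2):
            bucket_counts[hash1(i, j, num_buckets)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    # Summarize the hash table as a bitmap: one flag per bucket.
    bitmap1 = [c >= support for c in bucket_counts]
    return frequent_items, bitmap1


# Made-up sample data for illustration only.
baskets = [[1, 2, 3], [1, 2, 3], [1, 2], [1, 2, 4], [2, 3], [1, 3]]
frequent_items, bitmap1 = first_pass(baskets, support=4, num_buckets=11)
print(frequent_items)
print(bitmap1[hash1(1, 2, 11)])
```

Only the set of frequent items and the bitmap need to survive into the next pass; the full bucket counts can be discarded.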
• Second Pass :
On the second pass of Multistage, we again go through the file of baskets. There is no need to count the items again, since we already have those counts from the first pass. The second pass uses the additional hash table to reduce the number of candidate pairs.

However, we must retain the information about which items are frequent, since we need it on both the second and third passes. During the second pass, we hash only certain pairs of items to buckets of the second hash table.

In this second pass, a pair is hashed only if it meets the two criteria for being counted in the second pass of PCY. That is, we hash {i, j} if and only if i and j are each frequent items, and the pair hashed to a frequent bucket on the first pass.

As a result, the sum of the counts in the second hash table should be significantly less than the sum for the first pass. Consequently, even though the second hash table has only 31/32 of the number of buckets that the first table has, we expect there to be many fewer frequent buckets in the second hash table than in the first.
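
A rough sketch of this second pass is shown below, under the same assumptions as before: integer item IDs, and hypothetical hash functions `hash1` and `hash2` that are not from the article. The first-pass bitmap here is set by hand to imitate a case where the bucket of {1, 3} became frequent only because of collisions.

```python
from itertools import combinations


def hash1(i, j, num_buckets):
    """Hypothetical first-pass pair hash (an assumption for this sketch)."""
    return (31 * i + j) % num_buckets


def hash2(i, j, num_buckets):
    """Hypothetical second, different pair hash (also an assumption)."""
    return (17 * i + 13 * j) % num_buckets


def second_pass(baskets, frequent_items, bitmap1, support, num_buckets2):
    """Pass 2: hash a pair to the second table only if it survives the PCY tests."""
    bucket_counts2 = [0] * num_buckets2
    for basket in baskets:
        # Keep only frequent items, so both i and j are frequent.
        items = sorted(set(basket) & frequent_items)
        for i, j in combinations(items, 2):
            # The pair must have hashed to a frequent bucket on the first pass.
            if bitmap1[hash1(i, j, len(bitmap1))]:
                bucket_counts2[hash2(i, j, num_buckets2)] += 1
    return [c >= support for c in bucket_counts2]


# Made-up sample data; bitmap1 is a hand-made stand-in for a first-pass result.
baskets = [[1, 2, 3], [1, 2, 3], [1, 2], [1, 2, 4], [2, 3], [1, 3]]
frequent_items = {1, 2, 3}
bitmap1 = [False] * 11
bitmap1[hash1(1, 2, 11)] = True
bitmap1[hash1(1, 3, 11)] = True
bitmap2 = second_pass(baskets, frequent_items, bitmap1, support=4, num_buckets2=7)
print(bitmap2[hash2(1, 2, 7)], bitmap2[hash2(1, 3, 7)])
```

In this toy run the pair {1, 3} passes the first bitmap but lands in an infrequent bucket of the second table, which is exactly the kind of pair the extra pass is meant to eliminate.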

• Final Pass :
After the second pass, the second hash table is also summarized as a bitmap, and that bitmap is stored in main memory. The two bitmaps together take up slightly less than 1/16 of the available main memory, so there is still plenty of space to count the candidate pairs on the third pass.
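
The "slightly less than 1/16" figure follows from simple arithmetic. As a sketch, write M for the available main memory (a symbol introduced here, not in the article) and assume each bitmap is 1/32 the size of the hash table it summarizes, as the 1/32 figure above suggests. The second hash table occupies the 31/32 of memory left over by the first bitmap, so:

```latex
\[
\underbrace{\frac{M}{32}}_{\text{first bitmap}}
+ \underbrace{\frac{1}{32}\cdot\frac{31M}{32}}_{\text{second bitmap}}
= \frac{32M + 31M}{1024}
= \frac{63M}{1024}
< \frac{64M}{1024}
= \frac{M}{16}.
\]
```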

A pair {i, j} is in C2 if and only if:

1. Both i and j are frequent items.
2. The pair {i, j} hashed to a frequent bucket in the first hash table.
3. The pair {i, j} hashed to a frequent bucket in the second hash table.
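
The three conditions above translate almost directly into a membership test for C2, followed by the final counting pass. The sketch below assumes integer item IDs and uses hypothetical hash functions `hash1` and `hash2`; both bitmaps are set by hand to stand in for the results of the earlier passes.

```python
from collections import Counter
from itertools import combinations


def hash1(i, j, num_buckets):
    """Hypothetical first pair hash (an assumption for this sketch)."""
    return (31 * i + j) % num_buckets


def hash2(i, j, num_buckets):
    """Hypothetical second pair hash (also an assumption)."""
    return (17 * i + 13 * j) % num_buckets


def is_candidate(i, j, frequent_items, bitmap1, bitmap2):
    """True if {i, j} satisfies all three conditions for membership in C2."""
    return (
        i in frequent_items and j in frequent_items   # condition 1
        and bitmap1[hash1(i, j, len(bitmap1))]        # condition 2
        and bitmap2[hash2(i, j, len(bitmap2))]        # condition 3
    )


def third_pass(baskets, frequent_items, bitmap1, bitmap2, support):
    """Final pass: count only candidate pairs and keep the frequent ones."""
    pair_counts = Counter()
    for basket in baskets:
        for i, j in combinations(sorted(set(basket)), 2):
            if is_candidate(i, j, frequent_items, bitmap1, bitmap2):
                pair_counts[(i, j)] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= support}


# Made-up sample data; both bitmaps are hand-made stand-ins for earlier passes.
baskets = [[1, 2, 3], [1, 2, 3], [1, 2], [1, 2, 4], [2, 3], [1, 3]]
frequent_items = {1, 2, 3}
bitmap1 = [False] * 11
bitmap1[hash1(1, 2, 11)] = True
bitmap1[hash1(1, 3, 11)] = True
bitmap2 = [False] * 7
bitmap2[hash2(1, 2, 7)] = True
frequent_pairs = third_pass(baskets, frequent_items, bitmap1, bitmap2, support=4)
print(frequent_pairs)
```

Here {1, 3} satisfies the first two conditions but fails the third, so it is never counted, while {2, 3} already fails the second.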

• The third condition is the difference between Multistage and PCY :
It should be clear that any number of passes can be inserted between the first and last passes of the Multistage algorithm. The limiting factor is that each pass must store the bitmaps from all of the preceding passes, so eventually there is not enough space left in main memory to do the counts. No matter how many passes we use, the truly frequent pairs will always hash to a frequent bucket, so there is no way to avoid counting them.
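
To illustrate the idea of inserting any number of intermediate passes, here is a minimal sketch that takes one hash table description per stage and keeps every earlier bitmap around. It assumes integer item IDs; the function names and the `(a, b, num_buckets)` stage format are hypothetical choices for this example. It does not model the memory limit itself, only the filtering logic.

```python
from collections import Counter
from itertools import combinations


def pair_hash(i, j, a, b, num_buckets):
    """Hypothetical family of pair hashes, selected by the multipliers a and b."""
    return (a * i + b * j) % num_buckets


def passes_filters(i, j, frequent_items, kept):
    """Both items frequent, and the pair hits a frequent bucket in every kept bitmap."""
    return (
        i in frequent_items
        and j in frequent_items
        and all(bitmap[pair_hash(i, j, a, b, len(bitmap))] for a, b, bitmap in kept)
    )


def multistage(baskets, support, stages):
    """stages: list of (a, b, num_buckets), one entry per hash table."""
    if not stages:
        raise ValueError("need at least one hash table")
    item_counts = Counter()
    frequent_items = set()
    kept = []  # bitmaps from all earlier passes must stay in memory
    for stage_no, (a, b, num_buckets) in enumerate(stages):
        buckets = [0] * num_buckets
        for basket in baskets:
            items = sorted(set(basket))
            if stage_no == 0:
                item_counts.update(items)  # items are counted on the first pass only
            for i, j in combinations(items, 2):
                # The first pass hashes every pair; later passes hash only survivors.
                if stage_no == 0 or passes_filters(i, j, frequent_items, kept):
                    buckets[pair_hash(i, j, a, b, num_buckets)] += 1
        if stage_no == 0:
            frequent_items = {i for i, c in item_counts.items() if c >= support}
        kept.append((a, b, [c >= support for c in buckets]))
    # Final pass: count the pairs that survived every filter.
    pair_counts = Counter()
    for basket in baskets:
        for i, j in combinations(sorted(set(basket)), 2):
            if passes_filters(i, j, frequent_items, kept):
                pair_counts[(i, j)] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= support}


# Made-up sample data for illustration only.
baskets = [[1, 2, 3], [1, 2, 3], [1, 2], [1, 2, 4], [2, 3], [1, 3]]
two_tables = multistage(baskets, 4, [(31, 1, 11), (17, 13, 7)])
three_tables = multistage(baskets, 4, [(31, 1, 11), (17, 13, 7), (7, 3, 5)])
print(two_tables, three_tables)
```

Adding a stage never changes the final answer, because a truly frequent pair passes every filter; it only changes how many infrequent pairs reach the final count.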
