Given a universe U of n elements and a collection S = {S_{1}, S_{2}, …, S_{m}} of subsets of U, where every subset S_{i} has an associated cost, find a minimum-cost subcollection of S that covers all elements of U.

Example:

U = {1,2,3,4,5}

S = {S_{1}, S_{2}, S_{3}}

S_{1} = {4,1,3}, Cost(S_{1}) = 5

S_{2} = {2,5}, Cost(S_{2}) = 10

S_{3} = {1,4,3,2}, Cost(S_{3}) = 3

Output: Minimum cost of set cover is 13 and set cover is {S_{2}, S_{3}}

Apart from taking all three sets (cost 18), there are two possible set covers: {S_{1}, S_{2}} with cost 15 and {S_{2}, S_{3}} with cost 13.
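The claimed minimum can be checked by trying every subcollection. The sketch below is only an illustration for tiny inputs (it is exponential in the number of sets); the function name `min_cost_cover` and the dictionary layout are assumptions made for this example, not part of the problem statement.

```python
from itertools import combinations

def min_cost_cover(universe, subsets, costs):
    """Try every subcollection and return (cost, names) of a cheapest cover."""
    best = None
    names = sorted(subsets)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            # Union of all sets in this candidate subcollection.
            covered = set().union(*(subsets[name] for name in combo))
            if covered >= universe:
                cost = sum(costs[name] for name in combo)
                if best is None or cost < best[0]:
                    best = (cost, combo)
    return best

U = {1, 2, 3, 4, 5}
S = {"S1": {4, 1, 3}, "S2": {2, 5}, "S3": {1, 4, 3, 2}}
cost = {"S1": 5, "S2": 10, "S3": 3}
print(min_cost_cover(U, S, cost))  # (13, ('S2', 'S3'))
```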

**Why is it useful?**

It was one of Karp's NP-complete problems, shown to be so in 1972.

Other applications: edge cover, vertex cover.

Interesting example: IBM finds computer viruses (Wikipedia).

Elements: 5000 known viruses.

Sets: 9000 substrings of 20 or more consecutive bytes from viruses, not found in 'good' code.

A set cover of 180 was found. It suffices to search for these 180 substrings to verify the existence of known computer viruses.

Another example:

Suppose General Motors needs to buy a certain amount of varied supplies, and there are suppliers that offer various deals for different combinations of materials (Supplier A: 2 tons of steel + 500 tiles for $x; Supplier B: 1 ton of steel + 2000 tiles for $y; etc.). Set covering can be used to find the cheapest way to get all the materials.

Source: http://math.mit.edu/~goemans/18434S06/setcover-tamara.pdf

**Set Cover is NP-Hard:** No polynomial-time algorithm is known for this problem, because it is NP-hard. There is, however, a polynomial-time greedy algorithm that gives a Log n approximation.

**Log n-Approximate Greedy Algorithm:** Let U be the universe of elements, {S_{1}, S_{2}, … S_{m}} be the collection of subsets of U and Cost(S_{1}), Cost(S_{2}), … Cost(S_{m}) be the costs of the subsets.

1) Let I represent the set of elements included so far. Initialize I = {}.
2) Do the following while I is not the same as U:
   a) Find the set S_{i} in {S_{1}, S_{2}, ... S_{m}} that is most cost effective, i.e., for which the ratio of Cost(S_{i}) to the number of newly added elements is minimum. In other words, pick the set that minimizes Cost(S_{i}) / |S_{i} - I|, ignoring sets that add no new element.
   b) Add the elements of the picked S_{i} to I, i.e., I = I U S_{i}.
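The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the name `greedy_set_cover`, the dictionary-based input and the first-wins tie-breaking are assumptions made for the example. It tracks the uncovered elements U - I instead of I, which is equivalent.

```python
def greedy_set_cover(universe, subsets, costs):
    """Weighted greedy set cover: returns (chosen set names, total cost)."""
    uncovered = set(universe)   # U - I in the notation above
    chosen = []
    total = 0
    while uncovered:
        best_name, best_ratio = None, None
        for name, elems in subsets.items():
            new = len(elems & uncovered)
            if new == 0:
                continue  # adds nothing, so its ratio is undefined
            ratio = costs[name] / new
            # Strict '<' means ties go to the set seen first (an assumption).
            if best_ratio is None or ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best_name)
        total += costs[best_name]
        uncovered -= subsets[best_name]
    return chosen, total

U = {1, 2, 3, 4, 5}
S = {"S1": {4, 1, 3}, "S2": {2, 5}, "S3": {1, 4, 3, 2}}
cost = {"S1": 5, "S2": 10, "S3": 3}
print(greedy_set_cover(U, S, cost))  # (['S3', 'S2'], 13)
```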

**Example:** Let us consider the above example to understand the greedy algorithm.

*First Iteration:* I = {}

The per new element cost for S_{1} = Cost(S_{1})/|S_{1} – I| = 5/3

The per new element cost for S_{2} = Cost(S_{2})/|S_{2} – I| = 10/2

The per new element cost for S_{3} = Cost(S_{3})/|S_{3} – I| = 3/4

Since S_{3} has the minimum value, S_{3} is added and I becomes {1,4,3,2}.

*Second Iteration:* I = {1,4,3,2}

S_{1} doesn't add any new element to I (|S_{1} – I| = 0), so its ratio 5/0 is undefined and S_{1} is not considered.

The per new element cost for S_{2} = Cost(S_{2})/|S_{2} – I| = 10/1. Note that S_{2} adds only 5 to I.

S_{2} is added, I becomes {1,4,3,2,5} = U, and the algorithm stops with the cover {S_{3}, S_{2}} of cost 13.

The greedy algorithm provides the optimal solution for the above example, but it may not provide an optimal solution all the time.

Consider the following example.

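S_{1} = {1, 2}

S_{2} = {2, 3, 4, 5}

S_{3} = {6, 7, 8, 9, 10, 11, 12, 13}

S_{4} = {1, 3, 5, 7, 9, 11, 13}

S_{5} = {2, 4, 6, 8, 10, 12, 13}

Let the cost of every set be the same, with U = {1, 2, …, 13}. The greedy algorithm produces {S_{3}, S_{2}, S_{1}}: it picks S_{3} first (8 new elements), then S_{2} (4 new elements), and finally needs one more set to cover element 1. The optimal solution is {S_{4}, S_{5}}, which uses only two sets.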

**Proof that the above greedy algorithm is Logn approximate.**

Let OPT be the cost of an optimal solution. Number the elements in the order in which the greedy algorithm covers them, so that (k-1) elements are covered before the k'th element. Then the cost of the k'th element <= OPT / (n-k+1). (The cost of an element is the cost of the set that first covers it divided by the number of new elements added by that set.)

How did we get this result? At the start of the iteration that covers the k'th element, all of the not yet covered elements U-I can be covered by the sets of the optimal solution, at a total cost of at most OPT. So at least one set in the optimal solution has a per-new-element cost of at most OPT/|U-I|. Since the greedy algorithm picks the most cost-effective set, the per-element cost of the picked set is also at most OPT/|U-I|. Therefore cost of k'th element <= OPT/|U-I|.

The value of |U-I| at that point is at least n – (k-1), which is n-k+1, because at most (k-1) elements were covered before this iteration. Hence cost of k'th element <= OPT/(n-k+1).

Cost of Greedy Algorithm = Sum of costs of the n elements [putting k = 1, 2, ..., n in the above formula]

<= OPT/n + OPT/(n-1) + ... + OPT/1

= OPT * (1 + 1/2 + ... + 1/n)

<= OPT * (1 + ln n) [since 1 + 1/2 + ... + 1/n <= 1 + ln n]

So the cost of the greedy solution is within a factor of O(Log n) of OPT.
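The last step relies on the harmonic sum 1 + 1/2 + ... + 1/n being at most 1 + ln n. A quick numerical check (a sketch only; the helper name `harmonic` is chosen for this example) prints both quantities side by side for a few values of n:

```python
import math

def harmonic(n):
    """Return H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1 / k for k in range(1, n + 1))

for n in (1, 5, 13, 1000):
    h = harmonic(n)
    bound = 1 + math.log(n)
    # H_n never exceeds 1 + ln n, so the last column is always True.
    print(n, round(h, 3), round(bound, 3), h <= bound)
```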

Source: http://math.mit.edu/~goemans/18434S06/setcover-tamara.pdf

When every set has the same cost, Set Cover becomes the classic NP-hard problem of finding the minimum number of sets that cover all elements of a given universe. In other words, given a universe U and a collection S of subsets of U, the problem is to find a subcollection C of S such that every element of U is contained in at least one set in C and the size of C is minimized.

One approach to solving the Set Cover problem is to use a greedy algorithm, which iteratively selects the set that covers the most uncovered elements until all elements are covered. Here’s how the greedy algorithm works:

1. Initialize an empty set C to be the cover.
2. While there are uncovered elements:
   a. Select the set S that covers the most uncovered elements.
   b. Add S to C.
   c. Remove all covered elements from the set of uncovered elements.
3. Return C as the cover.
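A minimal Python sketch of these steps is shown below, run on the 13-element example given earlier. The name `greedy_cover` and the rule that ties go to the earliest set are assumptions made for this illustration.

```python
def greedy_cover(universe, subsets):
    """Unweighted greedy: repeatedly take the set covering the most uncovered elements."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # max() returns the first maximal name, so ties go to the earliest set.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the subsets do not cover the universe")
        cover.append(best)
        uncovered -= subsets[best]
    return cover

U = set(range(1, 14))
S = {
    "S1": {1, 2},
    "S2": {2, 3, 4, 5},
    "S3": {6, 7, 8, 9, 10, 11, 12, 13},
    "S4": {1, 3, 5, 7, 9, 11, 13},
    "S5": {2, 4, 6, 8, 10, 12, 13},
}
print(greedy_cover(U, S))  # ['S3', 'S2', 'S1']
```

The greedy cover uses three sets, while S4 and S5 together already cover all 13 elements.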

This algorithm provides an approximate solution to the Set Cover problem. The approximation factor is about ln(n), where n is the number of elements in the universe U. More precisely, the greedy algorithm always finds a cover with at most H_n = 1 + 1/2 + ... + 1/n <= ln(n) + 1 times as many sets as the optimal cover.

### Advantages:

- The greedy algorithm is simple and easy to implement.
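- It runs in polynomial time. A straightforward implementation rescans all m sets (each of size at most n) in every iteration, which takes O(nm) time per iteration over at most min(n, m) iterations; with careful bookkeeping the total time can be brought down to the sum of the set sizes, which is at most O(nm). Here n is the number of elements in U and m is the number of sets in S.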
- The approximation factor of ln(n) is a proven guarantee, so we know that the solution is at most ln(n) times larger than the optimal solution.

### Disadvantages:

- The greedy algorithm may not always find the optimal solution, so it is only an approximation algorithm.
- When several sets are tied for the best choice, the result depends on how ties are broken (for example, on the order in which the sets are listed), which can affect the quality of the solution.
- The approximation factor of ln(n) grows with n, so the guarantee becomes weaker for large values of n.