Design data structures for a very large social network like Facebook or Linkedln
How would you design the data structures for a very large social network like Facebook or Linkedln? Describe how you would design an algorithm to show the shortest path between two people (e.g., Me-> Bob-> Susan-> Jason-> You).
Asked In : Google Interview
A good way to approach this problem is to remove some of the constraints and solve it for that situation first.
Case 1: Simplify the Problem (Not considering millions of people)
We can construct a graph by treating each person as a node and letting an edge between two nodes indicate that the two users are friends. If we want to find the path between two people, we start with one person and do a simple breadth-first search. Alternatively, we can do bidirectional breadth first search. This means doing two breadth first searches, one from the source and one from the destination. When the searches collide, we know we’ve found a path.
Why not a depth-first search work well? First, the depth-first search would just find a path. It wouldn’t necessarily find the shortest path. Second, even if we just needed any path, it would be very inefficient. Two users might be only one degree of separation apart, but it could search millions of nodes in their”subtrees” before finding this relatively immediate connection.
In the implementation, we’ll use two classes to help us. BFSData holds the data needed for a breadth-first search, such as the isVisited hash table and the toVisit queue. PathNode represents the path as we’re searching, storing each Person and the previousNode we visited in this path.
Main Logic in Java given below
How fast is above BFS based solution?
Suppose every person has k friends, and Source S and Destination D have a friend C in common.
1. Traditional breadth-first search from S to D: We go through roughly k+k*k nodes: each of S’s k friends, and then each of their k friends.
2. Bidirectional breadth-first search: We go through 2k nodes: each of S’s k friends and each of D’s k friends. Of course, 2k is much less than k+k*k.
3. Generalizing this to a path of length q, we have this:
3.1 BFS: O(kq)
3.2 Bidirectional BFS: 0( kq/2 + kq/2), which is just 0( kq/2)
If we imagine a path like A->B->C->D->E where each person has 100 friends, this is a big difference. BFS will require looking at 100 million (1004) nodes. A bidirectional BFS will require looking at only 20,000 nodes (2 x 1002).
Case 2: Handle Millions of Users
For these many users, we cannot possibly keep all of our data on one machine. That means that our simple Person data structure from above doesn’t quite work-our friends may not live on the same machine as we do. Instead, we can replace our list of friends with a list of their IDs, and traverse as follows:
1: For each friend ID: int machine index = getMachineIDForUser(person_ID);
2: Go to machine #machine_index
3: On that machine, do: Person friend = getPersonWithID( person_ID);
The code below outlines this process. We’ve defined a class Server, which holds a list of all the machines, and a class Machine, which represents a single machine. Both classes have hash tables to efficiently lookup data.
Main Logic in Java given below->
Following are some optimizations and follow-up questions.
Optimization: Reduce machine jumps
Jumping from one machine to another is expensive. Instead of randomly jumping from machine to machine with each friend, try to batch this jumps-e.g., if five of my friends live on one machine, I should look them up all at once.
Optimization: Smart division of people and machines
People are much more likely to be friends with people who live in the same country as they do. Rather than randomly dividing people across machines, try to divide them by country, city, state, and so on. This will reduce the number of jumps.
Question: Breadth-first search usually requires “marking” a node as visited. How do you do that in this case?
Usually, in BFS, we mark a node as visited by setting a visited flag in its node class. Here, we don’t want to do that. There could be multiple searches going on at the same time, so it’s a bad idea to just edit our data.
Instead, we could mimic the marking of nodes with a hash table to look up a node id and determine whether it’s been visited.
Other Follow-Up Questions:
1. In the real world, servers fail. How does this affect you?
2. How could you take advantage of caching?
3. Do you search until the end of the graph (infinite)? How do you decide when to give up?
4. In real life, some people have more friends of friends than others and are therefore more likely to make a path between you and someone else. How could you use this data to pick where to start traversing?
This article is contributed by Mr. Somesh Awasthi. If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to firstname.lastname@example.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above.