Tuesday, 5 January 2016

Design a distributed crawl system

1. Constraints

Assuming 1 trillion web pages
Assuming each page has 10 links to other web pages on average
Assuming each web page is updated every 7 days on average

So the task is to crawl 1 trillion web pages every 7 days (604,800 seconds), which is about 1,653,439 web pages per second.
Assuming the peak load is 5 times the average, the peak-hour rate is 1,653,439 * 5 = 8,267,195 web pages per second.

Assuming that each machine can process 10 web pages per second, this requires 826,720 machines.
Assuming that each data center has 10,000 machines, this requires 83 data centers.
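
The capacity estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the same back-of-envelope calculation; the constant names and the ceil_div helper are made up for illustration, and the factor of 5 is read as a peak-to-average multiplier.

```python
# Back-of-envelope crawl capacity, using the assumptions stated above.
TOTAL_PAGES = 10**12               # 1 trillion web pages
REFRESH_DAYS = 7                   # every page is recrawled once per 7 days
PEAK_FACTOR = 5                    # assumed peak-to-average ratio
PAGES_PER_MACHINE_PER_SEC = 10     # crawl rate of one machine
MACHINES_PER_DATA_CENTER = 10_000


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up (hypothetical helper)."""
    return -(-a // b)


seconds = REFRESH_DAYS * 24 * 60 * 60                        # 604,800
avg_rate = TOTAL_PAGES // seconds                            # pages/second, rounded down
peak_rate = avg_rate * PEAK_FACTOR
machines = ceil_div(peak_rate, PAGES_PER_MACHINE_PER_SEC)
data_centers = ceil_div(machines, MACHINES_PER_DATA_CENTER)

print(f"average rate : {avg_rate:,} pages/s")
print(f"peak rate    : {peak_rate:,} pages/s")
print(f"machines     : {machines:,}")
print(f"data centers : {data_centers:,}")
```

Changing any one constant (for example the per-machine crawl rate) shows how sensitive the machine count is to that assumption.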

Assuming that each master machine can manage 1,000 slave machines, 830 master machines are needed (10 per data center). The work also has to be partitioned among the master machines (e.g. by prefix).
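
One possible reading of "prefix" partitioning is to hash the host name of each URL and use the leading bytes of the hash to pick a master. The sketch below assumes that reading; the function name master_for_url, the choice of SHA-1, and the 4-byte prefix are illustrative assumptions, not part of the original design.

```python
import hashlib
from urllib.parse import urlsplit

NUM_MASTERS = 830  # 83 data centers * 10 masters each, from the estimate above


def master_for_url(url: str, num_masters: int = NUM_MASTERS) -> int:
    """Return the index of the master responsible for this URL.

    Hypothetical scheme: hash the host name and use the first 4 bytes
    of the digest as the partition prefix, so that all pages of one
    host are handled by the same master.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    prefix = int.from_bytes(digest[:4], "big")
    return prefix % num_masters


if __name__ == "__main__":
    for u in ("http://example.com/a", "http://example.com/b", "http://example.org/"):
        print(u, "->", master_for_url(u))
```

Partitioning by host rather than by full URL keeps per-site politeness limits on a single master, at the cost of uneven load when a few hosts are very large.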

Assuming there are 1 million keywords and each keyword is associated with 100,000,000 web pages, the data size requirement is 1,000,000 * ([20 Bytes] + 100,000,000 * [20 Bytes]) ≈ 2,000,000,000,000,000 Bytes = 2,000 TB.

Assuming that each machine has 1 TB of storage, this requires 2,000 machines.
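
The storage estimate can be checked the same way. This sketch assumes the first 20 bytes in the formula is the size of one keyword and the second is the size of one page reference, and that 1 TB means 10^12 bytes; the constant names are illustrative.

```python
# Back-of-envelope index size, using the assumptions stated above.
KEYWORDS = 1_000_000
PAGES_PER_KEYWORD = 100_000_000
KEYWORD_BYTES = 20        # assumed size of one keyword
PAGE_REF_BYTES = 20       # assumed size of one page reference
TB = 10**12               # decimal terabyte

index_bytes = KEYWORDS * (KEYWORD_BYTES + PAGES_PER_KEYWORD * PAGE_REF_BYTES)
index_tb = index_bytes / TB

# The keywords themselves add only 20 MB, so the total rounds to 2,000 TB.
storage_machines = round(index_tb)  # at 1 TB per machine

print(f"index size       : {index_bytes:,} bytes (about {index_tb:,.0f} TB)")
print(f"storage machines : {storage_machines:,}")
```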

2. Framework

