- Big Data Counting: How To Count A Billion Distinct Objects Using Only 1.5KB Of Memory – this article showed you 3 different ways of distinct object counting. They are HashSet, Linear Probabilistic Counter and HyperLogLog. It is tradeoff between space and accuracy. When your object size is not big, you may simply create a unique key per object and track # of unique key in the Set. However, when you need to deal with over billion of them, you may not want to create and put those unique keys in memory. Then, you may need to ask if you really need the count to be 100% accurate? If a close approximation is good enough for you, you may consider other techniques.
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is NOT in the set or may be in the set. For those who wants to dig more into it, you should first take a look at this video.
After that you can check the articles below:
- What are Bloom Filters – A great tutorial of bloom filter that explained to you why you need it and how it works internally. The article is a bit lengthy as it covered some basic of how “hashing” works. A great read overall.
- Bloom Filter Tutorial – This article is by far the quickiest way for you to pick up the idea of bloom filter and how it works. It includes some widgets to let you experiment it as well. The speed of the hash functions are important to get the existence check fast like murmur is 800% faster than md5. See this post for detailed.
- Google Guava – The data structure is so useful that Google has included it in its Guava library.
- Bloom Filter Usage In Crawler – This article showed you that why Bloom Filter is necessarily if you want to build a crawler that crawls the Net like Google did. As crawler needs to check if an URL is previously crawled, it can be done if you put every URLs crawled as key in HashMap and answer this question via lookup. But if you are dealing with billion of urls, you may not have enough memory to hold all crawled URLs. That is where we should use Bloom Filter.
Consistent hashing for Sharding
If you shard across multiple servers (ie.1,2 and 3), you can distribute your key via hash(key)%3. But if you do that, you will have trouble if you have add or delete a server as the # you mod will be changed and all keys need to recalculate and migrate to the new servers. It will jam your internal network depending on how many we are talking about here. To avoid that, you don’t mod with a number that change. For example, you can predefine a big number like 1000 as constant, hash your server id via hash(server_id)%1000 and place it on the 0-999 list as marker. After that, the result from the hash(key)%1000 can be used to find the index in the list and then you can walk up in the list to find the closest server that your key should be assigned. If you walk to the end of the list and not able to find a marker, you continue from the beginning like a circular list. The power of this strategy is if you add and delete a server you don’t trigger all keys to change their servers. Moreover, you can create > 1 markers per server via applying different hash algorithm to make the assignment of the key more distributed and random. For more powerful machine you can hash more times so it distributes more markers in the list to obtain more keys.
HA on the globe
How Algolia achieve HA in worldwide – This article went thru the path that Algolia took to reach HA in worldwide. It is a great read.
- If the data can fit into a single machine, it will make HA simpler to setup.
- A cluster of 3 replica is common practice.
- For Algolia to address the geo-distance issue, they build DSN (ie. distribute search network) across different regions rather than using CDN on top of a search engine as it causes more problems (eg. cache invalidation) than value given only small % of queries are frequently made.
- Inter-region replication thru replicating write operations at the application level rather than file system level.
- DSN is used to redirect the end user directly to the closest location. Route53 DNS service of Amazon is good but has regional limit. Its latency-based routing is limited to the AWS regions. If you have locations not covered by AWS like India, Hong Kong, Canada and Russia, you can consider to use NSOne.
- The Random File Corruption due to TRIM implementation of some of their SSDs. However, Algolia didn’t lose data because it has 3 data replica and for its replica it doesn’t replicate the final indexed result but the operations itself so even data corruption on a disk of a machine will be contained instead of spilling over to others.
- Algolia found an optimal setup with a machine containing an Intel Xeon E5 1650v2, 128GB of RAM, and 2x400GB Intel S3700 SSD. The model of the SSD was very important for durability. They burned a lot of SSDs before finding the correct model that can operate in production for years.
CAP theorem said that for consistency, availability and partition, you can only have 2. For global system that deals with tons of traffics in speed, consistency is the one that people lower its bar. Instead of making sure all replica showing the same value, they will first make sure the update shown on majority of the replica then continue. Later on, all replica will be guaranteed to be in sync (ie. Eventual Consistency).
Distributed Consensus – RAFT
If you have multiple replica, how you handle a write operation in fault tolerant way? This is the problem of distributed consensus. To achieve that, a consensus protocol named RAFT (similar to PAXO) is used. Here you can read more about the consensus protocol RAFT.
- Redis key/value store that we use to check rates and limits as well as storing real time logs and counters for each application ID. These counters are used to build our real time dashboard which can be viewed when you connect to your account. This is useful for visualizing your last API calls and for debugging.
- Raft: In search of an Understandable Consensus Algorithm – The original paper of Raft. Raft is a consensus algorithm that is based on Paxos. Compared to Paxos, Raft is designed to have fewer states and a simpler, more understandable algorithm.
- The Secret Lives of Data – this link provides us a great visual explanation of Raft
- Algolia Architecture Review
- Alogolia Architecture in Detailed – from HighScalability.com