A few useful things to consider for this scale in instant messaging:
1. The database needs to be distributed. The data volume and query throughput are too much for one server. Your billion users are also likely geographically distributed, so distributing your data helps here too.
2. You cannot use serially assigned primary keys for messages, as there would be too much overhead in the consensus mechanism for allocating these. UUIDs are a common choice here.
3. Instant messaging has a tricky, but predictable, data retrieval pattern: 99% of your stored messages are rarely accessed, but the most recent 1% of your messages are in extremely high demand. Your database needs to be able to keep the new messages in hot storage, but age the old ones out to cold (cheaper) storage.
4. Following on from that, as time is the overwhelming factor governing message retrieval likelihood: you'll want to partition your data by time. This lets you age data out cleanly, optimise for filesystem caching of recent partitions, and keep your index trees (probably user & message ID) from growing too large. You could even use a dedicated cache for the most recent hot messages.
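To make points 2 and 4 concrete, here's a minimal sketch in Python. The function names are hypothetical; the idea is just that UUIDs need no cross-node coordination, and that a time-derived partition key makes aging out old data a matter of dropping whole partitions.

```python
import uuid
from datetime import datetime, timezone

def new_message_id() -> str:
    # A random UUID can be generated independently on any node,
    # so no consensus round is needed to allocate a message ID.
    return str(uuid.uuid4())

def partition_for(sent_at: datetime) -> str:
    # Day-granularity partition name (hypothetical scheme).
    # Old partitions can be moved wholesale to cold storage.
    return sent_at.strftime("messages_%Y_%m_%d")

msg_id = new_message_id()
part = partition_for(datetime(2024, 5, 1, tzinfo=timezone.utc))
# part == "messages_2024_05_01"
```

In practice you'd likely use a time-ordered ID scheme (UUIDv7, Snowflake-style IDs) so the ID itself sorts by time, but plain UUIDs plus an explicit timestamp column get the same partitioning behaviour.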
As for your question about the user being offline for a few days, the speed of this would depend on how many partitions you're keeping hot. Querying the most recent data is something that timeseries/columnar databases do very well.
For the second question, a good case study is Discord. They have API access layers that sit between users and their database, and these layers batch requests for the same data into single DB queries. If multiple users log in at the same time and request the same range of messages, the access layer collects all the requests, sends a single query to the DB, then copies the response out to all the clients.
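The core of that batching idea is the "single-flight" pattern: concurrent requests for the same key share one fetch. A minimal threaded sketch (Discord's real access layer is far more involved, with batching windows, backpressure, and so on):

```python
import threading

class RequestCoalescer:
    """Collapse concurrent requests for the same key into one fetch."""

    def __init__(self, fetch):
        self._fetch = fetch          # the actual DB query function
        self._lock = threading.Lock()
        self._inflight = {}          # key -> {"done": Event, "result": ...}

    def get(self, key):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                # First caller for this key becomes the leader.
                entry = {"done": threading.Event(), "result": None}
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        if leader:
            try:
                entry["result"] = self._fetch(key)
            finally:
                with self._lock:
                    del self._inflight[key]
                entry["done"].set()
        else:
            # Followers wait for the leader's result instead of
            # issuing their own DB query.
            entry["done"].wait()
        return entry["result"]
```

Five clients asking for the same message range while a fetch is in flight produce one DB query; each gets a copy of the same result.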
I use something like 'SELECT message, sender, sent_date FROM offline_chat WHERE user_id = ? AND acknowledged = 0' (note `from` is a reserved word in SQL, so the sender column needs a different name).....
Or something similar. I would bet it's all about indexing. It could be in a separate DB just for offline messages. When users come online, the messages would be moved to a ViewedDB or something. Could be just a really large table. =)
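That query works fine if it's indexed. A runnable sketch using SQLite and the commenter's (hypothetical) table and column names, with a composite index on the columns the WHERE clause filters on:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE offline_chat (
        message_id   TEXT PRIMARY KEY,
        user_id      TEXT NOT NULL,
        sender       TEXT NOT NULL,
        message      TEXT NOT NULL,
        sent_date    TEXT NOT NULL,
        acknowledged INTEGER NOT NULL DEFAULT 0
    )
""")
# The composite index is what keeps this lookup fast even in a
# really large table: it covers exactly the WHERE clause below.
conn.execute("CREATE INDEX idx_pending ON offline_chat (user_id, acknowledged)")

conn.executemany(
    "INSERT INTO offline_chat VALUES (?, ?, ?, ?, ?, ?)",
    [("m1", "alice", "bob", "hi", "2024-05-01", 0),
     ("m2", "alice", "bob", "old news", "2024-04-30", 1),
     ("m3", "carol", "dan", "hello", "2024-05-01", 0)],
)

pending = conn.execute(
    "SELECT message, sender, sent_date FROM offline_chat "
    "WHERE user_id = ? AND acknowledged = 0",
    ("alice",),
).fetchall()
# pending == [("hi", "bob", "2024-05-01")]
```

Marking messages acknowledged (or moving them to an archive table) on delivery keeps the pending set small regardless of total table size.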