Marty Weiner from Pinterest has written a nice post on how they scaled their DB
What I liked most about approach is the sheer simplicity of how they did it
Any DB Scaling needs to cater to few things
- fast and easy lookups irrespective of the size of Data
- one should be able to find needle in haystack without sweating
- also means JOINs should be less & should never go across shards
- Scale-out with minimal or no impact to services
- High availability of the data
Key notes from the post are:
- chose Mysql for it is a battle tested db for large datasets
- => Sharding & Scaling are to be done outside the DB
- Master is always for write as well as read
- Avoids data sync issues
- Slave is only for availability purpose, and not for load sharing
- Use uuid (64 bit) to uniquely identify the two main Types of objects stored i.e., Pin-ID & board-id
- this uuid is made of Shard-id, type (pin or board), and row-id in corresponding table in that shard
- Shard-id range to actual shard mapping is stored in ZK
- Pin-ID & all other tables related to that PIN-id will be on same shard (including pid-ID to board-id mapping)
- Similarly board- and related data including board-id to pin-id mapping are on one shard
- all pins on a board may or may not be on same shard
- User login to shard mapping is maintained separately
- Scaling can be done by either
- New shard-id range addition to zk
- Splitting shard-id ranges (will require data migration before new shard is made available)
- Or of course, upgrading the machines 🙂