Where is my SSD Write?

Most databases use SSDs to cache data. The compute-storage separation architecture allows both compute & storage servers (aka page servers) to store pages in an SSD cache. My team at a cloud database company started encountering a mysterious bug where some of the writes made to these SSD caches occasionally went missing 🤯

What does it mean for a write to go missing? The write is successfully acknowledged by the database, but a subsequent read returns a stale value! How do you even find the root cause of such an issue? Is it due to a software bug in the compute or page server? Possibly a race condition in the code that handles SSD IOs? Perhaps a silent data corruption? The layers of complexity in database systems make it so much more difficult to isolate such rare issues.
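To make the failure mode concrete, here is a minimal Python sketch of a lost write. Everything in it is hypothetical (the `LossySsdCache` class, the `drop` flag, the per-page `lsn` field) and it is not how our system was built; it only illustrates why this kind of bug is hard to see. The stale page is internally valid, so a per-page checksum would still pass. Only a version tracked somewhere outside the cache exposes the problem.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int        # version stamped into the page when it was written
    payload: bytes


class LossySsdCache:
    """Toy cache that can acknowledge a write and silently drop it."""

    def __init__(self):
        self._pages = {}

    def write(self, page_id, page, drop=False):
        if not drop:
            self._pages[page_id] = page
        return True  # acknowledged either way (hypothetical fault injection)

    def read(self, page_id):
        return self._pages[page_id]


def is_lost_write(page, expected_lsn):
    # The page on disk is older than the last write we acknowledged.
    return page.lsn < expected_lsn


cache = LossySsdCache()
expected_lsn = {}  # assumed to live outside the SSD cache, e.g. derived from the log

cache.write("p1", Page(lsn=10, payload=b"old"))
expected_lsn["p1"] = 10

cache.write("p1", Page(lsn=11, payload=b"new"), drop=True)  # acked but lost
expected_lsn["p1"] = 11

stale = cache.read("p1")
lost = is_lost_write(stale, expected_lsn["p1"])
print(lost)
```

The read succeeds and returns well-formed data, which is exactly why nothing in the IO path complains.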

The first step in dealing with any kind of corruption is to detect it. With a corruption detection tool in place, we can tell when one replica mismatches another; at that point we know it has gone rogue and need a quick way to recover. Since distributed systems deal with random failures, many modern databases come with a mechanism to quickly recover from corruption, often by fetching the correct data from peer replicas. Detecting and recovering from corruption is in itself a big engineering challenge. But identifying the root cause of corruptions can help engineering teams build more resilient database systems.
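As a rough illustration of detect-then-recover, the sketch below compares page checksums across replicas and overwrites any outlier with the majority copy. The majority vote and the function names (`find_rogue_replicas`, `repair`) are my assumptions for the example; a real database may instead treat the log or a designated primary as the source of truth.

```python
import hashlib
from collections import Counter


def checksum(page: bytes) -> str:
    return hashlib.sha256(page).hexdigest()


def find_rogue_replicas(replicas: dict[str, bytes]) -> tuple[bytes, list[str]]:
    """Return the majority page content and the replicas that disagree with it."""
    sums = {name: checksum(page) for name, page in replicas.items()}
    majority_sum, votes = Counter(sums.values()).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority; cannot pick a trusted copy")
    good = next(page for name, page in replicas.items() if sums[name] == majority_sum)
    rogue = [name for name, s in sums.items() if s != majority_sum]
    return good, rogue


def repair(replicas: dict[str, bytes]) -> list[str]:
    """Overwrite mismatching replicas with the majority copy; return their names."""
    good, rogue = find_rogue_replicas(replicas)
    for name in rogue:
        replicas[name] = good
    return rogue


# Hypothetical three-way replicated page where one replica missed a write.
replicas = {"replica-a": b"v2", "replica-b": b"v2", "replica-c": b"v1"}
repaired = repair(replicas)
print(repaired)
```

Note that this only tells you which replica is wrong, not why, which is where the real investigation starts.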

In this particular case, the path to the root cause was long: we collected every possible aspect in the telemetry, such as the possible time of corruption, number of writes missing, DBs affected, cache state, etc. While trying to make sense of the software-level telemetry, we also decided to collect hardware information along with it. To our surprise, all the corruptions occurred on a select few SSD models which were well past their support window (e.g. Samsung PM953 or some similar SSDs). What followed was a collaboration with the hardware team to help them narrow down the exact SSD models and check if other services were facing similar issues with those SSDs. While there are standard metrics for SSD endurance (e.g. DWPD, drive writes per day), it was interesting that we observed the corruptions only on specific models deployed by the hardware team. This long battle finally ended when it was confirmed that certain SSD model and firmware combinations were known to experience the missing write issue across multiple services, but since the SSDs were already out of support, there was no plan to patch them with new firmware!
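The step that cracked the case, joining corruption events with hardware inventory, can be sketched in a few lines of Python. The records, field names, and fleet sizes below are made up for illustration and are not our real telemetry. Normalizing by the number of drives per model is also my assumption; the idea is simply that a model with few drives and several incidents stands out.

```python
from collections import Counter

# Made-up records for illustration only; not real incident data.
corruption_events = [
    {"db": "db-1", "ssd_model": "model-A", "firmware": "fw-1"},
    {"db": "db-2", "ssd_model": "model-A", "firmware": "fw-1"},
    {"db": "db-3", "ssd_model": "model-A", "firmware": "fw-1"},
    {"db": "db-4", "ssd_model": "model-B", "firmware": "fw-7"},
]

# Hypothetical number of deployed drives per (model, firmware) combination.
fleet_size = {("model-A", "fw-1"): 50, ("model-B", "fw-7"): 5000}


def events_per_thousand_drives(events, fleet):
    """Rank (model, firmware) combinations by corruption events per 1000 drives."""
    counts = Counter((e["ssd_model"], e["firmware"]) for e in events)
    rates = {combo: 1000 * counts[combo] / fleet[combo] for combo in counts}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


ranking = events_per_thousand_drives(corruption_events, fleet_size)
for combo, rate in ranking:
    print(combo, rate)
```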