Oracle, Slack Outages Show Inadequacies of Legacy Systems, Say Cockroach Labs’ Founders

Legacy systems can’t keep up with the demands of modern digital infrastructure, according to Cockroach Labs’ co-founders.

(Peter Mattis, co-founder, CTO and product officer, Cockroach Labs)

While the most recent massive, global outage hit X (formerly Twitter) on Monday, other recent outages have also made headlines.

Slack, the widely used corporate messaging platform that was acquired by Salesforce in 2021, suffered an almost nine-hour outage on Feb. 26, affecting users “sending and receiving messages, using workflows, loading channels or threads, and logging into Slack. These features may have been degraded or in some cases fully unusable,” according to a blog post from Slack.

Then, on March 4, Oracle’s federal electronic health record [EHR] system experienced a nationwide outage, CNBC reported. The outage disrupted access for users in the Department of Veterans Affairs, the Department of Defense, the U.S. Coast Guard and the National Oceanic and Atmospheric Administration. EHRs are patient health-care records used by medical staff, and their unavailability can severely impact patient care, CNBC also reported.

Outages like those that affected Oracle and Slack become big news because they devastate critical services like corporate communications and health care. The co-founders of Cockroach Labs, a provider of cloud-native distributed SQL database offerings, assert that the legacy nature of some of these platforms’ infrastructure could be behind their downtime vulnerabilities.

‘A Gamble Businesses Can’t Afford To Take’

In a statement shared with MES Computing, Spencer Kimball, Cockroach Labs’ co-founder and CEO, weighed in on how legacy database infrastructure may have contributed to the Oracle EHR outage.

“The Oracle outage is a stark reminder that legacy systems are no longer adequate for the demands of modern digital infrastructure. These systems were never designed to handle the scale, complexity and real-time demands of today’s cloud environments. The result? Costly disruptions that put critical services at risk. Relying on outdated technology to power essential systems is a gamble businesses can’t afford to take. The real question is—how much longer will we let these vulnerabilities jeopardize the continuity of operations? The only way forward is to embrace cloud-native, distributed architectures that offer real resilience—ensuring that when the inevitable failures occur, operations continue without skipping a beat,” Kimball’s statement read.

[RELATED: Cockroach Labs CEO: Being Mid-Sized Doesn’t Mean You Don’t Have Big Database Needs]

MES Computing spoke with Peter Mattis, co-founder, CTO and product officer of Cockroach Labs, about the Slack outage. Mattis said that while he didn’t have any “internal details” on the outage, “we know what happened externally.”

“What [Slack] described on their incident support channel … is there was some kind of issue with their database layer,” Mattis said.

“There are these database technologies, SQL database technologies, Postgres, MySQL. … These are single-node databases. And the issue of single-node databases for companies like Slack is you get to a certain scale and you can’t put all the data for all the customers on one database. So, when this happens, the traditional thing that’s done is you have multiple databases. You kind of partition your customers across databases. This is also known as sharding … they had an issue with some of their database shards,” he said.

Database sharding “is a type of horizontal partitioning that splits large databases into smaller components, which are faster and easier to manage,” according to a definition from Microsoft.
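The partitioning Mattis describes can be illustrated with a minimal Python sketch. It assumes a simple hash-based scheme; the shard count, the `shard_for` and `availability` helpers and the customer names are all hypothetical and say nothing about how Slack actually routes its data.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical; Mattis suggests real deployments run far more


def shard_for(customer_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a customer to one shard with a stable hash of its ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def availability(customers, down_shards, num_shards=NUM_SHARDS):
    """Split customers into (reachable, unreachable) given a set of failed shards."""
    reachable, unreachable = [], []
    for customer in customers:
        if shard_for(customer, num_shards) in down_shards:
            unreachable.append(customer)
        else:
            reachable.append(customer)
    return reachable, unreachable


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(10)]
    ok, affected = availability(customers, down_shards={2})
    print("reachable:", ok)
    print("unreachable:", affected)
```

In this sketch each customer lives on exactly one shard, so a problem with a single shard makes every customer mapped to it unreachable while the rest are unaffected.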

Legacy Database Management Practices Versus Modern

Mattis said that a company like Slack with a vast customer base is likely to have “hundreds or even thousands” of MySQL shards.

He said an operator error during the database sharding process could have contributed to the outage. The other common cause is “configuration error,” he added.

Mattis said that database sharding was a best practice for database management 10 years ago, and that it has traditionally been the only way to scale out data storage systems.

Using a distributed database, such as the one Cockroach Labs provides, could help prevent database downtime, Mattis suggested. One reason is that modern distributed SQL systems, including CockroachDB, have “replication built right in,” he said.

“If they actually had their data storage system running on CockroachDB, there would be replicas of [their] data spread across a whole bunch of different machines. Potentially across different data centers, right? One of those nodes goes down, there is essentially nothing to do. The system will notice it, and then it will self-heal … you want to have systems that self-heal [and] don’t require manual intervention to recover from outages,” he said.
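The self-healing behavior Mattis describes can be sketched as a toy Python model. It is an illustration only, not CockroachDB’s actual algorithm: the `Cluster` class, the replication factor of three, and the least-loaded placement rule are all assumptions made for the example.

```python
REPLICATION_FACTOR = 3  # assumption for this sketch


class Cluster:
    """Toy model: each data range keeps copies on several distinct nodes."""

    def __init__(self, nodes, replication_factor=REPLICATION_FACTOR):
        self.live = set(nodes)
        self.rf = replication_factor
        self.placement = {}  # range_id -> set of nodes holding a replica

    def _least_loaded(self, exclude):
        """Live nodes not in `exclude`, ordered by how many replicas they hold."""
        load = {n: 0 for n in self.live if n not in exclude}
        for replicas in self.placement.values():
            for node in replicas:
                if node in load:
                    load[node] += 1
        return sorted(load, key=lambda n: (load[n], n))

    def add_range(self, range_id):
        """Place a new range on the least-loaded live nodes."""
        self.placement[range_id] = set(self._least_loaded(exclude=set())[: self.rf])

    def fail_node(self, node):
        """Mark a node as dead, then repair without operator involvement."""
        self.live.discard(node)
        self.heal()

    def heal(self):
        """Drop replicas on dead nodes and re-create them on live ones."""
        for replicas in self.placement.values():
            replicas.intersection_update(self.live)
            missing = max(0, self.rf - len(replicas))
            replicas.update(self._least_loaded(exclude=replicas)[:missing])


if __name__ == "__main__":
    cluster = Cluster(["n1", "n2", "n3", "n4", "n5"])
    for range_id in ("r1", "r2", "r3", "r4"):
        cluster.add_range(range_id)
    cluster.fail_node("n1")
    for range_id, replicas in sorted(cluster.placement.items()):
        print(range_id, sorted(replicas))
```

Because every range already has copies on other machines, losing one node in this model leaves all data reachable, and `heal` restores the full replica count as long as enough live nodes remain.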

MES Computing has contacted both Salesforce and Oracle for comments and will update with any responses.