Users of Microsoft’s Azure system lost database records as part of a mass outage on Tuesday. A combination of DNS problems and automated scripts was to blame, according to reports.
Microsoft deleted several Transparent Data Encryption (TDE) databases in Azure that held live customer information. TDE databases dynamically encrypt the information they store, decrypting it when customers access it. Keeping the data encrypted at rest stops an intruder with access to the underlying database files from reading the information.
While there are different approaches to encrypting these tables, many Azure users store their own encryption keys in Microsoft’s Key Vault encryption key management system, in a process called Bring Your Own Key (BYOK).
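For context, fetching a customer-managed key from Key Vault is normally a single SDK call. The sketch below is a minimal illustration using Microsoft’s Python packages azure-identity and azure-keyvault-keys; the vault URL and key name are made-up placeholders, not values from the incident. Note the two distinct failure modes, a distinction that matters later in this story.

```python
# A minimal sketch of a BYOK key lookup, assuming the azure-identity and
# azure-keyvault-keys packages are installed. The vault URL and key name
# are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.keys import KeyClient
from azure.core.exceptions import ResourceNotFoundError, ServiceRequestError

VAULT_URL = "https://example-vault.vault.azure.net"  # placeholder vault
KEY_NAME = "tde-protector"                           # placeholder key name

def fetch_tde_protector() -> None:
    """Fetch the customer-managed TDE protector, or report why we could not."""
    client = KeyClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())
    try:
        key = client.get_key(KEY_NAME)
        print(f"Found key {key.name} ({key.key_type}), enabled={key.properties.enabled}")
    except ResourceNotFoundError:
        print("The key really is gone from the vault")
    except ServiceRequestError:
        # Connectivity and DNS failures surface here -- the key itself may be fine.
        print("Could not reach the vault at all")

if __name__ == "__main__":
    fetch_tde_protector()
```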
The deletions were automated, triggered by a script that drops TDE databases when their corresponding keys can no longer be accessed in the Key Vault, Microsoft explained in a letter reportedly sent to customers.
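To see why that logic is fragile, here is a deliberately naive sketch of such an automation in Python. It is a hypothetical illustration of the failure mode, not Microsoft’s actual script: a job that cannot tell the difference between a key that has been revoked and a key server it simply cannot reach will treat a DNS outage as a reason to delete data.

```python
# Hypothetical illustration of the failure mode described above -- NOT
# Microsoft's actual cleanup job. The flaw it demonstrates: a vault that
# cannot be reached (for example because DNS is down) looks exactly like
# a key that no longer exists.
from dataclasses import dataclass
from urllib.parse import urlparse
import socket

@dataclass
class TdeDatabase:
    name: str
    key_vault_uri: str  # e.g. "https://example-vault.vault.azure.net" (placeholder)

def vault_reachable(vault_uri: str, timeout: float = 2.0) -> bool:
    """Crude probe: can we resolve the vault's hostname and reach port 443?"""
    host = urlparse(vault_uri).hostname
    try:
        socket.create_connection((host, 443), timeout=timeout).close()
        return True
    except OSError:
        # DNS failures raise socket.gaierror, a subclass of OSError.
        return False

def databases_to_drop(databases: list[TdeDatabase]) -> list[str]:
    """What a naive 'drop it if the key is unreachable' job would select."""
    return [db.name for db in databases if not vault_reachable(db.key_vault_uri)]

# During a DNS outage, every database lands on this list, even though the
# keys themselves are intact.
```

A safer design would distinguish “the key is gone” from “the vault is unreachable”, retry over a long window, and demand human sign-off before doing anything destructive.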
The company quickly restored the databases from snapshot backups taken five minutes before the deletion, but that meant any transactions customers had processed in those final five minutes would have to be dealt with manually. In that case, customers would have to raise a support ticket and ask for the restored database copy to be renamed to the original.
Why were the systems accessing the TDE databases unable to reach the Key Vault? The answer stems from a far bigger issue for Microsoft and its Azure customers this week. An outage struck the cloud service worldwide on Tuesday, causing a range of problems, including intermittent access to Office 365 that left users with only around a 50% chance of logging in. Broader Azure cloud resources were also down.
This problem was, in turn, down to a DNS outage, according to Microsoft’s Azure status page:
Preliminary root cause: Engineers identified a DNS issue with an external DNS provider.
Mitigation: DNS services were failed over to an alternative DNS provider which mitigated the issue.
Reports suggested that this DNS outage originated with CenturyLink, which provides DNS services to Microsoft. The company said in a statement that it had suffered a software defect.
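Microsoft’s mitigation happened at the provider level, by switching which company serves its DNS zones, but the resolve-with-a-fallback idea is easy to illustrate from the client side. The sketch below is just such an illustration, using the dnspython library with two well-known public resolvers standing in for the “primary” and “alternative” providers; it is not how Azure’s failover actually works.

```python
# Client-side illustration of resolver failover, assuming dnspython is
# installed (pip install dnspython). The resolver addresses and hostname
# are arbitrary public examples, not the servers involved in the outage.
import dns.resolver
import dns.exception

PRIMARY = ["1.1.1.1"]   # example "primary provider" resolver
FALLBACK = ["8.8.8.8"]  # example "alternative provider" resolver

def resolve_with_failover(hostname: str) -> list[str]:
    """Try the primary resolver; if it fails or times out, try the fallback."""
    for nameservers in (PRIMARY, FALLBACK):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = nameservers
        resolver.lifetime = 2.0  # total time allowed per provider
        try:
            answers = resolver.resolve(hostname, "A")
            return [record.address for record in answers]
        except dns.exception.DNSException:
            continue  # this provider is unhealthy -- try the next one
    raise RuntimeError(f"All resolvers failed for {hostname}")

if __name__ == "__main__":
    print(resolve_with_failover("example.com"))
```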
The whole episode shows what can go wrong when cloud-based systems are interconnected and automated enough to allow cascading failures. A software defect at a DNS provider indirectly led to the deletion of live customer information, with no human in the loop to intervene.
CenturyLink seems to be experiencing serial DNS problems lately. The company, which completed its $34bn acquisition of large network operator Level 3 in late 2017, also suffered a DNS outage in December that reportedly affected emergency services, sparking an FCC investigation.
Azure users can at least take comfort in the fact that Microsoft is offering multiple months of free Azure service for affected parties.
Mahhn
First it was Office365, then 364, now 363.
If you are trying to be the super big cloud service, you might want to have your own DNS servers (hours behind on updates to avoid bad data) to fall back on, since it seems to be a common point of failure for large companies.
Captain Hindsight
Anonymous
Of course they have their own DNS servers; the issue is with an external DNS server. If I follow your logic, should Azure/AWS own the whole internet?
Mahhn
lol, I mean an external one they can maintain and swap to if/when needed. Not their internal network.
Anonymous
I believe that’s what the fix was: they failed over to a secondary DNS provider, but by that time the damage was already done.
FrancoisK
What is the point of automatically dropping a table if the key to decrypt its content is no longer available? Wouldn’t the records remain equally safe from intruders?
Epic_Null
I think the idea is that there’s no reason to keep the table if you can’t get its key, so you can free up that space and allow other tables to use it.
Anonymous
So why should I move from my on-premises gear, which never has these issues, to cloud services?
FrancoisK
With a top-level DNS outage, the issue would impact the ISP your company uses. You would lose access to external resources, but not your internal servers. You are correct… if your business doesn’t need to interface with the outside world (vendors, customers).
Anonymous
That’s just it, it wasn’t a “top-level” DNS issue that affected all ISPs. It was an issue with an external DNS provider that Azure used. Note: the internet didn’t go down around the world yesterday… On-premises gear didn’t have issues and we could still interact with the outside world (vendors, customers). If top-level DNS goes down, we’re not going to be commenting here… we’re grabbing our survival kits and living off the land.
FrancoisK
Clearly not all top-level DNS servers went down, only those used by Microsoft Azure (CenturyLink, the article says). If your ISP’s DNS goes down, you will lose internet service, unless you have automatic failover between two ISPs and/or DNS monitoring, and are able to do that in real time, not five minutes later as the article describes.
Anonymous
Because AWS isn’t experiencing these issues
J
Because your CIO was told it would be a great idea.
John
I get that hindsight is 20/20, but it’s still hard to believe that no one considered this when designing a system that executes certain behavior based on whether it can reach another system across an inherently unreliable medium.
Mark
TDE does not stop a malicious actor with database access from reading the data. Anybody granted access to the database, or anybody who manages to obtain credentials, can read the data.
TDE, like BitLocker, is really only good if physical access to the drive is in doubt.
Larry Page
Thanks for telling hackers how a DNS denial-of-service attack could be used to hose many Azure customers. Given their record of patching other products, I doubt MS will fix this very fast.
Ned Reed
Our company was one of those that had its database deleted. Microsoft took hours to restore the modest 12GB database. Our customers work in healthcare 24×7 and cannot tolerate unscheduled downtime like this. By luck, the failure happened just *before* we switched our product over to Azure. If the failure had happened after we had switched to Azure, our business would have suffered irreparable harm. We’re still shaking our heads at how this could have happened.
Anonymous
And after this issue, are you still considering moving to Azure?
Fahad Ur Rehman Khan
I’m an Azure user, and I didn’t find anything wrong with my network.
FrancoisK
The issue wasn’t Azure as a whole; it was SQL Server databases using TDE. If you are using those features, you would have lost data modified during the five minutes in question.