Microsoft’s multi-factor authentication (MFA) for Office 365 and Azure Active Directory has fallen over for the second time in a week.
Azure’s service status page delivered Tuesday’s bad news:
Between 14:25 UTC and 17:08 UTC on 27 Nov 2018, customers using Multi-Factor Authentication (MFA) may have experienced intermittent issues signing into Azure resources, such as Azure Active Directory, when MFA is required by policy.
Officially, that’s just shy of three hours with either no or intermittent MFA, although it took until 18:53 UTC for Microsoft’s Twitter account to become confident enough to announce that the service was definitely up and running again.
https://twitter.com/MSFT365Status/status/1067521776333307906
Microsoft’s initial root cause analysis (RCA): something went wrong at DNS level which led the infrastructure supporting MFA to become “unhealthy”.
The solution was to reboot – which seemed to work but at the expense of receiving several sarcastic tweets congratulating Microsoft on a successful reboot/turning it off and on again.
— Mr M. (@nonozerobo59) November 27, 2018
Déjà vu – all over again
This issue is the latest in what’s fast becoming a long line of bloopers for Microsoft in recent weeks. The company has only just published an explanation for a longer and more serious MFA outage suffered on 19 November that left many customers unable to log into Office 365 or Azure for an entire working day, or in some cases, longer.
This included frank admissions about what the company said were three interconnected root causes:
- Under high traffic loads, the Azure MFA front-end server’s communication with cache services (which, ironically, exist to boost performance) deteriorated.
- This caused a ‘race condition’ in processing responses from the MFA’s backend servers, a way of saying that different parts of the MFA system were out of sync with one another badly enough to stop them communicating properly.
- This then caused the backend services to overload, at which point MFA stopped working.
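The first and third steps of that chain can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of Microsoft's architecture: the function `simulate_tick`, the hit rates, and the capacity figure are all hypothetical, and the race condition in the middle step is not modelled at all. It simply shows how a drop in cache effectiveness pushes far more traffic onto a backend than it was sized for.

```python
from dataclasses import dataclass


@dataclass
class TickResult:
    cache_hits: int      # requests answered from the cache
    backend_calls: int   # requests that fell through to the backend
    failed: int          # requests the backend could not absorb


def simulate_tick(requests: int, cache_hit_rate: float, backend_capacity: int) -> TickResult:
    """Model one time slice of a hypothetical front end with a cache in front of a backend."""
    cache_hits = round(requests * cache_hit_rate)
    # Every request the cache cannot serve lands on the backend.
    backend_calls = requests - cache_hits
    # Anything beyond the backend's capacity for this slice fails.
    failed = max(0, backend_calls - backend_capacity)
    return TickResult(cache_hits, backend_calls, failed)


# Hypothetical numbers: same traffic, same backend, only the cache behaviour changes.
healthy = simulate_tick(requests=1000, cache_hit_rate=0.9, backend_capacity=200)
degraded = simulate_tick(requests=1000, cache_hit_rate=0.2, backend_capacity=200)

print(healthy)
print(degraded)
```

With these made-up figures, the healthy case sends 100 requests to the backend and none fail, while the degraded case sends 800 and 600 of them fail, even though the incoming traffic is identical.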
Extraordinarily – this is the bit that will make some customers sit up – Microsoft didn’t notice any of this until users started complaining about MFA’s disappearance.
How so? Because:
Gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time.
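For readers wondering what closing such a gap might look like in principle, here is a minimal sketch of a sliding-window failure-rate check. It is purely illustrative: the class `FailureRateMonitor`, the window size, and the 5% threshold are assumptions made for this example and say nothing about how Microsoft monitors MFA. The idea is that sign-in outcomes are recorded as they happen, so a rising failure rate can trigger an alert before users start complaining.

```python
from collections import deque


class FailureRateMonitor:
    """Hypothetical monitor that tracks the most recent sign-in outcomes."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        # Only the last `window` outcomes are kept; older ones drop off automatically.
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success: bool) -> None:
        self.results.append(success)

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        # Alert only when the failure rate is strictly above the threshold.
        return self.failure_rate() > self.threshold


monitor = FailureRateMonitor(window=100, threshold=0.05)
for _ in range(95):
    monitor.record(True)
for _ in range(5):
    monitor.record(False)
alert_at_five_percent = monitor.should_alert()

monitor.record(False)  # pushes the oldest success out of the window
alert_at_six_percent = monitor.should_alert()
```

In this sketch, exactly 5 failures in 100 does not trip the alert, but a sixth failure within the window does. A real system would also need to decide who gets paged and how to avoid alerting on noise, which is outside the scope of this example.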
Microsoft then explains how attempting to fix the above problems for APAC and EMEA regions by re-routing MFA traffic via the US caches simply made things worse there too.
Having issued a post-mortem for the first outage, Microsoft has promised to follow up with something similar for Tuesday’s.
What might be going on?
There is perhaps a small clue in the analysis for the 19 November outage where Microsoft mentions that the service was struggling to cope with high traffic levels.
Perhaps, then, it’s simply that lots of organisations and consumers have been turning on MFA, which wouldn’t be surprising given that Microsoft itself has been promoting the extra security benefits that it can bring.
So, let’s be positive: the outages might not be symptoms of MFA’s failure but rather of its sudden – and very welcome – popularity.
Mahhn
This would be comical if not for people having become dependent on email, which didn’t even exist 30 years ago (for the average person and business).
Maybe this will serve as a DR reminder that email can fail on a large scale. The bigger and more complex the system, the harder it falls. It’s good to have contingency communication plans.
MikeP_UK
Once again Microsoft’s failure to test properly and fully exposes a serious flaw in use. I used to work in software development and we always did both scripted testing and user testing at the beta stage. Scripted testing, which Microsoft and many others rely upon, only shows problems with what the development team considered worth checking. It relies upon their understanding of the specification for the coded application, but that could be full of misunderstandings and basic errors. Only when you let a ‘user’, who has not been involved in the development stages, run their fingers over the resultant application can you start finding some of the fundamental errors and wrong assumptions made by the dev team.
Real people are better at testing something than a machine that is only told what to do.
David Shumate
Sure is odd that a mature organization like Microsoft is making some very basic rules-based mistakes that most of us learned years ago and stick to without exception. The admissions made by Microsoft should cause concern because most of the issues identified by them are rookie mistakes that anyone with a background in this work makes certain they never repeat; we all found these things out the hard way, years ago. So, who at Microsoft is allowing rookies to make decisions for the deployment and monitoring of critical infrastructure that is causing massive outages to thousands of companies that rely on them? Head in hands.
Mark
We use Office365 and it really feels like a product in constant development. They often roll out changes that can cause us issues.
People can put up with occasional issues, but when the system goes down for an extended period and starts to affect productivity, the knock-on effects can be serious (for example, access to your email being prevented could result in the loss of an order).
J K Birks
I guess the lesson to take away from this is never put all your eggs in one basket, and ensure that you have redundant alternatives when problems like this arise.
You could consider hosting your own multi-factor authentication server etc, but I guess we just have to ensure we are as prepared as possible for when things like this occur, and when they do we learn from them.