Microsoft’s multi-factor authentication (MFA) for Office 365 and Azure Active Directory has fallen over for the second time in a week.
Azure’s service status page delivered Tuesday’s bad news:
Between 14:25 UTC and 17:08 UTC on 27 Nov 2018, customers using Multi-Factor Authentication (MFA) may have experienced intermittent issues signing into Azure resources, such as Azure Active Directory, when MFA is required by policy.
Officially, that’s just shy of three hours with MFA either unavailable or intermittent, although it took until 18:53 UTC for Microsoft’s Twitter account to become confident enough to announce that the service was definitely up and running again.
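For anyone who wants to check the arithmetic, here is a minimal sketch that works out both durations from the timestamps quoted above. The variable and function names are our own, chosen for illustration only.

```python
from datetime import datetime, timezone

# Timestamps taken from the status page and the all-clear tweet (all UTC).
start = datetime(2018, 11, 27, 14, 25, tzinfo=timezone.utc)
end = datetime(2018, 11, 27, 17, 8, tzinfo=timezone.utc)
all_clear = datetime(2018, 11, 27, 18, 53, tzinfo=timezone.utc)


def minutes_between(a, b):
    """Return the whole number of minutes from a to b."""
    return int((b - a).total_seconds() // 60)


official = minutes_between(start, end)           # official outage window
until_tweet = minutes_between(start, all_clear)  # until the all-clear tweet

print(f"Official window: {official // 60}h {official % 60}m")
print(f"Until all-clear: {until_tweet // 60}h {until_tweet % 60}m")
```

The official window comes to 2 hours 43 minutes, and the gap between the first reported problems and the all-clear tweet to 4 hours 28 minutes.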
Microsoft’s initial root cause analysis (RCA): something went wrong at the DNS level, which led the infrastructure supporting MFA to become “unhealthy”.
The solution was to reboot, which seemed to work, although it earned Microsoft several sarcastic tweets congratulating it on successfully turning things off and on again.
Déjà vu – all over again
This issue is the latest in what’s fast becoming a long line of bloopers for Microsoft in recent weeks. The company has only just published an explanation for a longer and more serious MFA outage suffered on 19 November that left many customers unable to log into Office 365 or Azure for an entire working day, or in some cases, longer.
This included frank admissions about what the company said were three interconnected root causes:
- Under high traffic loads, the Azure MFA front-end server’s communication with cache services (which, ironically, exist to boost performance) deteriorated.
- This caused a ‘race condition’ in processing responses from the MFA’s backend servers, meaning the outcome depended on the order in which things happened to occur, which left different parts of the MFA system out of sync with one another badly enough to stop them communicating properly.
- This then caused the backend services to overload, at which point MFA stopped working.
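For readers unfamiliar with the term, here is a deliberately simplified sketch of a race condition. It is a generic illustration and not a reconstruction of Microsoft’s MFA code: the workers `A` and `B`, the `pending` counter and the `run` function are all hypothetical. Two workers each read a shared counter and then write back the value plus one, and the final result depends on how their steps interleave.

```python
def run(schedule):
    """Replay the read/write steps of two workers in the given order.

    Each worker first reads the shared counter into its own local copy,
    then writes back that copy plus one. Everything here is hypothetical.
    """
    shared = {"pending": 0}
    local = {}
    for worker, step in schedule:
        if step == "read":
            local[worker] = shared["pending"]
        else:  # "write"
            shared["pending"] = local[worker] + 1
    return shared["pending"]


# Each worker finishes before the other starts: both updates are kept.
serial = run([("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")])

# Both workers read before either writes: one update is silently lost.
interleaved = run([("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")])

print(serial, interleaved)
```

The same two workers doing the same work end up with different results purely because of timing, which is why such bugs tend to surface only under heavy load.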
Extraordinarily – this is the bit that will make some customers sit up – Microsoft didn’t notice any of this until users started complaining about MFA’s disappearance.
How so? Because:
Gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time.
Microsoft then explains how attempting to fix the above problems for APAC and EMEA regions by re-routing MFA traffic via the US caches simply made things worse there too.
Having issued a post-mortem for the first outage, Microsoft has promised to follow up with something similar for Tuesday’s.
What might be going on?
There is perhaps a small clue in the analysis for the 19 November outage where Microsoft mentions that the service was struggling to cope with high traffic levels.
Perhaps, then, it’s simply that lots of organisations and consumers have been turning on MFA, which wouldn’t be surprising given that Microsoft itself has been promoting the extra security benefits that it can bring.
So, let’s be positive: the outages might not be symptoms of MFA’s failure but rather of its sudden – and very welcome – popularity.