Tuesday, November 5, 2024

Microsoft working on OS update to prevent another IT outage


Microsoft says it’s working on Windows to allow endpoint security solutions to operate effectively outside of the operating system’s kernel, all with a view to preventing any future CrowdStrike-esque mega-outages.

Acknowledging the calls from customers and vendors to do this, Microsoft noted a number of challenges which must be overcome so those new capabilities satisfy the demands.

Performance needs outside of kernel mode and anti-tampering protections are among the issues requiring attention, it seems. Microsoft said it would consider security sensor requirements and secure-by-design as it tries to improve the architecture of Windows to allow antivirus tools to securely scrutinize systems while running in a lower-privileged space or environment.

The news comes from the Windows giant’s no-press-allowed security summit held this week. It appears Microsoft heard the angry hisses of the vultures, and decided to make the details of the summit public after hinting last month that it might not.

As expected, in a room full of infosec experts from vendors all discussing the inner workings and weaknesses of the endpoint security ecosystem, not everything was revealed in Microsoft’s blog summarizing the event. Bad guys are always watching, and all that. 

However, those with a vested interest in the matter appeared to receive the summit and its conclusions well. 

Joe Levy, CEO at Sophos, said in a statement: “Microsoft’s Windows Endpoint Security Ecosystem Summit was a critical call to action for endpoint security providers following the global IT outage in July. The Summit gave us a chance to come together to start a dialogue about how and why we need to rethink important topics, such as kernel architectures, the risk of monocultures, safe deployment practices, vendor transparency, and much more.

“Before the outage, most of the world wasn’t thinking about who or what has access to the kernel, ELAM [Early Launch AntiMalware] features, data update rollouts, and other technologies that make protections ‘just happen’ for users, but that requires precise technical and architectural planning. Alarmingly, some security companies were not thinking sufficiently about these either.” 

Levy’s sentiment was largely echoed by others in attendance, including execs from Broadcom, SentinelOne, Trellix, and Trend Micro. ESET’s take was the same, but it also stated that maintaining kernel access for security products is “imperative.”

Microsoft pointed to planned changes to Windows that were announced back in May, before the whole CrowdStrike disaster, including an intent to make kernel access available on a just-in-time basis rather than always-on.

To recap: July’s CrowdStrike outage was caused by a faulty sensor update to Falcon, the vendor’s endpoint security platform. The update came in the form of a channel file containing data that triggered a logic error, causing Falcon to crash in such a way that Windows followed suit with a BSOD, which bricked 8.5 million PCs worldwide.

Members of the infosec community piled on CrowdStrike in the early days after the outage, before the root cause was made public, with claims of poor quality assurance (QA) before issuing patches, and jokes about interns losing their jobs.

It’s worth noting that the QA angle was rejected by CrowdStrike. The company’s CEO George Kurtz recently said the sensor update was validated but suggested it was a freak incident that hasn’t occurred in the thousands of Falcon sensor updates issued over the years.

Kurtz said at Goldman Sachs’ Communacopia and Technology Conference this week: “So, in this particular case, we had a configuration change, which is like there’s no code, it’s just a config that the sensor consumes. And we went through a validation process and we validated all those. They actually worked. The problem is we had 21 of them and the sensor understood 20. And that’s the simple explanation of what happened.

“So, what have we changed in terms of the process? Well, we now run the configuration changes through not only the validation but all the various code QA processes we have and then deploy that in a phased rollout manner, as well as giving customers the choice on how they want to deploy that content.”
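
The mismatch Kurtz describes, a config carrying 21 values for a consumer that understood 20, is the sort of thing a defensive count check can catch before the data is used. What follows is a minimal Python sketch under stated assumptions, not CrowdStrike's code: `EXPECTED_FIELDS`, `ConfigMismatch`, `load_entry`, and `safe_load` are hypothetical names invented for illustration.

```python
# Hypothetical: the number of fields this consumer knows how to interpret.
EXPECTED_FIELDS = 20


class ConfigMismatch(ValueError):
    """Raised when a config entry does not have the expected number of fields."""


def load_entry(fields):
    """Accept an entry only if its field count matches what the consumer expects."""
    if len(fields) != EXPECTED_FIELDS:
        raise ConfigMismatch(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    return tuple(fields)


def safe_load(fields, fallback):
    """Fall back to a known-good entry instead of crashing on a bad one."""
    try:
        return load_entry(fields)
    except ConfigMismatch:
        return fallback
```

Calling `safe_load(["x"] * 21, fallback=None)` returns the fallback rather than raising, which is the point: a rejected config should degrade the consumer, not take it down.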

After the initial knee-jerk reactions to the outage died down, the more sensible critiques poured in from industry, namely those related to the degree to which security software can run in the Windows kernel. It was a matter about which some customers and infosec experts demanded answers and change.

Microsoft previously suggested that the EU forced it in 2009 to provide security vendors the same level of access to its OS as its own security products. This was against a backdrop of long-running European scrutiny of the company.

Regardless of the reasons why, the kernel change is coming soon, Microsoft promised, and those alterations will be informed by input from the wider industry.

“As a next step, Microsoft will continue to design and develop this new platform capability with input and collaboration from ecosystem partners to achieve the goal of enhanced reliability without sacrificing security,” it said in its summit summary.

The other long-term project to be progressed by Microsoft and security vendors is the development of best practices for the safe rollout of platform updates. The idea would be to adopt them across the entire vendor ecosystem.

“We face a common set of challenges in safely rolling out updates to the large Windows ecosystem, from deciding how to do measured rollouts with a diverse set of endpoints to being able to pause or rollback if needed,” Microsoft said. 

“A core [Safe Deployment Practices (SDP)] principle is gradual and staged deployment of updates sent to customers. Microsoft Defender for Endpoint publishes SDPs and many of our ecosystem partners such as Broadcom, Sophos, and Trend Micro have shared how they approach SDPs as well. 

“This rich discussion at the Summit will continue as a collaborative effort with our MVI partners to create a shared set of best practices that we will use as an ecosystem going forward.”
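
The gradual, staged deployment Microsoft describes, with the ability to pause or roll back, can be sketched in a few lines. This is an illustrative Python sketch only, not any vendor's published SDP: the function `staged_rollout`, the stage fractions, and the `max_failure_rate` threshold are all assumptions made for the example.

```python
def staged_rollout(hosts, stages, is_healthy, max_failure_rate=0.01):
    """Deploy to growing fractions of hosts, pausing if a stage looks unhealthy.

    hosts: ordered list of host identifiers
    stages: cumulative fractions of the fleet, e.g. (0.01, 0.1, 0.5, 1.0)
    is_healthy: callable returning True if a host is fine after the update
    """
    deployed = []
    start = 0
    for fraction in stages:
        # Cumulative cut-off for this stage, never moving backwards.
        end = min(len(hosts), max(start, round(len(hosts) * fraction)))
        batch = hosts[start:end]
        deployed.extend(batch)
        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures / len(batch) > max_failure_rate:
            # Stop here and report every host that now needs rolling back.
            return {"status": "paused", "deployed": deployed,
                    "rollback": list(deployed)}
        start = end
    return {"status": "complete", "deployed": deployed, "rollback": []}
```

With 100 hosts and stages of 1, 10, 50, and 100 percent, a bad update that breaks most machines is caught at the second stage, after touching 10 hosts rather than all 100.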

In the shorter term, Microsoft said it is committed to making “rapid progress” on matters such as the testing of critical components, sharing intel on product health, incident response effectiveness, and joint compatibility testing across diverse configurations.

We await further updates with some anticipation. ®
