Microsoft Azure CTO Mark Russinovich Gets Technical About Pandemic Capacity Fixes

'The rise of digital technology usage has been dramatic, and I think it promises changes that are here to stay,' Russinovich says.

Microsoft Azure added 12 new edge sites around the world and increased its peering capacity by 25 percent to expand its wide area network after customers’ cloud usage surged following stay-at-home orders forced by the coronavirus (COVID-19) pandemic.

“In total, we added 110 terabits of increased capacity to our WAN in less than two months,” Microsoft Azure chief technology officer Mark Russinovich said in a video posted today.

As Microsoft’s cloud services, particularly Azure, Teams, Windows Virtual Desktop and Xbox Live, experienced unprecedented demand, the company prioritized critical front-line customers, scaled its services, implemented brownout controls, initiated optimizer services and shifted workloads out of “hot” cloud regions. The measures addressed surges that caused some customers to experience service slowdowns and outages.

Russinovich gave a behind-the-scenes, technical look at the cloud provider’s engineering response to the spike in demand for Microsoft Azure’s IaaS, PaaS and SaaS offerings as people started to work, learn and teach from home.

“The rise of digital technology usage has been dramatic, and I think it promises changes that are here to stay,” Russinovich said.

As Microsoft saw demand spike for the Xbox Live gaming service and Teams communication and collaboration platform, its first goal was to meet the demand by addressing capacity in hardest-hit regions.

“After we scaled out our services to meet that demand, the next thing we did was start to forecast the kind of demand that we'd see if the spike continued at the rate that it was on,” Russinovich said. “We wanted to make sure that we were prepared for that potential case, so we immediately started to implement brownout controls. Most of these brownout controls we never had to implement, but we had them there just in case the demand exceeded our ability to keep up with capacity.”
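Russinovich did not detail how the brownout controls work. As a rough illustration of the general idea, in which non-essential features are switched off as utilization climbs so that core functions keep running, here is a minimal Python sketch. The feature names, thresholds and the `disabled_features` helper are all hypothetical and are not taken from Microsoft's implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class BrownoutRule:
    feature: str          # hypothetical non-essential feature name
    disable_above: float  # utilization (0.0 to 1.0) above which the feature is shed


# Hypothetical rules, ordered from the first feature shed to the last.
RULES: Sequence[BrownoutRule] = (
    BrownoutRule("presence_refresh", 0.80),
    BrownoutRule("hd_video", 0.90),
    BrownoutRule("typing_indicators", 0.95),
)


def disabled_features(utilization: float,
                      rules: Sequence[BrownoutRule] = RULES) -> List[str]:
    """Return the features to switch off at the given capacity utilization."""
    return [rule.feature for rule in rules if utilization > rule.disable_above]


if __name__ == "__main__":
    for load in (0.50, 0.85, 0.92, 0.99):
        print(load, disabled_features(load))
```

Because every rule is just a threshold, controls of this kind can sit in place unused, which matches Russinovich's remark that most of them never had to be triggered.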

Microsoft also started optimizer services to free up capacity as customers continued to adopt its cloud services.

“We went into our services like Xbox, Teams and Windows Virtual Desktop and others…and optimized their CPU usage, as well as their cloud service resource usage,” Russinovich said. “We also shifted their loads from the hot regions into ones that had available capacity, so that our customers would have capacity in those hot regions.”
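The load shifting Russinovich describes can be pictured as a placement decision: an internal workload stays in its preferred region unless that region is hot, in which case it moves to a region with spare capacity. The Python sketch below is only an illustration under that assumption. The region names, the 0.85 threshold and the `pick_region` function are hypothetical.

```python
from typing import Dict, Optional


def pick_region(preferred: str,
                utilization: Dict[str, float],
                hot_threshold: float = 0.85) -> Optional[str]:
    """Choose a region for an internal workload.

    Keeps the preferred region unless it is "hot" (at or above the
    threshold); otherwise returns the least-utilized region below the
    threshold, or None if every region is hot.
    """
    # Treat an unknown preferred region as fully utilized.
    if utilization.get(preferred, 1.0) < hot_threshold:
        return preferred
    candidates = {r: u for r, u in utilization.items() if u < hot_threshold}
    if not candidates:
        return None
    return min(candidates, key=lambda region: candidates[region])


if __name__ == "__main__":
    # Hypothetical utilization figures per region.
    load = {"region-a": 0.95, "region-b": 0.60, "region-c": 0.70}
    print(pick_region("region-a", load))
    print(pick_region("region-c", load))
```

Moving Microsoft's own services this way leaves the remaining headroom in a hot region for customers, which is the outcome Russinovich describes.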

World Wide Technology (WWT), a St. Louis solution provider, welcomed Microsoft’s investments.

“The significant investments that Microsoft continues to make, especially during a time when the world is in the middle of a crisis, are beneficial to World Wide Technology,” said Dave Sellers, general manager of WWT’s multi-cloud practice. “As a Microsoft Gold Partner, the investments allow us to work with our clients in these challenging times to provide them with leading-edge solutions to enable them to perform at their peak.”

VPN And WAN Surges

Microsoft saw a huge increase in the use of its Azure Virtual Private Network (VPN) services. The VPN services let corporate users connect to their Azure networks, which can in turn connect securely to their corporate networks over encrypted channels. That traffic uses Microsoft's backbone to move between edge sites, corporate sites in different parts of the world and users' home PCs, according to Russinovich.

“We've got multiple VPN applications -- clients that customers can download -- and we saw a 700 percent increase -- starting in February and going up into mid-March -- of downloads of those VPN clients,” he said. “We saw 94 percent growth in connections to the Azure VPN service during that time.”

Microsoft Azure’s WAN includes 61 regions, 160,000-plus miles of fiber and subsea cable, more than 170 edge sites and more than 1 million miles of fiber cable per data center. The growth in WAN usage correlated directly to lockdowns, with traffic connecting its regions together and to the outside world spiking by 40x.

“We saw, coincident with that, a 50 percent increase in DDoS attacks, where malicious actors were trying to take down our corporate customers, which we defended through our Azure DDoS service,” Russinovich said.

When China locked down on Jan. 23, Microsoft saw a surge in WAN traffic that continued to grow through May. In Italy, WAN usage after a March 9 lockdown grew by more than 300 percent. In the United States, WAN usage immediately jumped by 60 percent in some cases after lockdowns started in early March.

“To meet this demand, we scaled out our WAN,” Russinovich said. “We started by adding 12 new edge sites around the world, so that customer traffic could enter our backbone as quickly as possible and not have to traverse between regions. We also increased our peering capacity by 25 percent, and that meant signing contracts with ISPs and deploying network gear to expand the network capacity into our networks. In total, we added 110 terabits of increased capacity to our WAN in less than two months.”

Windows Virtual Desktop Optimization

Windows Virtual Desktop, which is built 100 percent on Azure, also saw a giant spike in usage as corporations went from working on personal computers in an office environment to working on virtual desktops in the cloud.

“We saw 3x growth in just four weeks, and Microsoft itself became a heavy user of Windows Virtual Desktop to support our employees,” Russinovich said.

To optimize usage and to meet demand, Microsoft scaled out the cluster partition, which is how traffic enters the Windows Virtual Desktop service and through which the remote desktop protocol traffic flows.

“We added more gateways and front ends per cluster,” Russinovich explained. “We added additional clusters per region. We deployed more regions, and more regions are coming, because customers want Windows Virtual Desktop servers to be in the regions that their corporations are in.”

One of the hardest-hit services was an Azure SQL Database instance that stores information about which Windows Virtual Desktop sessions are active. That service had inefficiencies in the way that it optimizes indices, according to Russinovich.

“By optimizing the indices, they were able to drastically reduce the CPU usage load on the Azure SQL Database and avoid having to scale out,” he said. “The team also made other optimizations, like leveraging the read-only replicas of that Azure SQL Database instance to serve queries, instead of going to the primary all the time. That also helped reduce CPU load. They increased client-side caches on the front-end services, so that they could avoid going to the SQL Database altogether. And, finally, they also rebalanced traffic routing to nearby regions to help reduce the load on the WAN.”
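Two of the optimizations in that quote, serving queries from read-only replicas and enlarging client-side caches, can be combined in a small lookup layer. The Python sketch below shows the general pattern only. The `SessionLookup` class, its 30-second time-to-live and the `query_replica` callable are hypothetical stand-ins, not the Windows Virtual Desktop team's code or an Azure SQL Database API.

```python
import time
from typing import Any, Callable, Dict, Tuple


class SessionLookup:
    """Front-end cache in front of a read-only database replica."""

    def __init__(self,
                 query_replica: Callable[[str], Any],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._query_replica = query_replica  # hypothetical read-replica query
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}
        self.replica_queries = 0  # counts how often the database was hit

    def get(self, session_id: str) -> Any:
        now = self._clock()
        hit = self._cache.get(session_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh cache entry: no database round trip
        # Cache miss or stale entry: read from the replica, not the primary.
        self.replica_queries += 1
        value = self._query_replica(session_id)
        self._cache[session_id] = (now, value)
        return value


if __name__ == "__main__":
    lookup = SessionLookup(lambda sid: {"id": sid, "active": True})
    lookup.get("session-1")
    lookup.get("session-1")
    print(lookup.replica_queries)
```

The trade-off is staleness: a cached entry can lag the database by up to the time-to-live, so the value has to suit how quickly session state changes.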