Juniper Router Failure Blamed In CloudFlare Outage

The firm said the outage affected all of its services, including DNS and any services that rely on its web proxy. The company blamed the outage on a "systemwide" failure of its edge routers.

A glitch caused routers to fail networkwide across the company's 23 data centers. The outage lasted about an hour. "When a router goes down, the routes to the network that sits behind the router are withdrawn from the rest of the Internet," the company said in a blog entry.

"We have already reached out to Juniper to see if this is a known bug or something unique to our setup and the kind of traffic we were seeing at the time," CloudFlare said.

The company is also providing service credits to accounts covered by service level agreements (SLAs). The firm said it encountered a distributed denial-of-service attack targeting one of its customers. The company used Flowspec, a protocol supported by Juniper, to propagate filtering rules efficiently to a large number of routers. An attack profile produced a rule to drop packets between 99,971 and 99,985 bytes long.

"What should have happened is that no packet should have matched that rule because no packet was actually that large," the company said. "What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed."
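A rule of this kind can be caught before it reaches any router, because the 16-bit Total Length field of an IPv4 header caps a packet at 65,535 bytes. The following is a minimal sketch of such a pre-deployment check, not CloudFlare's or Juniper's actual tooling; the `LengthRule` class, the `is_plausible` function and the 500-to-600-byte comparison rule are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

# Largest value the 16-bit IPv4 Total Length field can hold.
IPV4_MAX_PACKET_BYTES = 65_535


@dataclass(frozen=True)
class LengthRule:
    """Hypothetical drop rule that matches packets by total length in bytes."""
    min_bytes: int
    max_bytes: int


def is_plausible(rule: LengthRule, limit: int = IPV4_MAX_PACKET_BYTES) -> bool:
    """Return True only if the range is well formed and a real packet could match it."""
    if rule.min_bytes < 0 or rule.min_bytes > rule.max_bytes:
        return False
    # A range that starts above the limit can never match a real packet.
    return rule.min_bytes <= limit


# The range reported in the outage, next to an illustrative (assumed) range.
suspect = LengthRule(min_bytes=99_971, max_bytes=99_985)
typical = LengthRule(min_bytes=500, max_bytes=600)

print(is_plausible(suspect))  # False
print(is_plausible(typical))  # True
```

A check like this only rejects rules that cannot match anything; it says nothing about how a router behaves when it receives one.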

Cloud reliability has been commonly cited as a serious risk for storing data in the cloud or using cloud-based services. Cloud-hosting companies had regular outages in 2012 and security experts warn that companies need to address the issue in service contracts and put a continuity plan in place to limit business disruptions.

CloudFlare said some of the Juniper routers did not automatically reboot and the company didn't have access to the routers' management ports, causing a delay in coming back online. "Even though some data centers came back online initially, they fell back over again because all the traffic across our entire network hit them and overloaded their resources," the company said.

CloudFlare said its team began to restore the network within 30 minutes, with full restoration in about an hour.

"We will be doing more extensive testing of Flowspec-provisioned filters and evaluating whether there are ways we can isolate the application of the rules to only those data centers that need to be updated, rather than applying the rules networkwide," CloudFlare said.
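One way to picture that kind of isolation is to select target sites from observed attack traffic and push a rule only to those. The sketch below assumes per-site packet counts are available and uses a hypothetical `push` callback in place of any real router API; the site names, threshold and rule label are made up for illustration.

```python
from typing import Callable, Dict, List


def sites_needing_rule(attack_pps_by_site: Dict[str, int], threshold_pps: int) -> List[str]:
    """Pick only the sites whose observed attack traffic reaches the threshold."""
    return sorted(
        site for site, pps in attack_pps_by_site.items() if pps >= threshold_pps
    )


def scoped_rollout(
    rule: str,
    attack_pps_by_site: Dict[str, int],
    threshold_pps: int,
    push: Callable[[str, str], None],
) -> List[str]:
    """Push the rule to the selected sites only and return the sites touched."""
    targets = sites_needing_rule(attack_pps_by_site, threshold_pps)
    for site in targets:
        push(site, rule)  # hypothetical stand-in for a per-site deployment call
    return targets


# Made-up measurements: attack packets per second seen at each site.
observed = {"site-a": 120_000, "site-b": 40, "site-c": 95_000}
pushed = []
touched = scoped_rollout(
    "drop-suspect-traffic",
    observed,
    10_000,
    lambda site, rule: pushed.append((site, rule)),
)
print(touched)  # ['site-a', 'site-c']
```

Limiting the set of sites that receive a rule also limits how many can fail at once if the rule turns out to be faulty.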

A Juniper spokesperson contacted by CRN said the company is aware of the incident and is investigating how the CloudFlare outage took place. The company is not aware of any other customers experiencing similar issues, the spokesperson said.

"While we have not completed our investigation, we believe this incident was triggered by a product issue that Juniper identified last October, when a patch was also made available," the spokesperson said in an email message. "Our customer support team is actively supporting CloudFlare in its efforts to resolve the issue."

CloudFlare has had security issues in the past. In June 2012, the company acknowledged that a series of problems led to a data-security breach of its network and an attack on one of its customers.