Outage Alert (Completed): Network issue affecting all UK hosting, 24.11.2018 04:20 UTC


Apologies, I received this mail just before 11pm UTC, while I was asleep:

We've noticed a memory leak on the MX480-THN2 router. A case was raised with JTAC, who are investigating the cause. As a preventive measure we'll be switching this router to the backup routing engine, which is not affected by the leak, in a controlled manner, to avoid any potential uncontrolled disruption.

Although downtime should not occur in the configuration this router is running, all failover operations inherently carry some risk, so we'll perform this operation during the period of lowest utilisation and with engineers present on site. Customers directly connected to this device might be affected, although we'll route around it where possible before commencing the operation. Engineers will check routing engine synchronisation as well as nonstop-routing/bridging status before commencing.
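For reference, pre-checks of this kind are typically done with standard Junos operational commands; a sketch only (exact output and command availability vary by platform and release):

    show chassis routing-engine      (RE roles plus memory/CPU utilisation on re0 and re1)
    show system switchover           (run on the backup RE: reports whether it is ready to take over)
    show task replication            (nonstop-routing replication state per protocol)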

=== Date: 24/11/2018 Window: 2:00 AM GMT - 6:00 AM GMT

Rollback plan: in case of issues we'll switch routing engine 0 back to the master role.
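As a sketch, this rollback maps onto the standard Junos mastership command (run on whichever RE holds mastership at the time; the exact procedure depends on the chassis state):

    request chassis routing-engine master switch     (hand mastership back from re1 to re0)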

I've opened a ticket with them to find out what's happening now.


RFO received

Earlier today we observed a significant memory leak on the master RE (re0). With our normal memory consumption at 17%, from around 2pm we saw this increase to 93% (and re0 started swapping). It became critical to implement the fix as soon as possible: given the ever-increasing leak, it was only a matter of time before rpd (the routing protocol daemon) crashed, causing a full-blown outage, potentially at peak time. We were faced with a choice: either announce an emergency window and tackle it in a controlled manner, or announce a normal window and risk it crashing at peak time. We decided to act now, in agreement with JTAC. Based on our traffic statistics we established the lowest traffic to be after 3am and before 7am. An emergency window was announced and executed.
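To give a rough sense of the urgency, a linear extrapolation from the figures above (a sketch only: the ~9-hour elapsed window is an assumption based on the ~2pm start and the ~11pm notification, and real leaks are rarely perfectly linear):

```python
# Rough linear extrapolation of the re0 memory leak described above.
baseline_pct = 17.0      # normal memory consumption (from the RFO)
observed_pct = 93.0      # consumption at the time of the report (from the RFO)
elapsed_hours = 9.0      # ~2pm to ~11pm (assumed, not stated in the RFO)

leak_rate = (observed_pct - baseline_pct) / elapsed_hours     # percent per hour
hours_to_exhaustion = (100.0 - observed_pct) / leak_rate      # headroom remaining

print(f"leak rate: {leak_rate:.1f}%/hour")                    # -> 8.4%/hour
print(f"headroom before exhaustion: {hours_to_exhaustion:.1f} hours")  # -> 0.8 hours
```

On these assumed numbers, less than an hour of headroom remained at the leak's observed rate, which is consistent with the decision to act immediately rather than schedule a normal window.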

The mastership switchover was successful: re1 became master and re0, with the memory leak, became backup. Several minutes later we received the following emergency log from the router:

Message from syslogd@MX480-THN2 at Nov 24 04:22:37 ...
MX480-THN2 fpc1 local0.emerg TurboTx[713]: FATAL ERROR: Invalid DB. Exiting Turbotx process

followed by FPC1 disconnecting from the chassis and rebooting. FPC1 carries the port to Telia, 4 of the 12 ports to Telehouse Dedicated Server Customers, and the core link to Equinix LD8, which feeds TATA, Liberty Global and Cogent. Ports stayed physically up during the crash and a significant part of the FPC reboot process. While we run N+1 resiliency at this site, it did not fully mitigate this particular problem and only helped reduce the impact.

This created a number of issues: 1) Telia, TATA, Cogent and Liberty Global continued to route traffic to a port that was not accepting traffic, causing an effective blackhole, and 2) connectivity to Dedicated Server Customers at Telehouse was impacted for some time until the switch dropped the non-responsive ports. This in turn created up to two outages depending on the service: 1) a partial outage until BGP sessions dropped and reconverged, and 2) a Layer 2 outage on VPLS terminating towards Dedicated Server Customers due to the high error rate.
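A rough sense of why the blackhole persisted until BGP reconverged: with the ports physically up, peers could only detect the failure via BGP's own timers. A sketch using common protocol defaults (the peers' actual timer settings are not known from the RFO):

```python
# With link-down detection unavailable (ports stayed physically up),
# a BGP peer only withdraws routes once its hold timer expires without
# receiving a keepalive. Values below are common defaults, assumed here.
keepalive_s = 30                    # typical BGP keepalive interval
hold_time_s = 3 * keepalive_s       # typical hold time: 3 x keepalive = 90s

# Worst case, a peer keeps forwarding into the dead port for the full
# hold time before tearing the session down and reconverging.
worst_case_detection_s = hold_time_s
print(f"worst-case failure detection via BGP timers: {worst_case_detection_s}s")
```

This is consistent with a partial outage lasting minutes rather than seconds: timer expiry is followed by route withdrawal and reconvergence across several peers.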

Traffic returned for the majority of affected Customers in under 10 minutes; partial downtime continued for some Customers for up to 25 minutes. Customers using VPLS saw their service down for the duration of the FPC1 downtime for endpoints connected directly via FPC1.

We have taken action, as per Juniper's advice, to stop the leak occurring in the future. We don't anticipate any further downtime at this stage and the network should be considered stable.

Please accept our apologies for the inconvenience caused.