We're Back: Colocation Provider Bungles for Two Days

We are (we hope) finally beyond the worst downtime in this site’s short history. Alas, my customers’ sites were down as well, for two periods on Friday and for most of the day Saturday.

What went wrong? There is a chain of helplessness that appeared to end at AtlantaNAP. The company that provides my hosting, Rails Machine, was as helpless as I was to fix the problem. Rails Machine operates about 20 servers in a hosting facility called SiteSouth. SiteSouth, as it turns out, operates a cage full of server racks in a large facility called AtlantaNAP.

Update: In the original version of this article, I laid the blame squarely upon AtlantaNAP. I got a call from one of their representatives on Monday morning stating that the problem was, in fact, SiteSouth’s, and that AtlantaNAP had done what they could to help them out. From my vantage point, I can’t tell who was really at fault here. I’ve updated the article to reflect this ambiguity.

Either AtlantaNAP or SiteSouth had some sort of router problem that apparently caused some or all of the servers in SiteSouth’s cage, including all those operated by Rails Machine, to lose connectivity. AtlantaNAP has many connections to the Internet to provide redundancy, since reliable connectivity is one of the core attributes for a hosting provider. But if a router between those multiple connections and your site fails or is misconfigured, then it doesn’t matter how many connections to the Internet there are on the other side of the router.

That this kind of failure can happen is understandable. For a colocation facility to take an hour and 45 minutes to correct it is bad, but tolerable if it’s an extremely rare event. But for the failure to recur twice more that same day, and then for many hours the next day, is unforgivable.

I suspect some finger-pointing may continue between SiteSouth and AtlantaNAP. Whoever was to blame, both companies are likely to lose some business over this. Within an hour of the first outage, Bradley Taylor at Rails Machine was looking for a new hosting facility, at least as a supplement, and that effort surely intensified as the outages kept recurring.

Here’s the log from the Site24x7 monitoring service, showing the start time and total downtime for each outage:

  • April 6, 7:51 AM: 1 Hrs 46 Mins
  • April 6, 9:55 AM: 1 Hrs 17 Mins
  • April 6, 2:14 PM: 13 Mins 30 Secs
  • April 7, 9:52 AM: 4 Hrs 21 Mins
  • April 7, 3:47 PM: 13 Mins 51 Secs
  • April 7, 5:33 PM: 1 Hrs 29 Mins
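To put a single number on those entries, here is a minimal Python sketch that adds up the logged durations per day and overall. The `LOG` list is transcribed by hand from the entries above, and the helper names (`parse_duration`, `format_duration`) are made up for this illustration; nothing here is part of the monitoring service's own tooling.

```python
import re
from collections import defaultdict

# Outage durations transcribed by hand from the monitoring log above.
LOG = [
    ("April 6", "1 Hrs 46 Mins"),
    ("April 6", "1 Hrs 17 Mins"),
    ("April 6", "13 Mins 30 Secs"),
    ("April 7", "4 Hrs 21 Mins"),
    ("April 7", "13 Mins 51 Secs"),
    ("April 7", "1 Hrs 29 Mins"),
]

# Seconds per unit label, matching the labels used in the log.
UNIT_SECONDS = {"Hrs": 3600, "Mins": 60, "Secs": 1}


def parse_duration(text):
    """Convert a string such as '1 Hrs 46 Mins' into a number of seconds."""
    parts = re.findall(r"(\d+)\s+(Hrs|Mins|Secs)", text)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def format_duration(seconds):
    """Render a number of seconds as 'Hh MMm SSs'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# Accumulate downtime per calendar day.
per_day = defaultdict(int)
for day, duration in LOG:
    per_day[day] += parse_duration(duration)

total_seconds = sum(per_day.values())

for day, seconds in per_day.items():
    print(day, format_duration(seconds))
print("Total", format_duration(total_seconds))
```

By this count the six outages add up to a little over nine hours and twenty minutes, with roughly two thirds of that on April 7.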

What Happened?

I don’t know yet, and may never know, what really happened. Here’s what I could see from the outside.

Here’s the tracert output for the path from my provider (Comcast) to AtlantaNAP during the outage. The trace bounced back and forth between two IP addresses (more times than I’ve shown here) and never reached my server:

  6    11 ms     9 ms    21 ms  te-8-1-ar01.sfsutro.ca.sfba.comcast.net [68.87.192.137]
  7    10 ms     9 ms     9 ms  68.86.143.9
  8     *        *       14 ms  68.86.90.165
  9    11 ms    11 ms    11 ms  64.215.30.201
 10    71 ms    68 ms    70 ms  NLAYER-COMMUNICATIONS-INC.ge-4-1-0.410.ar4.ATL1.gblx.net [206.41.25.230]
 11    69 ms    71 ms    69 ms  atl-core-a-tgi2-1.gnax.net [209.51.149.105]
 12    70 ms    69 ms    69 ms  63.247.69.182
 13    69 ms    69 ms    70 ms  209.51.156.5
 14    69 ms    69 ms    71 ms  209.51.156.6
 15    69 ms    69 ms    69 ms  209.51.156.5
 16    71 ms    69 ms    69 ms  209.51.156.6

When the problem was finally (I hope!) fixed Saturday evening, the route changed, getting quickly to my server. All the changes occur in AtlantaNAP’s (or SiteSouth’s) routing. Here’s the current route:

  6    15 ms     9 ms     *     te-8-1-ar01.sfsutro.ca.sfba.comcast.net [68.87.192.137]
  7    11 ms    11 ms     9 ms  68.86.143.9
  8     *       12 ms     *     68.86.90.165
  9    11 ms    12 ms    11 ms  64.215.30.201
 10    69 ms    69 ms    69 ms  NLAYER-COMMUNICATIONS-INC.ge-4-1-0.410.ar4.ATL1.gblx.net [206.41.25.230]
 11    84 ms    72 ms    71 ms  atl-core-a-tgi2-1.gnax.net [209.51.149.105]
 12    70 ms    69 ms    69 ms  63.247.69.182
 13    70 ms    71 ms    68 ms  207.210.123.118
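The difference between the two traces can also be checked mechanically. The sketch below is only an illustration: it takes the hop addresses copied by hand from the two traces above and reports whether any address shows up a second time, which is the signature of packets bouncing between routers. The function name `find_routing_loop` and the two lists are my own, and the check assumes each hop has already been reduced to a single address, with timed-out hops left out.

```python
def find_routing_loop(hops):
    """Return the repeating run of addresses if the trace revisits a hop, else None."""
    first_seen = {}  # address -> index of its first appearance in the trace
    for index, address in enumerate(hops):
        if address in first_seen:
            # Everything from the first visit up to this revisit forms the loop.
            return hops[first_seen[address]:index]
        first_seen[address] = index
    return None


# Hops 6 onward from the trace taken during the outage.
FAILED_TRACE = [
    "68.87.192.137", "68.86.143.9", "68.86.90.165", "64.215.30.201",
    "206.41.25.230", "209.51.149.105", "63.247.69.182",
    "209.51.156.5", "209.51.156.6", "209.51.156.5", "209.51.156.6",
]

# Hops 6 onward from the trace taken after the fix.
WORKING_TRACE = [
    "68.87.192.137", "68.86.143.9", "68.86.90.165", "64.215.30.201",
    "206.41.25.230", "209.51.149.105", "63.247.69.182",
    "207.210.123.118",
]

print(find_routing_loop(FAILED_TRACE))   # ['209.51.156.5', '209.51.156.6']
print(find_routing_loop(WORKING_TRACE))  # None
```

The two traces match through 63.247.69.182, so whatever changed sits beyond that hop. This fits the loop being inside the facility's routing, but it does not say which company's equipment was responsible.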

I hope to have some news in the next few days, partly about what went wrong but, more important at this point, about Rails Machine’s plan for moving beyond SiteSouth and AtlantaNAP.

Comments

  1. Michael George, April 29, 2007 @ 01:47 PM
    I can assure you that this was a 100% AtlantaNap issue. I'd like to know who from AtlantaNap contacted you and said otherwise.