10/17 boberdoo Outage

boberdoo experienced a severe outage Tuesday evening that began at 6:00 pm Central time and lasted over five hours. We are very sorry for any losses and inconveniences this may have caused.

The issue was with the data center's network. We lost connectivity to an entire range of our IP addresses in two of our cabinets. Network blips happen and are normally fixed quickly; this one the data center could not figure out. Eventually we decided to just move the affected servers to cabinets that were not having the problem, and that resolved the issue. Making matters worse, the outage also took down our email server and phone support. We posted updates to our Facebook and Twitter accounts while the problem was ongoing, but we need to come up with a better solution and communicate that solution to our clients. A problem we expected to be fixed "any minute" turned into hours. This was awful and we are very upset and sorry.

The Details

About four months ago our data center told us they were taking over a new floor of the building downstairs and that we would have to move to it by the end of October. Their first proposal was to unplug the cabinets and wheel them downstairs. That sounds simple enough, but it was estimated at roughly five hours of total downtime. I said there was no way we could do that, so we needed another solution. The plan we settled on was to set up a couple of new cabinets in the new space, connect the two floors' networks so they behaved as if they were adjacent, and then move servers down one by one. The cabinets were connected, and one of our system admins and one developer had been spending nights and weekends, along with basically all working hours over the past month, getting machines moved. As of Tuesday morning we were about 80% moved, with some bumps, but nothing crazy and nothing affecting all clients. Our web servers are load balanced and our databases run in pairs, so we can move them without taking things down.

Tuesday around 6:00 pm Central time we lost connectivity from the new-floor cabinets back upstairs to the old-floor cabinets. We had planned to move a few things Tuesday evening, but not until after 11:00 pm. Since we had not changed anything, we checked with the data center to see if they were doing any work, having network problems, or had bumped a cable loose.

Even with five different engineers digging around, they could find nothing. From the outside, nobody could connect to one of the IP address ranges in the old-floor cabinets. They checked all cable connections, restarted firewalls and switches, failed over to backup routers, and so on. Nothing helped. The IP range was reachable on the new floor but could not be reached on the old floor. We fully expected the problem to be found and fixed within minutes. When that did not happen and we had essentially run out of ideas, we decided to just move the remaining old-floor servers down to the new floor. That was around 10:15 pm. It took the data center about an hour to get the machines powered down, off the racks, downstairs, re-racked, cabled, and powered back up. By around 11:30 pm most of the systems were recovering.

To add insult to injury, we went through hundreds of hours of work and late nights to minimize downtime, only to have this happen. We obviously need to improve. We need to make communication better, and we need to separate not only our website, but also our email server and phone system. Just "move to Amazon" is unfortunately not so easy; we have had a team working on it for over a year and it is still not live. What you gain in redundancy, you lose in control and quality of support. Amazon is where we are headed, but it may still be a few months out, and with many of our clients busiest from November through February, the migration for current clients will probably happen after that.

FAQ

What happened to the data that attempted to post into boberdoo?

Unfortunately, our entire network was dead. Anything posted into boberdoo during the outage was lost, and we have no way to recover it.
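For clients who post leads into a system like ours, one way to avoid losing data during an outage is to buffer failed posts locally and replay them once the endpoint is reachable again. The sketch below is a minimal illustration under assumptions, not part of boberdoo's API: `send` stands in for whatever delivery call a client actually uses (e.g. an HTTP POST), and the in-memory `queue` list would be a durable store (file or database) in a real system.

```python
import json
import time

def post_with_buffer(payload, send, queue, max_retries=3, backoff=0.0):
    """Try to deliver a payload; on repeated failure, buffer it locally
    so it can be replayed after the outage instead of being lost."""
    for attempt in range(max_retries):
        try:
            send(payload)  # placeholder for the real delivery call
            return True
        except Exception:
            time.sleep(backoff * (2 ** attempt))  # simple exponential backoff
    # All retries failed: persist the payload for later replay.
    queue.append(json.dumps(payload))
    return False

def replay(queue, send):
    """Re-send buffered payloads once the endpoint is reachable again.
    Payloads that still fail stay in the queue."""
    remaining = []
    for raw in queue:
        payload = json.loads(raw)
        try:
            send(payload)
        except Exception:
            remaining.append(raw)
    queue[:] = remaining
```

With a durable queue on the client side, a multi-hour outage becomes delayed delivery rather than lost data.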

Why did it take so long to make the decision to move the servers to the other floor?

It took over an hour just to pinpoint that the problem affected only one IP range and only the old location. We have been in this data center for 8+ years. It is the biggest and nicest data center in Chicago, and historically any networking issues that have arisen have been resolved quickly. So as they worked through the problem, we kept feeling the solution was just around the corner and we'd be back up.

What measures are being taken to prevent this from happening in the future?

Although we cannot guarantee there will be no issues in the future, we can improve our reaction, communication, and client outreach in the event of a similar issue.

See our biggest takeaway and initial action plan here.