We intensely monitor our servers. We want to know things before a work ticket reaches us.
One morning about a month ago, I saw notifications that a couple of servers had failed login checks. (A process logs in to and out of each server multiple times an hour.) These checks go to the servers directly. Another check comes in the front door like a regular user. It was failing too, which is super bad.
My first instinct was to check whether a process was running for our shutdown script. There was one, and I killed it. Then I found the crontab entry that had started it and removed it.
At this point there was a hard decision to make very fast:
- Recover this one.
- Make sure the other instances are not affected.
I ended up doing the latter. In retrospect, I guess I wanted to make sure I did not have multiple fires. If other instances were shutting down too, I would ask coworkers to help. If it was just the one, I could handle it. And it took only a couple of minutes to check, by looking at the dates in the crontab entries for the shutdown script on certain hosts. Of the ten, this one was the only one affected.
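That check boils down to finding the crontab lines that mention the shutdown script and reading their day and month fields. Here is a minimal sketch in Python, assuming the crontab text has already been pulled from each host (for example with crontab -l). The script name shutdown_servers.sh, the host names, and the sample entries are all made up for illustration, not our real setup.

```python
def find_script_entries(crontab_text, script_name):
    """Return (day_of_month, month, line) for each active entry that runs script_name."""
    hits = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # A standard entry has five schedule fields followed by the command.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        minute, hour, dom, month, dow, command = fields
        if script_name in command:
            hits.append((dom, month, line))
    return hits


# Hypothetical crontab contents for two hosts.
crontabs = {
    "host-a": "# nightly backup\n"
              "0 2 * * * /opt/bin/backup.sh\n"
              "30 6 14 5 * /opt/bin/shutdown_servers.sh\n",
    "host-b": "0 2 * * * /opt/bin/backup.sh\n",
}

for host, text in crontabs.items():
    for dom, month, line in find_script_entries(text, "shutdown_servers.sh"):
        print(f"{host}: runs on day {dom} of month {month}: {line}")
```

Any host that prints a line still has a scheduled shutdown waiting, and the day and month tell you how soon it will fire.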
So I resumed the recovery. The first thing the shutdown script does is flip a flag in a file that tells the load balancer whether to allow traffic to the servers. I reversed that first. The half of the servers that were still running started picking up traffic, which ended the outage. Then I started up the 5 of 10 servers that had shut down.
From the start of the outage to when users were back in was about 14 minutes.
Usage was pretty light because the term ended a few days prior.
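This was probably a holdover from doing upgrades the year before. A crontab entry has no year field, just month and day of month, or weekday. So an entry aimed at one specific day will fire again on that date every year, and we have to make sure we remove such entries once they have run. (Or start using at more, since an at job runs only once.)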
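To make the no-year point concrete, here is a small Python sketch of how a date-pinned entry matches. It is simplified: it only handles a plain number or * in the day-of-month and month fields, while real cron also accepts ranges, lists, and steps. The function name and the dates are illustrative.

```python
from datetime import date


def matches_date(dom_field, month_field, day):
    """True if a crontab day-of-month/month pair matches the given date.

    Simplified: each field is either "*" or a single number.
    Note there is no year field to compare against.
    """
    dom_ok = dom_field == "*" or int(dom_field) == day.day
    month_ok = month_field == "*" or int(month_field) == day.month
    return dom_ok and month_ok


# An entry pinned to 14 May is checked against the same date in three years.
for year in (2012, 2013, 2014):
    print(year, matches_date("14", "5", date(year, 5, 14)))
```

The entry matches in all three years, which is how a one-off job left in place comes back to bite you twelve months later.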
Tags: posted 2013