{"id":6706,"date":"2013-01-09T07:00:33","date_gmt":"2013-01-09T12:00:33","guid":{"rendered":"http:\/\/www.ezrasf.com\/wplog\/?p=6706"},"modified":"2013-01-12T16:05:08","modified_gmt":"2013-01-12T21:05:08","slug":"anatomy-of-a-mistaken-shut-down","status":"publish","type":"post","link":"https:\/\/www.ezrasf.com\/wplog\/2013\/01\/09\/anatomy-of-a-mistaken-shut-down\/","title":{"rendered":"Anatomy of a Mistaken Shut Down"},"content":{"rendered":"<p>We intensely monitor our servers. We want to know things before a work ticket reaches us.<\/p>\n<p>So a month ago one morning I saw notifications where a couple servers failed login checks. (A process does a login and logout for each server multiple times an hour.) These go to the servers directly. Another check comes in the front door like a regular user. It was also failing, which is super bad.<\/p>\n<p><a title=\"Project 365: Day 014 by Ezra S F, on Flickr\" href=\"http:\/\/www.flickr.com\/photos\/sneezypb\/4278579751\/\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" alt=\"Project 365: Day 014\" src=\"https:\/\/i0.wp.com\/farm3.staticflickr.com\/2791\/4278579751_d0d4df2eae.jpg?resize=500%2C375\" width=\"500\" height=\"375\" \/><\/a><\/p>\n<p>My first instinct was to check whether there was a running process for our shutdown script. There was, and I killed it. Then I found the crontab entry that started this and removed it.<\/p>\n<p>At this point there was a hard decision to make very fast:<\/p>\n<ol>\n<ol>\n<li>Recover this one.<\/li>\n<li>Make sure the other instances are not affected.<\/li>\n<\/ol>\n<\/ol>\n<p>I ended up doing the latter. In retrospect, I guess I wanted to ensure I did not have multiple fires. If other instances were shutting down too, then I would ask coworkers to help. If just the one, then I could handle it. And it took only a couple of minutes to check the dates in the crontab of certain hosts for the shutdown script. 
Of the ten, this one was the only one affected.<\/p>\n<p>So I resumed the recovery. The first thing the shutdown script does is flip a flag in a file that tells the load balancer whether to allow traffic to the servers. I reversed that first. The half of the servers still running started picking up traffic, which ended the outage. Then I started up the 5 of 10 servers that had shut down.<\/p>\n<p>From the start of the outage to when users were back in was about 14 minutes.<\/p>\n<p>Usage was pretty light because the term ended a few days prior.<\/p>\n<p>This was probably a holdover from doing upgrades the year prior. Crontab does not have a year field, just month\/day or weekday, so an entry left in place fires again the next year. So we have to make sure we remove things targeted for a specific day. (Or start using <code>at<\/code> more.)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We intensely monitor our servers. We want to know things before a work ticket reaches us. So a month ago one morning I saw notifications where a couple servers failed login checks. (A process does a login and logout for each server multiple times an hour.) These go to the servers directly. 
Another check comes [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[139],"tags":[2607],"class_list":["post-6706","post","type-post","status-publish","format-standard","hentry","category-georgiaview","tag-posted-2013"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p1rUBW-1Ka","jetpack-related-posts":[],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/posts\/6706","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/comments?post=6706"}],"version-history":[{"count":0,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/posts\/6706\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/media?parent=6706"}],"wp:term":[{"taxonomy":"category","e
mbeddable":true,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/categories?post=6706"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ezrasf.com\/wplog\/wp-json\/wp\/v2\/tags?post=6706"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}