Have not slept well the past week… I have been the on-call DBA, so I have the pager. On Monday we joked that things seem to go worse when I have it. Well… this week has been a shining example.
Monday: OVIEW had issues starting at 5:15-ish. I had seen that behavior before and applied the previous fix. The monitoring showed a half dozen recoveries. So I went home. Well, on the way home it got much, much worse. So bad it was actually unavailable… but for only a couple minutes. 🙁 I didn’t go to bed until 1am because I was trying to figure out what went wrong.
Wednesday: Starting at 6pm, a redux of Monday. Amy worked on that for me, but I ended up staying up until 1am again looking at logs, trying to figure out what the issue was.
Thursday: “Emergency” restart of OVIEW to be more proactive. Didn’t get to sleep until 3am.
Saturday: A node in OVIEW crashed. (Middle of the day so thankfully no loss of sleep.)
Sunday: A node in OVIEW failed its 3am restart. This might point to the same underlying cause as the issues on Monday and Wednesday.
Thankfully I get to hand off the pager to a coworker tomorrow.