Check Backups

I get a daily report about the backups for each of our projects. One project in particular has shown some odd results. The report has columns for Completed, Successful, Partial, Failed, Missed, and Active. The problem is that a backup shows up in none of those columns while it is still actively running. (Should it not show up as Active?)

So the other day I wrote a Bash script to check a few things. In the future I can quickly assess whether this is the usual SNAFU (Situation Normal: All F***ed Up) or something else. Before this, I had to dig back through my Bash history for the commands to do it manually, which is stupid.

#!/bin/bash
###########################################################
# Check backup status.
# 2015-JUL-24 Ezra Freelove, email@domain.com
###########################################################
# Look for running processes
echo "... Backup processes"
ps -ef | grep [p]rocname
# Report logs
echo "... Logs"
ls -ltr /path/to/agent.log /session/path/to/clientlogs/*.log
echo "=========="
# Specific lines excluding spam and blank
tail -100 /path/to/agent.log | grep -vf /home/me/myscripts/backups/exclude/ckrunning.txt | grep -v "^[[:space:]]*$"

If the backup is still running, then ps will show the PID and the time it started.

In addition, ls -ltr lists the logs with the most recently changed at the end.

The agent talks to a central service to find out when it should be doing work. The exclude/ckrunning.txt file uses the entries below to ignore the spammy lines where the agent checks in but is not told to do anything. From my checks, this ignores about 70 of the 100 lines when everything is operating normally. (The literal file contents appear after the list.)

  • Sleeping
  • Workorder received: sleep
  • Requesting work
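
So ckrunning.txt is literally those three entries, one pattern per line; grep -vf reads each line of the file as a pattern and drops any log line matching one of them:

Sleeping
Workorder received: sleep
Requesting work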

Long-term, I guess I should ask why the long-running backup does not appear in the Active category when it obviously should.

Separating Bash Histories

Years ago, maybe 2011 based on the age of the files, our Systems group moved our Linux home directories to a central system. My only real complaint about this move was finding anything I needed in my Bash history. See, I am terrible at remembering things and often make typos. It is easier to go back in my history to a prior command and either run it again or modify it and run that. Sharing the same home directory across all these systems complicated things by co-mingling commands from every host. I was able to find things. Just eventually. That seemed inefficient.

Eventually, this situation annoyed me to the point I decided to fix it. And the fix was so simple it is amazing that I did not address it immediately rather than suffering with it for a couple of years. (Well, actually, we picked Desire2Learn before the change, so 90% of my server responsibilities were on Windows. Only when I was promoted to Technology Strategist and returned to mostly working in Linux did it get annoying enough to address.)

The fix? Add the hostname to the HISTFILE variable in .bash_profile.

export HISTFILE="${HOME}/.bash_history.`hostname`"

Apparently I made the change back on December 18th. In the six months since, I have not noticed any oddities with the history. This morning I noticed that I have about twenty different host-named history files of various sizes and dates.

Given the number of files, while writing this post, I decided to re-organize these into a directory. (An organized home directory is a happy home directory. Heh.)

export HISTFILE="${HOME}/.bash_history/.bash_history.`hostname`"

Then I ran these.

mv .bash_history .bash_history.org
mkdir .bash_history
mv .bash_history.* .bash_history

Then I exited, which dumped that session's history into a file at the old location. I logged in again and used cat with an output redirect to append those new lines to the correct file in the new location.
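
That append step looked something like this, give or take the exact filenames (the trailing rm just tidies up the stray file):

cat ${HOME}/.bash_history.`hostname` >> ${HOME}/.bash_history/.bash_history.`hostname`
rm ${HOME}/.bash_history.`hostname`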

Exited again and logged in again. And everything still looks good.

Counting

In the beginning I used TOTAL=`ls /d2lmig/*/*/bak/*/* | wc -l` to get a total count. All was good.

Until at around 55,000 files I got: -bash: /bin/ls: Argument list too long.

Then I used TOTAL=`find /d2lmig/*/*/bak/*/* -name *.bak | wc -l` to get a total count. All was good.

Until at around 90,000 files I got: -bash: /usr/bin/find: Argument list too long.
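
Both errors come from the shell, not from ls or find: the shell expands /d2lmig/*/*/bak/*/* into one enormous argument list before the command even runs, and eventually that list blows past the kernel limit. Letting find walk the tree itself avoids the expansion entirely; something like this, assuming the same directory layout:

TOTAL=`find /d2lmig -path '*/bak/*' -name '*.bak' | wc -l`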

It happened that I was already using a for ... do ... done loop and getting a count for each bak directory. So I added this within the for loop:

TOTAL=`expr $TOTAL + $COUNT`
if [ "$RUNTYPE" = "INTERNAL" ] ; then echo " Running total = $TOTAL" ; fi

(INTERNAL is a value I pass at the command line that controls whether the email is sent to me or other parties.)
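
For context, the surrounding loop looks roughly like this (a sketch; the real directory names and per-directory counting differ):

TOTAL=0
for DIR in /d2lmig/*/*/bak ; do
  # Count the .bak files beneath this one bak directory
  COUNT=`find "$DIR" -name '*.bak' | wc -l`
  TOTAL=`expr $TOTAL + $COUNT`
  if [ "$RUNTYPE" = "INTERNAL" ] ; then echo " Running total = $TOTAL" ; fi
done

The glob here only expands to the bak directories themselves, not the tens of thousands of files beneath them, so it stays well under the argument limit.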

Then I commented out the TOTAL that used find. Voila.

I know, I should have done it this way in the beginning. It was sloppy code to go back and get the total afterward. Maybe because I have this post here, I will reference it and not be so bad in the future.

IMS Import Error When Node Is Down

This is what I got when a node was down while I attempted to do an IMS import in Blackboard CE/Vista.

Failed to upload files, exiting.
Cause could include invalid permission on file/directory,
invalid file/directory or
repository related problems

The keywords permission, file, and directory would have sent me anywhere but the right place. The keyword repository made me suspicious the node had a worse issue than just bad permissions. So I looked for the most recent WebCTServer log and found it to be a week old. The last messages in the log confirmed the node had been down for a week.
🙁

Anything that flagged whether or not the node was running would have saved me lots of time this morning.

So I added a couple of lines to my .bashrc to provide a visual indicator of how many are running.

JAVA_RUNNING=`ps -ef | grep [j]ava | grep -c [v]ista`
echo "  -- No. Vista processes running = $JAVA_RUNNING"

Better might even be to have it evaluate whether fewer than one or more than two (or three) are running. If so, then print something obvious that the world is falling apart; see the sketch below. Maybe later. It took me just a couple of minutes to write and test what I have. The rest will come after I decide what I really want. 🙂
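
That evaluation might look something like this (a sketch; treating one to three processes as the healthy range is an assumption):

JAVA_RUNNING=`ps -ef | grep [j]ava | grep -c [v]ista`
if [ "$JAVA_RUNNING" -lt 1 ] || [ "$JAVA_RUNNING" -gt 3 ] ; then
  # Unexpected count -- make it hard to miss
  echo "  !!! Vista processes running = $JAVA_RUNNING -- CHECK THE NODES !!!"
else
  echo "  -- No. Vista processes running = $JAVA_RUNNING"
fi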

Also, the node was not running because a coworker had run into a situation where the fifth node would not start. She thought maybe the number of connections Oracle would accept was not set high enough. I suggested a simple test: shut down a node and see if the problem one suddenly works. It turned out I was working with the very node she had shut down for the test, and she had just started a script to bring them back up when I asked.