I asked the team that manages our Nagios service to add checks on a log file. In typical fashion, I asked them to apply the checks to several test servers first and then, once those worked, add them to prod.
Some worked and some didn’t. The failures came with this error:
NRPE unable to read output
After putting that into DuckDuckGo, I think the error is one of those I despise: a non-specific error message that exists because development doesn’t test failure cases internally.
Good code identifies the cause of a failure by trapping errors at each step of the process. This allows administrators to quickly identify the issue, make a fix, try it, confirm it was successful, and move on.
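As a rough sketch of what I mean by trapping errors along the process, here is a small Python log check that gives each failure mode its own message. This is not the actual NRPE or plugin code; `check_log` and its arguments are names I made up for illustration. The only thing borrowed from Nagios is the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import re

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_log(path, pattern):
    """Hypothetical check: return (exit_code, message).

    Each step that can fail is trapped separately so the message
    names the actual cause instead of a generic failure.
    """
    # Step 1: the pattern itself may be invalid.
    try:
        regex = re.compile(pattern)
    except re.error as exc:
        return UNKNOWN, f"UNKNOWN: bad pattern {pattern!r}: {exc}"

    # Step 2: opening and reading the file can fail in several distinct ways.
    try:
        with open(path, "r", errors="replace") as handle:
            hits = sum(1 for line in handle if regex.search(line))
    except FileNotFoundError:
        return UNKNOWN, f"UNKNOWN: log file {path} does not exist"
    except PermissionError:
        return UNKNOWN, f"UNKNOWN: user lacks read permission on {path}"
    except IsADirectoryError:
        return UNKNOWN, f"UNKNOWN: {path} is a directory, not a file"
    except OSError as exc:
        # Anything else still reports the underlying OS error text.
        return UNKNOWN, f"UNKNOWN: could not read {path}: {exc}"

    # Step 3: report the result of the check itself.
    if hits:
        return CRITICAL, f"CRITICAL: {hits} line(s) matching {pattern!r} in {path}"
    return OK, f"OK: no lines matching {pattern!r} in {path}"
```

A wrapper script would print the message and exit with the code. The point is only that a missing file, a permissions problem, and a bad pattern each produce a different line of output, so the admin starts with one hypothesis instead of a dozen.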
Bad code sends an error, but the cause could be anything. This forces administrators to spend hours just determining what the issue is before trying a fix without any confidence, because the cause is unclear. Many times I have to cycle through anywhere from a half dozen to multiple dozen hypotheses to find the correct one.
Delicate implementations probably happen because of these useless error messages. My theory:
- Administrator hits one of these error messages.
- Admin finds multiple solutions, picks what they think is the most likely cause, and tries a solution.
- Admin fails to make a backup or use versioning tools or, if they do, doesn’t revert to the prior version. Nor do they document in the code why the changes were made.
- Cycles through several iterations to find the one that works.
- Hacks from step 3 are left behind in configurations.
- Months to years later, it stops working again, and the same or another admin looking at it doesn’t know why weird stuff is in the configuration.
Even if there is a versioning tool, committing every attempted change and then backing it out is probably too much. The tool can show who made a change, but even they probably don’t remember the details by the time someone comes back to look at the issue later.
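One lightweight middle ground, sketched here in Python, is a helper that copies the configuration file to a timestamped backup and appends a one-line reason to a change log beside it before each attempt. The function name `backup_with_reason`, the `.bak` suffix, and the `.changes.log` file are all my own assumptions for illustration, not part of any tool mentioned above.

```python
import shutil
import time
from pathlib import Path


def backup_with_reason(config_path, reason):
    """Hypothetical helper: back up a config file and record why it is changing."""
    source = Path(config_path)
    stamp = time.strftime("%Y%m%d-%H%M%S")

    # Timestamped copy next to the original, e.g. name.20240101-120000.bak
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.copy2(source, backup)  # copy2 also preserves timestamps and mode bits

    # Append the reason so a later admin can see why the change was tried.
    changelog = source.with_name(f"{source.name}.changes.log")
    with changelog.open("a") as log:
        log.write(f"{stamp} backup={backup.name} reason={reason}\n")
    return backup
```

Calling it before each edit leaves a trail: if a hypothesis turns out wrong, the backup is there to restore, and the log line explains what the weird stuff in the configuration was supposed to fix.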