Last week I logged into the ticket management system to look at overnight updates to cases and saw a pop-up for a “widespread issue”, basically two schools, involving LDAP. So I looked up the cases. The two schools were on the same cluster. Most likely the problem was on my end, which sucks.
Security people like to change firewall settings that end up blocking us. That is the most common fault behind these “LDAP stopped working” cases. But from one node, and then from every node in the cluster, I could connect to both schools’ LDAP servers. So not the firewall.
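The connectivity check amounts to a TCP probe of the LDAP port from each node. A minimal sketch, assuming plain LDAP on port 389; the hostnames are placeholders, and a raw socket connect only proves the firewall path is open, not that LDAP itself is answering:

```python
import socket

def can_reach(host: str, port: int = 389, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical hostnames -- run this from each node in the cluster:
# for host in ("ldap.school-a.example", "ldap.school-b.example"):
#     print(host, can_reach(host))
```

In practice an `ldapsearch` against each server would be a stronger test, since it exercises the full protocol rather than just the port.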
The next most likely fault was a changed LDAP server IP, or a failure in the load balancing in front of the LDAP servers. Would two completely unrelated institutions change their LDAP server IPs at the same time? Or hit load-balancing issues simultaneously? Both seemed unlikely.
So I checked the webct.log on a node and found an error I had never seen before: there were already 50 open LDAP connections, so it could not open a new one. The error was there for both schools. Having never seen it before, my best guess was that something caused connections to open but never close. A well-behaved client would close a connection after less than 300 seconds, and ideally less than 30.
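The failure mode can be sketched with a toy pool that enforces a hard cap of 50, the limit the log reported. This is a hypothetical illustration, not WebCT’s actual code: a leaky caller that never closes eventually exhausts the cap, while a caller that always closes (here via a context manager) never does:

```python
class LdapPool:
    """Toy connection pool with a hard cap, mimicking the 50-connection
    limit that webct.log complained about."""

    def __init__(self, cap: int = 50):
        self.cap = cap
        self.open_count = 0

    def connect(self):
        if self.open_count >= self.cap:
            raise RuntimeError("cannot open new LDAP connection: 50 already open")
        self.open_count += 1
        return _Conn(self)

class _Conn:
    def __init__(self, pool: LdapPool):
        self.pool = pool

    def close(self):
        self.pool.open_count -= 1

    # Context-manager support so a well-behaved caller always closes.
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

pool = LdapPool(cap=50)
for _ in range(50):
    pool.connect()   # leaky caller: opens, never closes
# The 51st connect() would raise -- which is what every node was logging.
```

The same cap is harmless when connections are closed promptly; the problem is purely the leak.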
So I restarted the node. Users started logging in fine. So I kicked off a rolling script to restart all the nodes one by one in the background, with no end-user impact. It took about an hour to restart all ten. Everyone was happy that resolved the issue.
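The rolling restart can be sketched as follows. `restart` and `is_healthy` are stand-ins for whatever the real script invoked (the actual commands are not shown here); the point is the one-at-a-time ordering, with a health check between restarts so the cluster keeps serving users throughout:

```python
import time

def rolling_restart(nodes, restart, is_healthy, poll_seconds=1.0):
    """Restart nodes one by one, waiting for each to come back healthy
    before touching the next, so users always have live nodes to hit."""
    for node in nodes:
        restart(node)
        while not is_healthy(node):
            time.sleep(poll_seconds)

# Hypothetical usage -- plug in the real restart and health-check commands:
# rolling_restart([f"node{i}" for i in range(1, 11)], restart_cmd, health_cmd)
```

At roughly six minutes per node, ten nodes works out to the hour the script actually took.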
In retrospect, a strategy I might have used would have been to narrow the nodes available to users to those which had already restarted. Each restart took about six minutes, and the restarts happened one by one. The first four nodes also run chat; the next six were regular nodes. Blocking access to this latter set would force users onto the first nodes, plus additional nodes as they were fixed. Time until users were definitely getting a good node could have dropped to about 18 minutes rather than an hour. Though… with enough users, this could have overwhelmed the cluster. Before 9:30am there was not enough traffic to overwhelm three nodes. Maybe better safe, though.
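The timing argument above works out like this, assuming ten nodes, about six minutes per restart, and three restarted nodes as enough capacity for the pre-9:30am traffic:

```python
NODES = 10
MINUTES_PER_RESTART = 6
SAFE_NODE_COUNT = 3  # enough capacity for early-morning traffic

# Full rolling restart: every node, back to back.
full_restart_minutes = NODES * MINUTES_PER_RESTART            # 60 minutes

# With bad nodes blocked, users are guaranteed a good node as soon
# as enough restarted nodes exist to carry the load.
minutes_until_users_safe = SAFE_NODE_COUNT * MINUTES_PER_RESTART  # 18 minutes
```

So the trade is 18 minutes versus 60 for users, against the risk of concentrating all traffic on three nodes.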