After the release of the pre-1.10 patch of EVE Workbench, we spent the weekend fixing issues reported by users, testers and the new logging system, and continued updating the backend systems.
A small overview of things that were done:
- Fixed: Missing skills were not shown in the fitting screen
- Fixed: Getting SQL errors when appraisal data was malformed
- Fixed: User access tokens were not properly refreshed and caused ESI errors
- Fixed: The ESI type background task was not working properly and deleted types when a page returned a Not Modified response
- Added: Reauthentication popup shown when the user's access token is no longer valid
- Added: Progress monitoring to backend worker system
- Added: Health and Readiness probes to the website
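The "Not Modified" bug in the list above is a classic caching pitfall: a 304 response carries no body, so treating the empty payload as "no types exist" wipes the local cache. Here is a minimal sketch of the intended behavior; the function name, the `fetch` callable and the cache shape are hypothetical, not EVE Workbench's actual code:

```python
def sync_types(fetch, cache):
    """Refresh a local type cache from an ESI-style response.

    `fetch` returns (status, payload). A 304 (Not Modified) means the
    cached data is still current and must be kept, never deleted.
    """
    status, payload = fetch()
    if status == 200:
        # Fresh data: replace the cache contents wholesale.
        cache.clear()
        cache.update(payload)
    elif status == 304:
        # Nothing changed upstream: leave the cache untouched.
        pass
    else:
        raise RuntimeError(f"unexpected ESI status {status}")
    return cache
```

The key design point is that the 304 branch is an explicit no-op, so a background worker can never confuse "no body in the response" with "these types were removed".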
Beyond the list above, several small errors and bugs popped up over the weekend and were fixed right away.
Then it was Sunday afternoon….
Everything was still going smoothly with the constant updates to the production environment. Then at 14:30 (Amsterdam time), during a small update, the website suddenly became slower and slower until it came to a grinding halt.
After some investigation we found the cause: the pod scheduler had decided to simply put everything on one node within the cluster, making that node unresponsive. This forced us to take down EVE Workbench and all of its supporting systems and redeploy.
Eventually EVE Workbench came back up, but within minutes the same issue triggered again. Looking into the cluster once more, we found the root cause in the configuration of the cluster resource scheduler, which had been in place since we set the cluster up a year ago (why it surfaced twice within minutes: Murphy's law, I guess). After fixing that and redeploying everything again, EVE Workbench was back up and the dev weekend was complete.
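The post doesn't say what the scheduler misconfiguration actually was, but assuming a Kubernetes cluster, one common safeguard against the scheduler packing every replica onto a single node is a topology spread constraint on the Deployment. A hedged sketch (the `app: eve-workbench` label is hypothetical):

```yaml
# Fragment of a Deployment spec: spread matching pods across nodes so the
# scheduler cannot place every replica on one host.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: eve-workbench
```

With `whenUnsatisfiable: DoNotSchedule`, a pod that would violate the spread stays pending instead of piling onto an already loaded node.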