A small review of the road to EWB 1.10 and that Friday, 29 October 2021.
The curvy road
The road to EWB 1.10 was the subject of multiple posts on this blog, but it was a road filled with bumps, unexpected turns and the year 2021 itself.
The decision to build EWB 1.10 was made at the start of 2021: it would contain multiple performance updates, changes to its data layers and a new (non-SQL-based) search system.
Shortly after the decision was made we started the development process, with a first major pre-update in April 2021. But then the world came back alive for us: working from home was scaled down, returning to the office took its toll on a personal level, and the development process slowed down. From that point until the beginning of August 2021 there were multiple small developments and posts, but not what we had wanted in the first place.
At the beginning of August the announcement came that the old authentication system for EVE Online would be taken down on the 1st of November, and we decided that this was the moment to really start pushing again and release EWB 1.10 before that date. From that point on the small side projects were merged into the main development branch, the authentication system was changed, the search system was finalized and the changes to the data layers and microservices were completed.
Everything was ready and working on the staging environment and the target date for release was 2021-10-30, together with some maintenance on the servers.
But then came Friday the 29th….
Friday 29th October, the day EWB ground to a halt
Over the course of the week we received multiple reports of EWB slowing down. On Friday after work I looked into the database and discovered some deadlocks on the database servers, which were checked and resolved, after which performance went back up.
Then the 20.00 backup hit the database server and EWB ground to a halt.
I was preparing for a lazy evening watching Lionear streaming, but then I noticed that the visitor count on EWB was lower than normal. When I tried to open the website, it didn’t load…
I started investigating and, once it was clear that a simple pod restart would not fix the problem, notified Lionear, who decided to cancel his stream and started checking parts of EWB as well. After only minutes of searching we tracked the problem back to the database server. I checked the log files and discovered multiple I/O errors, which meant the database was possibly damaged.
At 20.30 the decision was made to bring down the entire infrastructure of EWB to further isolate the problem, prevent more damage to the database and investigate potential issues on the hardware layers.
After concluding that there were no problems at the hardware level, we started every server back up and also updated them to the latest and greatest. At the same time we started talking about releasing version 1.10 of EWB early: the website was already down, and this would prevent another downtime within 24 hours.
No sooner said than done: the servers were brought back up, EWB 1.10 was pushed onto the production stack and the database was checked for corruption. Luckily none was found, because at that point we didn’t realize that the backup automation had backfired…
After 3 hours of updating and checking, EWB was back up for us and we started pushing the fit data into the new elastic search system. At 23.45 we were able to open the doors, reveal EWB 1.10 and announce that we were back. Then the whisky was poured into a glass and the small issues that came up were fixed.
The backup automation backlash
Somewhere around July, after years of manually deleting old backup files on a weekly basis, we decided to automate this process. This worked great, but it had one problem: no one was looking into the backups every week anymore.
This nearly backfired on Friday during the crash. After the system was back up and we had had our night’s rest, I opened the backup repository on Saturday morning to see if the backups were running again and discovered that all the backups from before 2021-10-30 were gone. The automated scripts had deleted the old files, but over the course of the week no new backups had been received from the EWB database server…
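For anyone automating their own cleanup, here is a minimal sketch of the kind of guard that would have caught this. It is not our actual script: the function name `prune_backups`, the `*.bak` file pattern and the thresholds (7 days of retention, a newest backup no older than 26 hours) are all assumptions made up for illustration. The idea is simply that the cleanup refuses to delete anything when no fresh backup has arrived.

```python
from datetime import datetime, timedelta
from pathlib import Path


def prune_backups(backup_dir, keep_days=7, max_age_hours=26, now=None):
    """Delete backups older than keep_days, but only if a fresh backup exists.

    Hypothetical helper: names, file pattern and thresholds are illustrative.
    Returns the list of deleted paths. Raises RuntimeError (and deletes
    nothing) when there are no backups or the newest one is too old.
    """
    now = now or datetime.now()
    # Oldest first, newest last, based on file modification time.
    files = sorted(Path(backup_dir).glob("*.bak"), key=lambda p: p.stat().st_mtime)
    if not files:
        raise RuntimeError("no backups found; refusing to prune")

    newest = datetime.fromtimestamp(files[-1].stat().st_mtime)
    if now - newest > timedelta(hours=max_age_hours):
        raise RuntimeError(
            f"newest backup is from {newest:%Y-%m-%d %H:%M}; refusing to prune"
        )

    cutoff = now - timedelta(days=keep_days)
    deleted = []
    for path in files[:-1]:  # the newest backup is never a deletion candidate
        if datetime.fromtimestamp(path.stat().st_mtime) < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted
```

Run from a scheduler, the raised error is the useful part: a failing cleanup job is a lot harder to overlook than a backup folder that is quietly emptying itself.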
Conclusion
A chain reaction of problems originating in the background workers triggered deadlocks on the database. We cleared them, but they reoccurred later that Friday, and when the backup process hit the database it was too much and it froze completely.
After running for more than 2 years without reboots, a reboot and maintenance of the entire stack would have been planned in the short term anyway, which would also have caused a longer downtime.
But after all the sweating, cursing and headaches, EWB is back online with an updated stack, a revised codebase, new features and new procedures for checking the backups…
This was a personal review from the start to release of EWB 1.10, from the perspective of RaymondKrah and Lionear.
For now fly and stay safe everyone o7