Major optimizations - Done!
published: 2021-03-18
A bit of background
During the past three months we had system outages. Unfortunately, a few of them reached and impacted the users of Keytiles: our numbers dropped to 0 and data disappeared from the views... And on top of that, the first two times users had to report the issue, because our Monitoring/Alerting system failed to recognize the problem... OK, there was no data loss at least, but still: not good!
We are happy to announce!
#homework - done!
The root cause behind all of these incidents was that the system occasionally got overloaded when something happened in the world at Corona-news level. As we are tracking several big news portals, every time something was announced in Austria, Germany, Slovenia, Hungary or other countries and a news website (a customer of Keytiles) published an article about it, visits shot sky-high, causing a huge spike in incoming load for the Keytiles system.
We had to learn and analyze a lot!
- On one hand, we had to continuously enhance our Monitoring/Alerting systems to catch an incoming problem early, so that we can jump in and keep the issue from escalating to a level where it hits customers.
This went pretty well: after the first 2 incidents we were able to detect the others early and handle them silently in the background.
- On the other hand, we also had to find out why exactly Keytiles was getting overloaded. It should not! We designed the system so that it should be able to handle spikes as big as 100%, so why was Keytiles failing to do that? What is slow? Where? Why?
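To make the early-detection idea from the first point a bit more concrete, here is a minimal sketch of a spike check on incoming load. It is not the actual Keytiles alerting logic: the `SpikeDetector` class, the `hits_per_minute` input, the window size and the 2x threshold (that is, +100% over the recent baseline) are all assumptions made up for illustration.

```python
from collections import deque
from statistics import mean


class SpikeDetector:
    """Flags a sample when it exceeds the rolling baseline by a given factor.

    Hypothetical helper for illustration only, not part of Keytiles.
    """

    def __init__(self, window: int = 5, factor: float = 2.0):
        self.window = window
        self.factor = factor  # 2.0 means "alert above +100% over baseline"
        self.samples = deque(maxlen=window)

    def observe(self, hits_per_minute: float) -> bool:
        # Only judge a sample once the baseline window is full.
        is_spike = (
            len(self.samples) == self.window
            and hits_per_minute > self.factor * mean(self.samples)
        )
        # Spikes are kept out of the baseline, so a long spike keeps alerting.
        if not is_spike:
            self.samples.append(hits_per_minute)
        return is_spike


detector = SpikeDetector(window=5, factor=2.0)
traffic = [100, 110, 95, 105, 100, 102, 450, 480, 120]  # made-up numbers
alerts = [detector.observe(rate) for rate in traffic]
```

With these made-up numbers only the two samples far above the baseline (450 and 480) raise an alert. In a real setup the same kind of rule would more likely live in the monitoring system itself than in application code.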
The good news: we were able to identify the weak spots in the system and, with very high priority, we came up with fixes touching the logic at a low level.
The enhancements of the Keytiles system were released into PROD yesterday (2021-03-17) and the effect was shocking! This time luckily positively shocking... :-) How much? This much!
System load in our key server elements before and after the deployment (marked by the red line, if not obvious.. :-))
(Taken from our Prometheus / Grafana based monitoring system)
Processing speed of updating tileData when a hit comes in. Earlier it took ~150-200 milliseconds. But after the deployment... well, see for yourself!
(Taken from our Prometheus / Grafana based monitoring system)
Bye bye incidents!
Thanks to the optimizations we made, we are now measuring an outstanding increase in the processing capacity of Keytiles (seriously, we measured 80x faster processing speed!), which means it will be much more difficult to overload the system with spikes generated by major Corona news!
Besides, thanks to the enhancements we made in our Monitoring/Alerting systems, we are much better protected now than 3 months ago. Also, don't forget that the dev/ops team had the opportunity to learn a lot about how to detect/triage/handle situations very fast, before they hit the service quality.
Last but not least, we would like to thank our users for their patience! So THANK YOU!