For us at CheMondis, as a software development company, data security and the security of our marketplace are essential. It therefore goes without saying that we make sure we meet our security standards at all times. To do so, our engineering team now runs so-called “fire drills”.
What Is Behind a “Fire Drill”?
The literal meaning of a fire drill is an exercise to prepare for a fire or another emergency in a building. People practice by doing what they would do in a real fire, such as evacuating the building or grabbing a fire extinguisher, as if an emergency had occurred.
In software development, a fire drill prepares the team for emergencies such as hardware failures, major bugs or hacker attacks. In short, it prepares us for the case where something goes wrong in our code, the heart of the marketplace, and we do not yet know why.
*Fingers crossed* that such a “fire” will never occur.
The Scenario.
On the day of the fire drill, our engineering team met in the morning, and our Head of Engineering, Max Kugland (the initiator of the fire drill), explained the exercises for the rest of the day.
To make sure we did not create any real problems during the exercise, and that our marketplace survived the fire drill unharmed, we created a separate, completely isolated environment. In this environment, the chaos monkey could be let loose.
Max and a task force designed a couple of scenarios in which the marketplace service gets severely disrupted:
- networking problems in the cluster (slow network, dropped packets, DNS resolution failure)
- sub-system outage (failure of a sub-system, system shuts down, can’t be restarted, disk full…)
- partial data loss (deleted database tables, database goes down, automatic failover)
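To give an idea of what a scenario like “slow network, dropped packets” can look like at the smallest possible scale, here is a minimal Python sketch. It is not the tooling we used in the drill: `with_faults`, `InjectedFault` and `fetch_price` are hypothetical names made up for this illustration, and the sketch only wraps a single function call instead of disrupting a real cluster.

```python
import random
import time


class InjectedFault(Exception):
    """Raised when the wrapper simulates a dropped request (hypothetical name)."""


def with_faults(func, drop_rate=0.0, delay_seconds=0.0, rng=None, sleep=time.sleep):
    """Wrap func so that every call is delayed and some calls are dropped.

    drop_rate: probability (0.0 to 1.0) that a call fails before reaching func.
    delay_seconds: artificial latency added to every call.
    rng, sleep: injectable so the behavior can be made deterministic in tests.
    """
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        sleep(delay_seconds)  # simulate a slow network
        if rng.random() < drop_rate:  # simulate a dropped packet
            raise InjectedFault("simulated network drop")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(product_id):
    # Hypothetical stand-in for a call to a downstream service.
    return {"product_id": product_id, "price": 42}


# Roughly one in three calls fails, and every call takes 0.2 s longer.
flaky_fetch_price = with_faults(fetch_price, drop_rate=0.33, delay_seconds=0.2)
```

Calling `flaky_fetch_price("some-id")` in a loop then shows quickly whether the calling code retries, times out or falls over, which is the kind of question a drill is meant to answer.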
We Follow the DevOps Approach.
That means we do not have a dedicated operations team. DevOps is a set of practices that a team follows to not only write and test code, but also build, package and release it and operate the underlying infrastructure. In other words, the whole cycle. Every (backend) developer therefore not only writes code, but also has in-depth knowledge of the infrastructure the code runs on and of how the various systems involved interact.
Why Did We Focus on Infrastructure?
Parts of the exercise were to…
- figure out if something is wrong
- pinpoint and analyze what is wrong and which systems are affected
- contain the failure and protect other systems
- develop and deploy countermeasures
- learn from the failure and increase system resiliency and fault tolerance
The Engineering Team Was on Fire.
The team spent the day working through these issues and solved them successfully.
Based on what we learned in the fire drill, we improved and extended our system runbooks. So we not only practiced for an emergency, but now also have an easy-to-follow, step-by-step guide on what to do in case something blows up.
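For readers who have not worked with runbooks: a runbook is essentially a checklist that walks through the phases of the exercise listed above. The following Python sketch shows one way such a guide could be kept as structured data and rendered as a numbered list. It is an illustration under our own assumptions, not our actual runbook format; the `Runbook` and `Step` classes and the “disk full” instructions are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    phase: str        # e.g. "detect", "analyze", "contain", "counter", "learn"
    instruction: str  # what the engineer on call should do


@dataclass
class Runbook:
    title: str
    steps: list = field(default_factory=list)

    def add(self, phase, instruction):
        """Append a step and return the runbook so calls can be chained."""
        self.steps.append(Step(phase, instruction))
        return self

    def render(self):
        """Return the runbook as a numbered plain-text checklist."""
        lines = [self.title]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"{number}. [{step.phase}] {step.instruction}")
        return "\n".join(lines)


# Hypothetical content, loosely based on the "disk full" scenario.
disk_full = (
    Runbook("Disk full on a sub-system")
    .add("detect", "Confirm the alert by checking free disk space on the host.")
    .add("analyze", "Find out which files are growing and which systems depend on them.")
    .add("contain", "Stop the process that is filling the disk.")
    .add("counter", "Free up or extend the disk, then restart the sub-system.")
    .add("learn", "Record the root cause and update this runbook.")
)

print(disk_full.render())
```

Keeping the phase next to each instruction makes it easy to see whether a runbook covers every phase or skips straight to the fix.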
Prevention Is Better Than Cure.
We extend our catalogue of failure scenarios on a regular basis and intend to run a fire drill every quarter, so we are prepared when lightning strikes.
So, there are many more fire drills to come…
Interested in Becoming Part of the Engineering Team?
Apply now via the following button to become part of #teamchemondis.
Thanks for taking the time to read the CheMondis blog.