Sunday, 11 August 2019

On using metaphors for Tech Ops

I like finding metaphors from other worlds outside of technology or engineering that can help simplify concepts for business people. Using such comparisons "is like a..." can actually be quite a powerful way of connecting or contextualising, especially when these people (business stakeholders mostly) are not interested in the technical details at all. All these folks care about, when there's an operational issue impacting business/customers, are: "What's the issue? Why did it happen? When will it be fixed? I need 15 minute updates please. Not interested in detail. Just want to know when it will be resolved. And don't forget Root Cause Analysis RCA document & Consequence Management!When I get these remarks,  I'm like...left dumbfounded...and decided to find another way of communicating to manage expectations better...



Despite the term being popularised by Trump last year, which was incidentally around the same time I introduced the term - in the software engineering community, "draining the swamp" is used to describe some refactoring and possibly restructuring of code and architecture, to address key bottlenecks, instabilities and optimisations. This could also mean revisiting original design & implementation decisions. Some people in software might put this under the bucket of "paying down technical debt" whilst others might just put this down as a natural part of the software evolution.

I used this term to prepare business people on the impact of the changes we planned to make. This was in response to a business ask of anticipating the biggest load of the tech platform stack - will the system cope with a "Black Friday-like event"? The simple answer was NO, not in its current form, that we would need to make some aggressive changes to the platform - like scaling out to multiple datacentres. The stack was built as a monolith (multiple services & components, some micro services, running on traditional VM infrastructure, not containerised, etc.) and so the best chances of scaling for load was to move the stack to run off multiple data centres, where we could scale up on each datacentre as well, but still had bottlenecks with share database cluster off one primary datacentre.

Suffice to say, this stuff goes above the heads of the business people. So I said, it's like draining the swamp. When that happens, we are bound to find some nasty things lying beneath the surface, things we don't see today, and can only see when we start draining the swamp. When this happens, and we hit some boulders or some unforeseen monsters of the swamp, we are bound to experience downtime and an outage or two. Don't panic, this is expected, given our working constraints, that is, make the changes on live production environment.

#draintheswamp went down pretty well with business. They got it. Left us alone. They trusted us! They managed the customer & business impact when we needed help, and our tech teams worked heroic hours to get the stack running off multiple datacentres, just in time for the biggest event of the business for the year (2018), our user numbers reached record highs, and the platform did not fall over! So round one of #draintheswamp was done, but the swamp was not completely drained out...

Until earlier this year (2019), when we kicked-off another round of #draintheswamp to start the migration to cloud stacks, starting with containerisation...which led the platform to suffer a series of outages at the most inappropriate moments...customers were pissed, business impact teams were on the back-foot, social media was killing us, the changes for #draintheswamp was starting to kill the platform...when I resorted to introducing another metaphor: #ICU. Folks, the tech stack is in some severe TLC, we are instigating #ICU mode. Life support is initiated, vitals are not great, but patient is surviving, and needs high level of critical care...

Technical Operations / Site Reliability Engineering is not so different to running a hospital's ER Emergency any time, despite monitoring vital signs of existing patients (system components of tech stack), or doing day-to-day operational management of the ward (infra maintenance), an event could occur that can just spike, like a natural disaster, accident or terrorist attack (hardware failure, critical component dies, database crash unanticipated, etc.)...ER staff have to triage quickly (Tech Ops also have to triage), make life/death calls (Tech Ops when to call a P1 incident & inform business), decide on severity of the injury (requires immediate attention, operate now, or can wait??)...the same holds true with bringing back a technical platform from death to living healthy why wouldn't we want to reuse medical terms?? I think it makes perfect sense!


The gist of this metaphor is simple: ICU/CCU means Intensive/Critical Care Unit. When a patient is in ICU, it is serious stuff, urgent priority, focused attention to monitoring, diagnostics & multiple treatment options, using whatever means necessary to enable a positive outcome for the patient.

The typical flow to recovery to health looks for these transitions:

  • Start in ICU/CCU, remain there until a period of time where interventions applied & life-support is no longer necessary, then...
  • Transfer to High-Care facility, remaining close to ICU, but stable enough to warrant reduced focus and attention as compared to being in ICU, but be prepared for surprises, so best be close to ICU ward and not far away...wait until doctors give the green light to move on to...
  • Normal / General Ward...the last stay before checking out to go home. Vitals and all other required checks all pass before discharged for home...
Although being in #ICU is rather stressful for everyone, there is a high sense of urgency, and the pressure to respond to business is quite intense (especially when Risk/Governance demands "Consequence Management"), we can draw additional parallels from medical ICU, like:
  • Remaining calm, address the topics / unknowns in logical manner, using tools of diagnosis you've trained for (Technical tools like Five Whys, etc.)
  • Running tests, take blood samples, etc. (Tech/Data forensics, logging, test hypotheses, etc.)
  • Call on other medical experts to bounce diagnosis / brainstorm (Involve as many tech experts to offer new perspectives)
  • Implement treatment plan according to hypothesis, wait for new results (Fix something, wait, analyse result, before making additional changes).
  • Communicate clearly to patient's stakeholders (Communicate to business stakeholders transparently without hiding or being defensive...come clean).
  • Pray :-)

Unpacking the medical terms...

According to Wikipedia, an ICU:
An intensive care unit (ICU), also known as an intensive therapy unit or intensive treatment unit (ITU) or critical care unit (CCU), is a special department of a hospital or health care facility that provides intensive treatment medicine.
Intensive care units cater to patients with severe or life-threatening illnesses and injuries, which require constant care, close supervision from life support equipment and medication in order to ensure normal bodily functions. They are staffed by highly trained physiciansnurses and respiratory therapists who specialize in caring for critically ill patients. ICUs are also distinguished from general hospital wards by a higher staff-to-patient ratio and to access to advanced medical resources and equipment that is not routinely available elsewhere.
Patients may be referred directly from an emergency department if required, or from a ward if they rapidly deteriorate, or immediately after surgery if the surgery is very invasive and the patient is at high risk of complications
When a technical platform goes into #ICU, we do the following:

  • We're on red alert - life threatening to business (customer experience is tanking)
  • Stop all other development work, or reduce planned work as much as possible by pulling in all the people we need to help (speaks to higher-staff-to-patient ratio)
  • Pull out all the stops, bring in experts, tool-up with advanced resources and equipment
  • Communicate daily to all business stakeholders
  • Perform multiple surgeries if needed (hotfixes, patches, etc.)
  • Run multiple diagnostics in parallel (think Dr. House)

According to Wikipedia, a High Care/Dependency Unit:
high-dependency unit is an area in a hospital, usually located close to the intensive care unit, where patients can be cared for more extensively than on a normal ward, but not to the point of intensive care. It is appropriate for patients who have had major surgery and for those with single-organ failure. Many of these units were set up in the 1990s when hospitals found that a proportion of patients was requiring a level of care that could not be delivered in a normal ward setting.[1] This is thought to be associated with a reduction in mortality.[1] Patients may be admitted to an HDU bed because they are at risk of requiring intensive care admission, or as a step-down between intensive care and ward-based care.
According to HealthTalk, the last point to recovery is General/Normal Ward:
People are transferred from the intensive care unit to a general ward when medical staff decide that they no longer need such close observation and one-to-one care. For many people, this move is an important step in their progress from being critically ill to recovering..
'Nuff said, IMHO the similitude mentioned above should be more than self explanatory ;-)

No comments:

Post a Comment