Saturday, 15 February 2014

Sticky Metaphors for Software Projects: Bankable Builds, Surgical Mode

In this post I share an experience from a recent project where I influenced the senior management team to change strategy & tactics in order to make a fixed deadline. We moved away from the standard Development->Integration->QA Team 1->QA Team n approach (which lasted anywhere between 2-4 weeks) to a more controlled Agile, Iterative/Lean approach of weekly Build & Test cycles, and then to a build every three days. We did this entirely manually, while our Agile processes were still coming of age and we had little in terms of real-world continuous integration. What I really want to share is the way in which I got people to change their mindset, by using simple metaphors that were actually quite powerful and can be used to sell similar stories:
  • Bankable Builds - Reusing concepts from The Weakest Link
  • Surgical Mode - Taking a page from medicine - Surgery is about precision, don't do more than what's required
It is interesting that these concepts really stuck with the entire department, so much so that they've become part of the common vocabulary - almost every project now talks about Bankable Builds & Surgical Mode!

The post is presented as follows:
Note: Like any high-profile project with intense management focus, the decision to make these changes came from senior management, which put a spanner in the works with respect to granting our "Agile" team autonomy & empowerment. Although a little resistance was felt at the lower levels, the teams nevertheless remained committed to doing what it takes to ship the product. The teams and vendors did share their concerns in the retrospective feedback: the one-week release cycle was quite problematic, to the point of being chaotic, with large overheads.

The facts can't be disputed though: our agility was flawed from the get-go because we did not have a solid continuous build, integration & regression test environment in place (we lacked automation). It was a struggle, with a lot of manual overhead, but it was a necessary intervention that got us to launch on time. Without it we would have missed the target deadline, and the project was too high-stakes to let that happen.

Going forward, however, the principle of bankable builds is still a solid one, and one that is well known to teams with mature software processes. I can't emphasise enough the value of investing in up-front automation: continuous build, daily regression testing & automated component / system tests make life so much easier for software development teams.

Invest time and effort in getting the foundations set up, even at the expense of churning out code. Churning out code without those foundations builds a huge backlog for QA down the line, and teams end up spinning their wheels putting out fires that would have been avoided had the right discipline been in place.

Story & Background
On a recent project that I managed as End-to-End Program Manager, which involved the roll-out of a complex Digital TV system for PVR & Video On Demand and encompassed a brand new broadcast chain as well as brand new Set Top Box software, I spent a lot of time convincing the senior management stakeholders on how to approach the final Launch delivery milestones. To do this, I used two metaphors that eventually stuck and have become part of the common vocabulary now spoken across that department: Bankable Builds & Surgical Mode.

The story goes like this: Almost two years into the project, the business decided to implement a course correction to better target the market segment, making it more relevant to the current infrastructure challenges (African countries are behind the curve with ubiquitous, always on Internet, unlimited downloads and fast bit rates), deciding to rather simulate the Internet and push content directly to the set top box, over satellite instead (a.k.a. PushVOD), thereby simulating the whole Internet experience of content-on-demand (Read the write-up here). So we needed to change the STB hardware, redo almost all of the EPG UI, and change the Headend as well - all in a time frame of just 13 months!

So I put together an aggressive plan that called for a highly integrated project team, with continuous deliveries & continuous integration from all vendors, end-to-end. We had vendors located in UK, China, Netherlands, Paris, USA & multiple sites/teams in South Africa (SA) - so in a nutshell, a project team distributed across the world. Locally in SA, we owned the EPG UI component (new development, never been deployed) and to some extent, the STB Hardware design. We also owned System Integration, Component QA as well as Customer Acceptance QA.

To the seasoned Agile / Lean people, this project isn't so complicated. Agile was definitely the way to go, with carefully planned integration stages that proved functionality end-to-end incrementally, iterating progressively to achieve full functionality. You also expect that there would be mature processes in place for continuous integration (Test Driven Development, Unit Tests, Heavy Automation) as well as a coherent, structurally sound system architecture in place…yes??

Not really. What we essentially had was a team consisting of largely inexperienced people, led by some experienced engineers - but none of the team had ever experienced Agile projects before, let alone a large-scale execution such as this one. So we had a relatively immature team with immature processes. There was some understanding of the kind of scale required, but due to the highly pressured time line, all focus was on just getting the product shipped, get-the-job-done. That was such a manually-intensive operation, with such strong management oversight, that it was almost unfair to even label the project as "Agile". Yet we tried, and achieved bits-and-pieces in various areas, but not across the board. One of the failures was that, at the outset of the project contracts, nothing was really specified materially in terms of the project methodology - so some vendors tried to use Agile, whilst others did not - yet we still had to manage releases, synchronised to two-week iterations...

Time was running out before the committed launch deadline (after a few slippages), and it was down to a "Do-or-Die", classic Death March scenario, when the Exec intervened and installed a Technical Launch Director (LD), whose sole purpose was to resolve technical issues quickly, optimise processes and get the STB build to a stable launch candidate (the Headend was largely delivered way in advance of the STB; STBs usually cause most of the trouble and are generally the critical path).

So I was set to work with the new Launch Director (LD), letting him manage all the technical issues, whilst I focused on the time lines. We engaged in many heavy debates in the background that spanned processes, strategy & tactics (most of these debates happened without the project team members being aware of the management tensions on strategy), and tried to agree on weekly strategy targets for the team (all communication coming from the LD himself, to maintain a consistent messenger)…

The one debate in which it took me about three weeks to convince not only the Launch Director & Product Owner, but also the SteerCo, was about the need to really lock down hard and diligently burn down the issues one-by-one, to create suitable release candidates that we could have confidence in. We should also impress on the vendors that we wanted isolated bug fixes / patches only, and only for the specific bugs we were interested in. I maintained that if we did not enter this mode of working as soon as possible, and instead carried on with the standard work flow of Build->Release->Triage->Fix->QA in two-to-three week cycles, we would definitely miss the deadline.

And to do this, I used the concept of the following:
  • Bankable Builds - borrowed from the game show "The Weakest Link"
  • Surgical Mode - borrowed, of course, from the medical fraternity
Again, some of you seasoned agile/lean aficionados will say this is not new, that it's basically the fundamentals of Agile as well as standard software release practices. But remember what I said earlier: we were never ready for true Agile, we relied on manual processes, and our vendors weren't used to following instructions that well, nor did they care for "micro-management" - so we had to play the role of the hard-nosed customer. As SI, we tell you what we need, what to fix, and when we need it by: if we ask for a build in three days, you release!

Having come from a background that was really mature in terms of Agile & Continuous Integration with automatic builds, automated unit testing & automated system testing, to me it was a no-brainer from day one of the project that we would eventually reach a point where frequent releases would be required as we got closer and closer to launch. Starting from releasing every 4-5 weeks, moving down to every 2-3 weeks, to weekly and then to every 3 days was expected. In fact, I had predicted this would happen in an email I wrote to the senior management, asking them to consider a thought-experiment of what to expect in the six months prior to launch. Alas, it was rather premature to do so, as I got shot down immediately: "Don't rock the boat, the processes are working…" blah blah blah blah blah.

Here's a snippet from that email (sent six months prior to Launch campaign):
Going forward – try this Gedankenexperiment (thought experiment) and sleep on this over the holiday period: We successfully fold in all features by April, with two months (8 weeks) to Launch. We are truly in bug-fixing mode, climbing up to Launch candidate. We have a resource pool of UI QA, SI QA and CE QA people – all sharing a common goal: Test the product and prove it's fit-for-purpose to Launch. What efficiencies can be implemented? The obvious one is to combine the testing efforts, work collectively, turning the test cycle from four weeks down to one week or less. There will be no UI/UX changes, no new features – just P0 bug-fixing. SI release every week, not every two weeks, CEQA happens every week, not every two weeks. Automated regression testing happening daily, full coverage every weekend.
I like to rock the boat, kick the status-quo in the ass…so I pushed on further...

I couldn't sit by and let that happen, so I tried again, this time three months before the launch date - and instead of an email, I got everyone in a room and talked them through what I saw that others weren't seeing (it was quite draining on my energy, morale & confidence). But lo and behold, the very same senior manager that told me not to rock the boat, came around and was completely won over!  Next steps were to convince his direct reports, including the Launch Director, Development & SI Managers, and then the rest of the team.

The metaphor is really simple: You have a fixed deadline, the rate at which you're going is not good enough, and you will overshoot your mark. Given the metrics we have with defects outstanding, current productivity rates, it is clear that unless we make an intervention, you will miss this hard deadline.

To do this: You need to bank every decent build you can get. Target the core critical defects first, isolate only the crucial ones that are going to cause the most pain to users, get these fixed, preferably one at a time, or group them into themes, and then incrementally bank each build. The more builds you have in the bank, the better your chances for launch - if the next build regresses badly, you just go back to the last banked build. Gain confidence - Bank. Improve an area in performance - Bank. Serious showstopper fixed - Bank. Serious reliability problem nailed down - Bank.

The idea is really simple: Continuously bank, in a controlled and incremental manner.
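As a rough illustration of the banking idea, here is a minimal Python sketch. The `Build` and `BuildBank` names, the version strings and the defect IDs are all hypothetical, made up for this example; on the project itself banking was a manual process, not a tool.

```python
from dataclasses import dataclass, field


@dataclass
class Build:
    """A candidate build and the defect IDs it claims to fix (hypothetical fields)."""
    version: str
    fixed_defects: tuple = ()


@dataclass
class BuildBank:
    """Keeps every build that passed testing; the newest one is the fallback."""
    banked: list = field(default_factory=list)

    def submit(self, build: Build, passed_testing: bool) -> bool:
        """Bank the build only if it tested clean; otherwise leave the bank untouched."""
        if passed_testing:
            self.banked.append(build)
        return passed_testing

    def launch_candidate(self):
        """Return the last banked build, or None if nothing has been banked yet."""
        return self.banked[-1] if self.banked else None


bank = BuildBank()
bank.submit(Build("1.0.1", ("DEF-101",)), passed_testing=True)   # showstopper fixed: bank
bank.submit(Build("1.0.2", ("DEF-102",)), passed_testing=True)   # reliability fix: bank
bank.submit(Build("1.0.3", ("DEF-103",)), passed_testing=False)  # regressed badly: not banked
print(bank.launch_candidate().version)  # 1.0.2
```

The point of the sketch is that a bad build never costs you more than one cycle: the launch candidate is always the most recent build that was actually banked.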

Granted - there is a lot of manual administration and overhead required: We have large teams to manage, multiple vendors, and SI need to produce builds more frequently than they're used to. There is no overnight regression testing. We are not comfortable collecting too many bugs into one release without a careful, quick way of regression testing - so we have to limit the number of bugs per release to manageable chunks, aligned to the core issues the business wants fixed. Each chunk is tested - once happy, bank.

We have loads of manual testers scattered around the project. I convinced the management team to rein them in - bankable builds need to be tested quickly, so get the whole team in on it - reduce your test cycle from 4 weeks to one week. Reduce further to 3 days if possible, using risk-based testing: because each bankable build is limited to a small number of bug fixes, the risk of major regression & instability is low.

In the 12 weeks you have remaining, at the current rate you will only fit in 3 more release candidates. If you take the weekly bankable build approach, you can get close to ten release candidates, each one in itself being a launch candidate build because we would've banked it already.
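The arithmetic behind that claim can be sketched in a few lines of Python. The function name is made up for this example, and it assumes back-to-back cycles with no gaps, so the weekly figure is a ceiling; "close to ten" leaves room for the odd cycle that slips.

```python
def release_candidates(days_remaining: int, cycle_days: int) -> int:
    """Number of complete build-and-test cycles that fit in the remaining time."""
    if cycle_days <= 0:
        raise ValueError("cycle_days must be positive")
    return days_remaining // cycle_days


WEEKS_REMAINING = 12
days = WEEKS_REMAINING * 7

print(release_candidates(days, 28))  # four-week cycle -> 3
print(release_candidates(days, 7))   # one-week cycle  -> 12
```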

The final blow usually comes in with comparing the scenarios from the game show The Weakest Link. In this show, a team of people must answer questions in each round, each correct answer gets them some money, the money accumulates per correct answer (where the team have the option to "Bank" to save that money), an incorrect answer leads to all the money being lost (if not banked), and the team starts over again. Banking too early isn't good, since there is a time impact. Banking too late (time runs out, or incorrect answer) results in losses.

In pretty much the same way, software releases need to be banked at the right time. Banking too early just causes wheels to spin, because the software isn't ready. Banking too late, of course, means you miss your deadline. Failing to bank, i.e. the wrong answer, means you start from zero again because you don't have any stable release to go back to (you've built feature upon feature, the code is now a mess, and it is very hard to go back and reach a point of stability again - so go back to start). This may sound like blasphemy to seasoned software professionals, but it does happen, especially with young, undisciplined teams.

In order to get "bankable builds" one has to be surgical - only fix what is really really necessary, so you need to enter into Surgical Mode...

Surgical Mode is really about delivering bug fixes only and only for the really critical burning issues, preferably one fix at a time, such that the fix can be backed out if necessary (this doesn't fit completely well with Surgery by the way, no real undo, just stop and wait).

Nevertheless, the comparison is still quite powerful. A surgeon takes pains to investigate the course of action required, targeting specifically the original problem area alone. If other areas for concern are uncovered during the course of the surgery, the surgeon decides whether deviating from the current plan would help, or whether they can be postponed and tackled later. The same rationale is required when qualifying a software release build:
Target only the main area for concern, and don't deviate from the agreed course of action unless new issues are related and the fixes can be lumped together. Be surgical - avoid unnecessary inclusions. Be specific. Be prepared to back out...Be surgical, precise, clinical!
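One way to picture the Surgical Mode rule is as a gate on incoming vendor releases: accept a release only if everything it changes is something that was asked for. The Python sketch below is purely illustrative; the function names and defect IDs are hypothetical, and on the project this check was done by people in triage, not by a script.

```python
def is_surgical(requested: set, delivered: set) -> bool:
    """A release is surgical if it fixes at least one requested defect and nothing else."""
    return bool(delivered) and delivered <= requested


def unrequested_changes(requested: set, delivered: set) -> list:
    """Defect IDs the vendor touched that nobody asked for, sorted for reporting."""
    return sorted(delivered - requested)


requested = {"DEF-201", "DEF-202"}

print(is_surgical(requested, {"DEF-201"}))                     # True: one targeted fix
print(is_surgical(requested, {"DEF-201", "DEF-999"}))          # False: extra change slipped in
print(unrequested_changes(requested, {"DEF-201", "DEF-999"}))  # ['DEF-999']
```

Keeping each release to a small, known set of fixes is also what makes backing one out practical.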

Just one snippet from the feedback I received from the core decision maker:
Hi Muhammad
This one week cycle you have introduced on [The Project] is a stroke of genius. We have a lot to thank you for on this project, as you have saved our collective asses several times. I for one, really appreciate the quality and quantity of effort you put into supporting us.
When we make it on the 1st August you should be able to look back on this project with a great deal of satisfaction. [This Business] is not the easiest place to bring order to, but you can't fault the guys on their commitment to making things happen :)

Have a great weekend my friend!
Regards XXX

Recommendations I set-off with the SteerCo at Project Kick-Off
At the outset of the project, this is what I prepared for management, that still holds true to this day (3 years onwards the team is still pretty much on their way to achieving these goals):

In order to have a practical chance of achieving the target for launch, a different approach to Project Delivery is being taken: Continuous Delivery. Continuous Delivery is a methodology born from Agile & Lean Manufacturing Principles – the idea of continuously iterating functionality upon functionality, thereby reaching the end-product. By integrating early, testing early, and failing early, issues are highlighted early in the project life cycle, consequently highlighting risks early as well.

The aim for [PROJECT] is to start End-to-End testing as early as possible. This in itself has its own challenges. Apart from being an entirely new concept to [COMPANY], the fundamental changes required are that of processes and changes in the work behaviour. Traditionally [COMPANY's] testing approach was very sequential, staged, the classic “Waterfall” approach to Software Product Delivery. This approach requires detailed analysis & planning up-front, ideally requiring a stable plan, there is a heavy reliance on detailed documentation. 

Given the change in direction with the [EPG UI], the project has experienced a significant blow in terms of losing the detailed documentation that was previously available as part of the [ORIGINALLY DESIGNED] User Experience. This Product Documentation was a result of 18 months’ work that has literally gone down the drain, and has forced the UI team (both development and test team) to re-start from scratch. Unable to have the luxury of detailed documentation up-front, the UX development has taken the Agile Approach to Software Development, incrementally delivering the UX on two-weekly cycles (sprints). This implies the documentation is also adapted, just-in-time to support the development team sprint-by-sprint.

Lack of up-front documentation does strain downstream teams, specifically the UI-QA, SI-QA, CE-QA and Field Trial teams. Hence the QA teams have to adapt to this new mode of working, to plan week-by-week, keeping abreast of the development planning in more detail than they are accustomed to. This implies a similar level of Agile Planning for QA teams as practiced in the development teams. This in itself is a radical departure and requires a change in mind-set. Coaching sessions have been set up to drive through the philosophy of Agile and the current delivery approach.

Field Trials will be done on a continuous basis, starting from basic functionality. In December, the project released a first milestone of a basic Zapper – as a way of building confidence with the team and publishing early, even though the product was very much in its infancy. That is the spirit of Continuous Delivery: starting small, delivering incremental functionality leading up to a stable release. This approach, again, is a departure from the current [COMPANY] implementation of Field Trials. Field Trials will receive development releases and will be asked to provide early feedback. Some of the feedback might be ignored if reported too late. There will be missing functionality for certain features, but the release to field trials will be a conscious decision as guided by the Product Owner.

Continuous Delivery imposes strict adherence to detailed planning on all streams of work. There needs to be extremely synchronised planning sessions and coherent defect review and feedback sessions. Processes need to be efficient and running like a well-oiled machine. 

Continuous Delivery also imposes the notion of Continuous Quality. The Quality Criteria for this project will be expanded ten-fold, introducing strict quality control gates that have previously not been enforced in the project. This means that all levels of component delivery and QA must be exposed to quality control, specifically ensuring Regression is under control, and strictly monitored. The option of rejecting builds, preventing onwards testing, will be enforced in this project, which again is a departure from current practice. A detailed template for defining Defect Severities and Priorities will be introduced to better manage and control Quality.

Continuous Quality depends on a mature strategy for Automation. Current industry best practices demand automation testing starting with component-level unit test automation, component-group automation, subsystem and system automation testing. Testing for regression at all levels of the Software Stack as part of the release and acceptance criteria will ensure defects are captured well in advance of higher level QA activities. [YOUR COMPANY] is just starting with [STB] automation; the aim is to have STB automated testing targeting performance and stability. The same level of automated testing is expected from Component Vendors, although at the time of this writing the maturity of their automated regression testing is not fully understood, nor is it readily available to QA teams.

Given [YOUR COMPANY'S] immaturity with Continuous Delivery, this in itself places a challenge on teams. Processes in general take time to mature. As with any process change, there are inspection and adaptation phases that are essential for monitoring and controlling the process implementation, which means the project must be prepared to make U-Turns should the intended processes not be working as efficiently as planned. However, given the current time line, there is no room for error, and no room for changing processes mid-way through the project. The Agile practice of Retrospectives will be applied at the Programme Level in an effort to ensure processes are reviewed. Retrospectives in a nutshell involve asking the following questions: “What is Working Well?”, “What Hasn’t Worked So Well?”, “What can we do to help Improve?”

Continuous Delivery implies the management and tight co-ordination of parallel work streams. The project operates around a central heartbeat, driven by System Integration. The current heartbeat is a time-box / iteration / sprint of two weeks:
  • UI Development: 2 week Sprints
  • Middleware Development: 2 week Sprints
  • STB SI-QA Release Cycles: 2 week Sprints, but lags development by 1-2 Sprints. The goal is to never be too far behind, aiming for lagging by one sprint
  • Headend SI-QA: 2 week Cycles – the concept of Sprints not yet introduced
  • End-to-End SI: 2 week Cycles – the concept of Sprints not yet introduced
  • [MIDDLEWARE] End-to-End Testing: 3 week Cycles
  • [UAT] CEQA: 2 week Cycles – the concept of Sprints not yet introduced
  • Field Trials: 2 week Cycles depending on releases from STB SI – the concept of Sprints not yet introduced
