Wednesday 23 January 2013

SI QA Sanity Tests debate...round one

Today, after 19 months at the company & about 18 months of involvement in ongoing projects, we finally decided to have a discussion around the topic of "Sanity testing". I had a tendency to always comment on why we continued with downstream testing even though "sanity" had failed, but I was always overridden: it was deemed acceptable for the current stage of the project. Since I wasn't directly involved in managing that project, I let it be - the timing wasn't right yet...

But even today, on a project that I do have direct involvement in as the overall programme manager, having influenced & steered much of the development / integration / test process improvements to bring the project back on track, there is still confusion around what "sanity testing" really means. The technical director managing the launch, whose position in the corporate hierarchy is one level above mine (and the QA manager's), has maintained a different view of Sanity Testing than I have. But it seems I'm not alone in my view, having recently gained the support of the QA Manager as well as the Product Owner - so we decided it was time to meet and discuss the topic in an open forum, to try and reach a common understanding, avoid confusing the downstream teams, and establish a strategy for possible improvements going forward...easier said than done!

One of the challenges I face almost daily is that of enabling change across the project team, sticking my nose into the Development / Integration / Test departments with the aim of steering them in the right direction, to help increase the chances of the project delivering. I do this even though, strictly speaking, I'm positioned within the organisation as a "Programme Manager or Strategic Planner", wearing the hat of Project Management. As if applying PM pressure across the team isn't enough of a thorny subject already, I still go and push for process changes in Dev / Int / QA - am I a sucker for punishment or what?? But I can't help myself really, because of my technical background: I've been through all the stages of software engineering as both engineer and manager, and have worked for some of the best companies in the industry, which over the years evolved towards best practices. So when I see things being done that deviate from what I consider the industry norm, I just can't help but intervene and provide recommendations, because I can help short-circuit the learning curve and help the team avoid repeating expensive mistakes!

And so to the topic of QA Sanity: we've never respected the term, and despite a failing sanity run we continue to test an unstable product in the hope of weeding out more failures. The questions on the table:

  • What does Sanity mean? 
  • What do we do when Sanity fails?
  • Do we forsake quality until later in the project?

Overview of the Work We Do
We are in the business of Set-Top-Box (STB a.k.a. "decoder") application development, integration & testing. If you have digital TV from Sky / Dstv / DirecTV / BT - this is your PVR & EPG, the TV Guide application you use to watch TV!

The STB software is a complicated stack, consisting of many components, but can be summarised as containing the following core pieces: Application, Middleware, Drivers, CA, Kernel. The Middleware exposes APIs (Application Programming Interfaces) for EPG (Electronic Program Guide) development, the Drivers expose interfaces for the Middleware to access the Hardware, the CA is the security engine that protects PayTV revenues by managing access to content, and the Kernel is the Operating System providing the Drivers / Middleware / Application / CA with core system services. All of these components are independently developed & tested; they come together through the process of System Integration, and the fully integrated stack is subjected to Testing / QA before being exposed to real users. We manage all of this activity by doing the Application Development, System Integration & QA ourselves, using external vendors to supply the Middleware / Drivers / CA / Kernel / Hardware components.
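
To make that stack description a little more concrete, here's a minimal, hypothetical sketch (Python, nothing to do with our actual build system) of the layers and the integration step that produces the build QA receives. The component roles mirror the description above; the version strings are made up.

    # Hypothetical sketch of the STB stack and a pretend "System Integration" step.

    STACK_LAYERS = [
        # (layer, supplied by, role)
        ("Application / EPG", "in-house", "the TV guide the viewer interacts with"),
        ("Middleware",        "vendor",   "APIs the EPG is built against"),
        ("CA",                "vendor",   "conditional access / content security"),
        ("Drivers",           "vendor",   "hardware access for the Middleware"),
        ("Kernel",            "vendor",   "OS services for everything above"),
    ]

    def integrate(component_versions: dict) -> str:
        """Pick one version of each layer and produce a single build identifier
        that QA will test as a whole (the 'fully integrated stack')."""
        missing = [layer for layer, *_ in STACK_LAYERS if layer not in component_versions]
        if missing:
            raise ValueError(f"Cannot integrate, missing components: {missing}")
        return "stb-build-" + "-".join(component_versions[layer] for layer, *_ in STACK_LAYERS)

    if __name__ == "__main__":
        build = integrate({
            "Application / EPG": "app1.4",
            "Middleware": "mw7.2",
            "CA": "ca3.0",
            "Drivers": "drv5.1",
            "Kernel": "krn2.6",
        })
        print(build)  # the integrated build that enters Sanity testing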

We are a young outfit, just under five years old I believe, and have a long way to go. Starting from scratch, one would hope that best practices in Software Development / Testing were instrumented from the beginning - but not so. We're now in the classic position of having created a lot of inefficiencies through poor practices, and having got used to certain misconceptions; it will take some time & energy to change and realign the overall mindsets of individuals & teams.

We've made a start though. Part of my brief was to take the team to the next level, or as close to the required maturity level as possible, while at the same time delivering on a critical project that's been slipping for the last two years. We have come a long way in the last 12 months; senior management have been open to many of my recommendations, especially when those recommendations were echoed by the independent external consultants & auditors hired to coach & inject process improvements...

The Current Problem
Focusing on the current problem of Sanity Testing: we've set up the teams and have the workstreams in place. There is a QA team that follows QA procedures: Sanity, Functional, Exploratory, Soak, Performance, Stability tests, etc. We have Entry Criteria, Exit Criteria, Acceptance Criteria, etc. all defined on paper. We communicate test plans & test strategy in advance. But nobody seems to be paying attention. When Sanity fails, testing continues. The perception is that we're still too early in the project (which has been going on for 3+ years now) to be strict on quality, and we don't want QA sitting and doing nothing when they could be finding bugs we need to fix. A reasonable concern, but then why have Sanity in the first place??
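
As an illustration of what those Entry Criteria say on paper, here's a hedged sketch of a gate that only releases the downstream phases once Sanity has passed. The phase names come from the list above; the numbers and structure are invented, not our actual process documents.

    # Hypothetical sketch of the entry-criteria gate we have on paper.
    from dataclasses import dataclass

    @dataclass
    class SanityResult:
        total: int
        failed: int

        @property
        def passed(self) -> bool:
            return self.failed == 0

    DOWNSTREAM_PHASES = ["Functional", "Exploratory", "Soak", "Performance", "Stability"]

    def entry_criteria_met(sanity: SanityResult) -> bool:
        """Entry criterion as written: Sanity must pass before downstream testing starts."""
        return sanity.passed

    def plan_test_cycle(sanity: SanityResult) -> list:
        """What should happen: a failed Sanity run blocks the rest of the cycle until patched."""
        if not entry_criteria_met(sanity):
            print(f"Sanity failed ({sanity.failed}/{sanity.total}) - downstream testing blocked, request a patch")
            return []
        return DOWNSTREAM_PHASES

    if __name__ == "__main__":
        print(plan_test_cycle(SanityResult(total=30, failed=3)))   # -> []
        print(plan_test_cycle(SanityResult(total=30, failed=0)))   # -> full downstream cycle

What actually happens on the project is the first case: Sanity fails, and the full cycle runs anyway.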

My Perspective
I sketched the below picture on the whiteboard to frame the discussion from my perspective, experience & expectation.
[Image: whiteboard rough sketch]

[Image: pretty picture - high level, 5+ components feeding into Integration]
The picture should say it all. From my side, I've always been used to testing early, from upstream - this was the case even before I transitioned to Agile methods almost seven years ago. Sanity represents the Core functionality of the product that Must Always Pass. It is the first level of testing for any test stream, even in developer testing. When Sanity fails, ideally everything stops until the Sanity failure is fixed. Of course, in reality, sound judgement can be used to continue with the bits of testing not affected by that particular failure. But I've always insisted that vendors must turn around a fix for the sanity failure as soon as possible, because it blocks full testing downstream. I expect the development team to have instrumented automated unit testing, continuous integration and regression testing on an almost daily basis. Quality starts early, and as far upstream as possible. System Integration is responsible for delivering a stable build to the downstream customer, hence Core Sanity must pass. Sanity tests take a slice of the entire product feature set; if the baseline core functions are not working, there is no point in further downstream testing. Stop everything, give us a patch, we continue.... I was at first alone in this view, but it seems I've now got the support of the QA Manager, Product Owner & Dept. Manager...
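
To show that "sound judgement" point rather than just assert it, here's a small hypothetical sketch: when a core sanity case fails, everything it affects is parked until the patch arrives, while unrelated suites may continue. The sanity case names and their mapping to downstream suites are invented for illustration.

    # Hypothetical sketch: park only the downstream suites affected by a sanity failure.

    AFFECTS = {
        "sanity_live_tv":  {"Functional:Zapping", "Soak", "Stability"},
        "sanity_record":   {"Functional:PVR", "Soak"},
        "sanity_playback": {"Functional:PVR", "Functional:Trickmodes"},
        "sanity_guide":    {"Functional:EPG", "Exploratory"},
    }

    def downstream_plan(failed_sanity_cases: set, all_suites: set) -> set:
        """Return the downstream suites it is still sensible to run."""
        blocked = set()
        for case in failed_sanity_cases:
            blocked |= AFFECTS.get(case, all_suites)  # an unknown failure blocks everything
        return all_suites - blocked

    if __name__ == "__main__":
        suites = {"Functional:Zapping", "Functional:PVR", "Functional:Trickmodes",
                  "Functional:EPG", "Exploratory", "Soak", "Stability"}
        print(sorted(downstream_plan({"sanity_record"}, suites)))
        # PVR & Soak are parked until the vendor patch lands; the rest can proceed.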

The Technical Director's Perspective
This guy, however, is taking longer to influence. Concerned with meeting the timeline / delivery schedule, his natural response is that focusing on Quality and applying strict quality gates prematurely is going to delay the project. He maintains that serious testing only starts about three months before launch, when all features are developed & suitable stability is reached; then we enter the massive task of flushing out all the quality issues. Moreover, a product will always launch with known showstoppers, and it is entirely acceptable to apply different quality levels at different stages of the project. He maintains the view that system development takes the following phases:

Bleeding Edge -> Alpha Phase -> Beta Phase (Functionally Complete) -> Launch Candidate

As we move from one stage to the next, the software becomes more stable and more functionally complete, and the incidence of defects found is expected to increase until we reach Functionally Complete. It is only at the Launch Candidate phase that we apply stringent quality gates and insist on Sanity always passing. It is nevertheless a rational approach that sets targets for the level & types of defects allowed to exist as we move from one stage to the next. It is, in essence, no different to my generic release campaign methodology.
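
To capture his model as something testable rather than a slogan, here's a sketch of phase-by-phase quality targets. The thresholds are entirely made up; only the shape of the idea - looser gates early, strict gates at Launch Candidate - reflects his position.

    # Hypothetical phase gates expressing "different quality levels at different stages".

    PHASE_GATES = {
        "Bleeding Edge":    dict(showstoppers=None, sanity_gate=False),
        "Alpha":            dict(showstoppers=20,   sanity_gate=False),
        "Beta":             dict(showstoppers=5,    sanity_gate=True),
        "Launch Candidate": dict(showstoppers=0,    sanity_gate=True),
    }

    def gate_ok(phase: str, open_showstoppers: int, sanity_passed: bool) -> bool:
        """Does the build meet the (invented) quality bar for this phase?"""
        gate = PHASE_GATES[phase]
        if gate["sanity_gate"] and not sanity_passed:
            return False
        limit = gate["showstoppers"]
        return limit is None or open_showstoppers <= limit

    if __name__ == "__main__":
        print(gate_ok("Alpha", open_showstoppers=12, sanity_passed=False))          # True - his view
        print(gate_ok("Launch Candidate", open_showstoppers=1, sanity_passed=True)) # False - too late to argue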

This is a realistic, practical & traditional (although naive, IMHO) argument, especially if this is how you've always viewed the world and have never seen Agile practices in action. It's difficult to change a mindset when this is the only view of the world a person subscribes to. On top of this, he outranks me...typical office org structure politicking...

But I've always maintained the view of delivering quality early, at all stages. Focusing on quality early might at first seem to delay the schedule, but overall it significantly improves the chances of launching the product, because the QA cycles & in-field testing (which can last for months on end) become much shorter when bugs are fixed closer to where they were introduced. This is not just my view; it's the way most of the world is heading...

But for now we've agreed to at least settle on a common understanding of what Sanity test cases mean: are they testing core functionality or not? Create a Core Sanity Suite, establish the core failure conditions, and agree to prioritise Sanity failures above everything else. We will also consider applying pressure on Vendors to deliver Core Sanity fixes as patches in the immediate test cycle, instead of parking them for two releases later.

What is the rest of the world doing?
I've written previously about the different worlds of testing & the power of continuous integration and regression testing. That's what the rest of the world is doing, and especially when you want to apply Agile principles there is no real stage left in the project for "stabilising & bug fixing" - it should be a continuous process, built in at every level of the software stack... I have personally seen this work on both small & massively large projects, so I believe it can be done!
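
As a rough illustration of "built in at every level", here's a hypothetical sketch of per-component and integrated-stack pipelines where a failure stops that pipeline immediately. The pipeline names and stages are illustrative only; this is not an actual CI definition from our project or any vendor's.

    # Hypothetical sketch: continuous sanity/regression at every level of the stack.

    PIPELINES = {
        "Application (per commit)":   ["unit tests", "component sanity", "regression"],
        "Middleware (per drop)":      ["unit tests", "component sanity", "regression"],
        "Integrated stack (nightly)": ["core sanity", "functional regression", "rolling soak"],
    }

    def run_pipeline(name: str, stages: list, results: dict) -> bool:
        """Run stages in order; the first failure stops the pipeline immediately."""
        for stage in stages:
            if not results.get((name, stage), True):  # results would come from the test harness
                print(f"{name}: '{stage}' failed - fix before anything else moves")
                return False
        print(f"{name}: green")
        return True

    if __name__ == "__main__":
        simulated = {("Integrated stack (nightly)", "core sanity"): False}
        for name, stages in PIPELINES.items():
            run_pipeline(name, stages, simulated)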

Conclusions
We agreed to settle on removing the ambiguity around SI Sanity Test Cases. First, we will at least identify the Core test cases that are valid for proving Core functionality. For example, "Smooth rewind doesn't work nicely" should not fail the core test case of testing Rewind - "Pass - Rewind works, but the observation is that it's not smooth". Once we've settled on the Core test cases, going forward any failure in Core Sanity must be addressed immediately. Exploratory testing is not repeatable and is often subjective: whilst there's tremendous value in doing Exploratory / Freeplay testing, the failures must be reviewed by the Product Owner to assess their impact, where a decision can be made either to stop testing or proceed with downstream testing.
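
To pin down the Rewind example, here's a hypothetical sketch of the verdict model we're converging on: cosmetic issues become observations on a passing core case, and only genuinely broken core behaviour fails (and therefore blocks) the cycle. The function and verdict names are my own invention, not agreed terminology.

    # Hypothetical sketch of the Core Sanity verdict model.
    from enum import Enum

    class Verdict(Enum):
        PASS = "pass"
        PASS_WITH_OBSERVATION = "pass with observation"
        FAIL = "fail"

    def judge_core_case(core_behaviour_works: bool, observation: str = "") -> tuple:
        """Core Sanity verdict: cosmetic issues are recorded as observations, not failures."""
        if not core_behaviour_works:
            return Verdict.FAIL, observation or "core behaviour broken"
        if observation:
            return Verdict.PASS_WITH_OBSERVATION, observation
        return Verdict.PASS, ""

    if __name__ == "__main__":
        # The Rewind example from the discussion:
        print(judge_core_case(True, "rewind works but is not smooth"))
        # -> PASS_WITH_OBSERVATION: downstream testing continues
        print(judge_core_case(False))
        # -> FAIL: stop, request a patch, retest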

But getting teams to adopt stricter controls for future phases is still a matter for debate :-( Let alone introducing Agile concepts where, ideally, everything should happen within the sprint or iteration, without passing over to "downstream test streams".

Doing Agile properly in STB projects is a topic for another day...

1 comment:

  1. Sanity is exactly that: core functionality that must always work. If sanity fails, team should do everything to rectify it. Don't park sanity tests for later. You will only suffer quality issues later...good luck with trying to get these guys to change!
