Monday 16 February 2015

Non-Functional Requirements and Sluggishness


There is a lot of talk about whether writing code, or creating software, is really an art, a craft, or a highly disciplined software engineering endeavour. This post is not about that debate, although I do wish to make my stance clear: great software is just as much a craft as it is a software / systems engineering discipline. One should not discount the analysis and rigour that should go into the design of any software system, be it a relatively simple 4th/5th tier high-level application built on very easy frameworks (e.g. WebApps, Smartphone & Tablet frameworks), or very low-level, core, native software components (e.g. OS kernel, drivers, engine services).

Take the Set Top Box (STB) for example: a consumer device that is usually the central and focal point of the living room. People expect the thing to just work, to perform as well as any gadget in the household, being predictable, reliable and always delivering a decent user experience.

How then does one go about delivering performance in an STB software stack? Is it the job of the PayTV operator to specify the NFRs (Non-Functional Requirements) in detail? Or is it the job of the software vendors (Middleware / Drivers) to have their own internal product performance specifications and benchmark their components against their competitors? Does the PayTV operator benchmark their vendors against the performance of the legacy decoders in the market, for example: this new product must be faster than all infield decoders by 20%? Or is there a collaboration between the technical owners (i.e. architects) to reach mutually agreed targets?

Take the problem of Sluggishness. According to good old Google,
sluggish
/ˈslʌɡɪʃ/
adjective
  1. slow-moving or inactive.
    "a sluggish stream"
    synonyms: inactive, quiet, slow, slow-moving, slack, flat, depressed, stagnant, static
    "the sluggish global economy"

When we're field testing STBs over long periods of time (or sometimes after not-so-long usage), people & customers report:

  • The box got very sluggish, needed to reboot
  • The STB isn't responsive to remote presses, takes 4-5 seconds to react 
  • Search takes too long and I can't do anything else but wait for search to complete
  • Channel change is slow compared to my other decoder
  • When using the program guide or navigating through the menus, the box gets stuck for a few seconds as if it's processing some other background activity - it's very irritating to my experience

This feedback, whilst extremely useful to the product team, is very hard for engineers to characterise without access to system logs to trace through the events that led up to the slowdown in performance and the resulting degraded user experience.

In my view, based on my own experience of writing STB applications and managing complex systems projects, I believe that unless we quantify in a fair amount of detail what sluggishness means, it becomes a cop-out for passing the buck on to other parties: either the product owners didn't take the time to spec out the requirements properly, or the application is using API services it has no control over...

In the remainder of this post, I will touch on an example from a previous project of how we handled the problem of sluggishness, through a process of rigorous Non-Functional Requirements focused on Performance Specifications...

Quantify all Non-Functional Requirements (NFRs)

In my opening I kinda passed a value judgement on software engineers. Having been one myself, I am biased towards the traditional way of working, where software was seen as a mainstream engineering discipline and less as a craft, as the new movement would have it - although I believe it's both a craft and a discipline. What I've come to see, through interacting with younger programmers - people who have just graduated, have under five years of working experience, or taught themselves to code - is that the appreciation for the finer, deeper topics, such as performance requirements driving design, using third-party libraries with care, or ensuring that every user-facing feature is followed up with specific performance requirements, is lacking, or in most cases non-existent.

With the move to Agile/Scrum/Lean adoption (which I personally motivate for as a change agent), it's become a little more confusing as to who should take ownership of such requirements. Is it the Product Owner, Architect, Business Analyst, or Testers? Hey, hold on, this is a cross-functional team, there are no specific roles here apart from the Product Owner (PO), so the PO must specify these requirements. So how technical should a PO be? To what level of detail should one specify a product's performance requirements? These are topics for another day.

NFRs are huge; they are both wide and deep and generally cover many factors under the Quality Engineering banner - generally the "ility" attributes such as "Reliability, Portability, Maintainability, Augmentability, Compatibility, Expandability, Flexibility, Interoperability, Manageability, Modifiability, Operability, Scalability, Survivability, Understandability, Usability, Testability, Traceability, Verifiability" [Source: Software Engineering Best Practices]

Set Top System Performance Requirements
In a previous project, the customer appointed a system architect responsible for the performance requirements alone. Working with the product's feature set (essentially covering all the major features accessible by screens / menus), the architect produced well over 300 atomic requirements across the entire spectrum, from System Start-Up, through the most used Screens (TV Guide, Planner, Search, Video-on-Demand, Pay-per-View, Parental Control), to System Shut-Down.

These performance requirements were sliced through the entire software stack. For example, the requirements roughly looked like this:
  • The system must boot up from cold start in no longer than 3 minutes
  • Channel changes should take no longer than 2 seconds regardless of transponder hopping
  • Searching the program guide should be deterministic always, search results to be displayed within 2 seconds of instigating the request
  • From boot-up, to renting a box office movie, should happen within 3 minutes of start-up
  • Time to delete a recording should be less than a second, time to book a recording should be at most a second
  • The decoder must run for three weeks on a flat memory profile without rebooting
  • System memory garbage collection should happen seamlessly in the background without interrupting the user
  • Service information database changes should take no more than 5 minutes for a complete database refresh
  • Time to search database should always be constant, time to get program information for display on the UI should be no longer than half a second
We had 300+ of these one-liners. We were providing the core services of the software stack: Middleware, including the Driver SDK & Conditional Access, and a Services API for Application Development. Our Services API had to support many application domains: C/C++, Java, Flash, ActionScript, HTML.
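
As a purely illustrative sketch (not our actual harness), here is how one of those one-liners could be pinned down as an executable check. The HarnessClient class and its methods are hypothetical stand-ins for whatever remote-control / automation interface a test rig exposes; the point is that "no longer than 2 seconds" becomes a measurable assertion rather than an opinion.

```python
import time
import statistics

CHANNEL_CHANGE_BUDGET_S = 2.0   # from the NFR: "no longer than 2 seconds"
SAMPLES = 20                    # a single measurement hides jitter; take several

class HarnessClient:
    """Hypothetical automation interface to the STB under test."""
    def tune(self, channel: int) -> None:
        raise NotImplementedError("wire this to the real test rig")
    def wait_for_video_lock(self) -> None:
        raise NotImplementedError("e.g. poll decoder status or use video analysis")

def measure_channel_change(stb: HarnessClient, from_ch: int, to_ch: int) -> float:
    """Return the elapsed time of a single zap, in seconds."""
    stb.tune(from_ch)
    stb.wait_for_video_lock()
    start = time.monotonic()
    stb.tune(to_ch)
    stb.wait_for_video_lock()
    return time.monotonic() - start

def test_channel_change_meets_budget(stb: HarnessClient) -> None:
    samples = [measure_channel_change(stb, 100, 101) for _ in range(SAMPLES)]
    worst = max(samples)
    p95 = statistics.quantiles(samples, n=20)[18]   # 95th percentile
    assert worst <= CHANNEL_CHANGE_BUDGET_S, (
        f"worst zap {worst:.2f}s exceeds {CHANNEL_CHANGE_BUDGET_S}s (p95={p95:.2f}s)")
```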

Here is a snapshot of what a typical STB System might identify as performance requirements:
[Figure: Typical Feature Breakdown of a Set Top Box Product: Performance Areas]
How does one go about quantifying the requirements? One way would be to have a look at how Google does Test Analytics, using their ACC (Attribute-Component-Capability) model. I think such performance requirements are a natural fit for this model; I've written about how ACC can be used for STB testing here.
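
To make that concrete, here is a minimal, invented sketch of what a few ACC cells might look like for an STB, with each capability written as a measurable performance statement. The components, attributes and targets are placeholders, not a real product spec.

```python
# A minimal ACC (Attribute-Component-Capability) grid for an STB, in the spirit
# of Google's Test Analytics model. All entries are invented placeholders.
ACC_GRID = {
    ("TV Guide",       "Fast"):       "Guide data renders within 1s when cached",
    ("TV Guide",       "Responsive"): "Key presses acknowledged within 250ms",
    ("Channel Change", "Fast"):       "Zap completes within 2s across transponders",
    ("Search",         "Fast"):       "Results displayed within 2s of the request",
    ("Planner",        "Reliable"):   "Booking a recording takes at most 1s",
}

def capabilities_for(component: str):
    """List the testable capabilities attached to one component."""
    return [(attr, cap) for (comp, attr), cap in ACC_GRID.items() if comp == component]

if __name__ == "__main__":
    for attr, cap in capabilities_for("TV Guide"):
        print(f"TV Guide / {attr}: {cap}")
```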

Quantifying / Breaking Down System Requirements Requires Discipline
Almost every software stack has a layered design, with clean separation of APIs consisting of one or more components or services. Recall this particular software stack shown here:

When the customer says the following:
The System shall present the program guide data within 1 second when data is cached, within 5 seconds when data is not cached.
Channel change should take no longer than 1.5 seconds on the same transponder, and 2.5 seconds across different transponders.

The system architect has to work with the customer to identify reasonable targets for the major software components. Starting with just a rough slice of the stack:
Application Component: 20%
Middleware Component: 60-70%
Driver Component: 10-20%

Depending on the experience of the system architects, and the knowledge of existing products out there, they would come to a reasonable compromise. 

The key is to agree on a realistic target based on existing products in the market, and aim to be 20% faster than the performance of all infield products, or prior releases of the same software.

Having worked out the percentage splits between components, we then created our own performance requirements for the Middleware, which we traced back to the customer requirements. That is, we took the System Requirements and, for each System Requirement, specified the corresponding Middleware Component Requirement. We went as deep as we could, delving into the call flows of each Middleware component involved in the API calls. Each component would have its own performance requirement.
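
As a rough sketch of that decomposition (the requirement ID, the shares and the numbers below are illustrative, not the actual project figures), an end-to-end target can be split into traceable per-component budgets like this:

```python
from dataclasses import dataclass

@dataclass
class ComponentBudget:
    component: str
    share: float      # fraction of the end-to-end budget
    budget_s: float   # derived time allowance in seconds

def decompose(system_req_id: str, end_to_end_s: float, shares: dict) -> dict:
    """Split an end-to-end target into per-component budgets, traced back to
    the originating system requirement. Shares must sum to 1.0."""
    assert abs(sum(shares.values()) - 1.0) < 1e-6, "shares must sum to 100%"
    return {system_req_id: [ComponentBudget(name, share, round(end_to_end_s * share, 3))
                            for name, share in shares.items()]}

# Example: a 2.5s cross-transponder channel-change target split with the rough
# 20/70/10 slice discussed above.
budgets = decompose("SYS-PERF-042", 2.5,
                    {"Application": 0.20, "Middleware": 0.70, "Driver": 0.10})
for b in budgets["SYS-PERF-042"]:
    print(f"{b.component:12s} {b.share:>4.0%}  {b.budget_s}s")
```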

Continuous Integration & Testing Performance Requirements (System vs Component)
At the component level, as a Middleware service provider, we had automated component tests and automated Middleware tests (which excluded the application component) that ran all the performance tests on a daily basis, through our Continuous Integration system (which I've written about before). Once the targets were agreed with the customer, the customer had live access to our CI Dashboard, as well as our Performance Test Reports, where we were being benchmarked against all infield decoders across all the hardware variants. Our goal was to be better than infield ever was; even though our technology stack was more complicated and offered more features, the customer experience couldn't be jeopardised in any way. Which customer wants a worse experience than what he/she is already used to?? Every software upgrade must result in improved user experience, and not the other way around... new features should never cause a serious regression in performance, unless it is technically impossible to avoid.

So we tracked performance as a major obligation, and whenever a change broke performance in a significant way, we would stop and either revert the change, or instigate an ORIT immediately to rectify the situation.
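
Here is an illustrative sketch of such a CI gate (not our actual tooling): compare the current run's timings against a baseline - the previous release or the infield benchmark - and fail the build when anything regresses beyond an agreed tolerance.

```python
import json
import sys

TOLERANCE = 0.05  # allow 5% measurement noise before calling it a regression

def check_regressions(baseline_path: str, current_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)   # e.g. {"channel_change_s": 1.8, "boot_s": 165.0}
    with open(current_path) as f:
        current = json.load(f)

    failures = []
    for metric, base_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None:
            failures.append(f"{metric}: missing from current run")
        elif new_value > base_value * (1 + TOLERANCE):
            failures.append(f"{metric}: {new_value:.3f}s vs baseline {base_value:.3f}s")

    for line in failures:
        print("PERF REGRESSION:", line)
    return 1 if failures else 0   # non-zero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(check_regressions(sys.argv[1], sys.argv[2]))
```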

As a component vendor, such as a middleware provider, it was our responsibility to execute our performance testing through our own Middleware Test Harness (which is a system test in its own right) and, using the same CI engine, to test individual middleware components (Component Unit Testing). We also took it upon ourselves to do our own System Testing, which integrated the frontend Application and executed the customer's System Performance tests - all this before we released to the customer (or sometimes in parallel).

Being a middleware service provider is an enormous responsibility, and ownership for quality is not someone else's problem. Don't pass the buck onwards to a System Integration (SI) or Customer Acceptance test team. As a middleware vendor, we took it upon ourselves to ensure we did everything we possibly could to guarantee a level of quality that would delight our customers.

In terms of System testing, we used our own internally developed automation framework that exercised the UI application and analysed the results through video analysis and timer profiles. We learnt a great deal through this - so much so that, when it comes to measuring timing and performance, beware of the additional equipment in your test set-up that can add delays and latency jitter to your measurements.
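
A small sketch of that lesson (the calibration offset and sample values are made up): report a distribution rather than a single number, and subtract a separately calibrated fixed delay for the capture rig so the equipment's latency doesn't get booked against the STB.

```python
import statistics

RIG_OFFSET_S = 0.040  # hypothetical calibrated delay of the capture equipment

def summarise(raw_samples_s):
    """Subtract the rig's fixed offset and report the spread, not one number."""
    adjusted = [max(0.0, s - RIG_OFFSET_S) for s in raw_samples_s]
    return {
        "min": min(adjusted),
        "median": statistics.median(adjusted),
        "p95": statistics.quantiles(adjusted, n=20)[18],
        "max": max(adjusted),
    }

print(summarise([1.92, 2.05, 1.88, 2.31, 1.95, 2.02, 1.90, 2.10]))
```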

Anyway, here are some pictures of the story:
[Figure: Snapshot of Middleware Test Report done on CI (Daily), Per Release, or Weekend Runs]
[Figure: Sample Customer Test Report - Red showed we had a long struggle ahead]

Concluding Remarks

Fine-tuning NFRs is a part of systems engineering that cannot be ignored. Time must be taken to quantify the specifics of the NFRs in reasonable detail, such that the requirements can be measured, controlled and tracked for regressions.

If NFRs are not specified, you enter the territory of vagueness and end up in lots of useless philosophical discussions about customer perception of sluggishness, with development teams believing their implementations are acceptable and that the system is behaving as expected, considering there are certain factors outside of their control.

Sometimes this is a fair argument from the technical team, although I would be really uncomfortable if I couldn't see proof of design or system requirements that target performance characteristics as part of the architecture / design. At the end of the day, we are writing code, which is powered by algorithms. Algorithms can be quantified (once the system has reached steady state, although I would argue that the path from system start-up to steady state should in fact be deterministic as well). When it comes to performance issues, I would argue that algorithms need to be proved to be either deterministic or non-deterministic for certain operations. How long would it take to search through a database, regardless of its size? Around what conditions have we modelled the system behaviour, or anticipated user experience (half load, full load, etc.)? When is it likely to trigger system memory garbage collection, etc., etc.?
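
For the database-search question, a quick, illustrative way to test the claim (the dictionary lookup below merely stands in for whatever service-information query the real system makes) is to time the same query against growing datasets and check that the cost does not grow with size:

```python
import time

def timed_lookup(index: dict, key) -> float:
    """Time a single lookup, in seconds."""
    start = time.perf_counter()
    index.get(key)
    return time.perf_counter() - start

def check_roughly_constant(sizes=(10_000, 100_000, 1_000_000), repeats=1_000) -> bool:
    costs = []
    for n in sizes:
        index = {i: f"event-{i}" for i in range(n)}           # fake programme data
        avg = sum(timed_lookup(index, n // 2) for _ in range(repeats)) / repeats
        costs.append(avg)
        print(f"n={n:>9,d}  avg lookup {avg * 1e6:.2f} µs")
    # A hash-based lookup should not grow with n; a big ratio between the largest
    # and smallest dataset suggests the operation is not constant-time after all.
    return costs[-1] < costs[0] * 10

if __name__ == "__main__":
    print("roughly constant:", check_roughly_constant())
```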

On the topic of Garbage Collection, I will write about it in a future post. To conclude, however: performance requirements must be factored into the design, which is likely to impact areas such as memory management (caches, garbage collection, etc.). Non-blocking IO, lock-free pooled memory, asynchronous versus synchronous APIs, database locking, threading, and process / interprocess communications are all topics that must be considered in the design to meet performance requirements.
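
As a tiny sketch of the synchronous-versus-asynchronous trade-off (the function names are invented for illustration): a blocking guide query stalls the calling thread for the full query time, while an asynchronous variant returns immediately and delivers results via a callback, keeping the UI responsive.

```python
import threading
import time

def query_guide_blocking(channel: int):
    time.sleep(0.5)                       # stand-in for the real database lookup
    return [f"programme-{channel}-{i}" for i in range(3)]

def query_guide_async(channel: int, on_done) -> None:
    """Run the lookup on a worker thread and hand results to a callback."""
    def worker():
        on_done(query_guide_blocking(channel))
    threading.Thread(target=worker, daemon=True).start()

if __name__ == "__main__":
    query_guide_async(101, lambda results: print("async results:", results))
    print("UI thread is free to keep handling key presses...")
    time.sleep(1)                         # keep the demo alive for the callback
```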

A lot of sluggishness problems can be managed and avoided through automated testing that cuts right through the software stack, down to individual component testing. When a regression is found, it is dealt with immediately, before it reaches the system level. If sluggishness issues are a recurrent topic in customer releases, it generally points to performance requirements not having been included as part of the design/architecture or implementation.
