Agile Product & Project Management Model - Large-Scale Set-Top-Box Development Project
A while ago I started writing about my experiences managing large-scale software development projects using the Agile philosophy of Disciplined Agile Delivery (DAD). To do this, I first had to introduce some context and background in the three previous posts leading up to this one, the last of which dealt with overall organizational challenges. Now the scene is set to discuss the actual Agile model we eventually implemented, which has stood the test of time: four years on, it is still running strong.
Some Background Material
Quick introduction to the world of digital TV
What is a Set-Top-Box?
What is an EPG?
What in the world is Sky TV?
The experiences I'm writing about come from past projects that I personally worked on or touched in my time as a Middleware Development Owner & Middleware Products Manager, all of which are very much in the public domain:
BSkyB replacing OpenTV HDPVR with NDS technology
Fusion is the new NDS Mediahighway to displace OpenTV
BSkyB Launches new HDPVR
Sky Italia replacing OpenTV HDPVR with NDS Mediahighway
Sky Deutschland delivers enhanced VOD powered by NDS Mediahighway
UPC being powered by NDS MediaHighway
Take a look at this video clip:
Yes, this is the background project I was fortunate to work on, which forms the basis for my writing on a large-scale software development project. We were responsible for delivering the core software component (Middleware / Operating System) that enabled the EPG (UI Application) to be developed to offer the Sky Anytime+ User Experience.
On the surface, when interacting with an EPG, the user experience might suggest this is a simple application compared to powerful PC software products like Excel/Word or Adobe Photoshop; surely it can't be a complicated endeavour to create this system? That's not entirely true: beneath the seemingly simple UI is a complicated piece of software and electronics, so much so that a typical project such as this one takes at least three years to complete (if all goes well)...
The project was greenfield development on all fronts for the software vendors employed, especially on our side for the Middleware component. But for the broadcaster it was a massively complicated migration project, because essentially we were replacing the full software stack of a legacy deployed base of 4 million+ STBs, with the option for rollback!!
The project, though massive in size, relied on a core product & architecture team to manage the roadmap and project planning, supported by a very large technical team (software engineers, integrators & testers) that was geographically distributed throughout the world. Having joined the project a year and a half into its development cycle, I then spent the next two years executing and driving through the implementation and delivery to Launch Candidate builds, primarily in the role of Middleware Development Owner. I inherited the Agile founding principles of the Programme, and two full years of practical exposure gave me a deep appreciation for Agile, such that all my subsequent projects re-used and evolved the model based on those principles. I later joined the Product Management team that championed the processes through roadshows, training and ramping up new project managers on the "Project Toolbox" required to deliver all future customer projects...
This post is split into the following sections to make for easy reading:
- Recap Software Architecture
- Agile Philosophy - Call it Structured / Disciplined Agile Delivery
- Concept of Product Backlog and Work Packages
- Digging further into the Product Backlog
- Work Package Software Development Lifecycle
- Hybrid / Structured / Disciplined Agile Delivery
- Large Scale equivalents of Agile / Scrum Meetings
- Iteration Timeline Template (Blueprint for Heartbeat)
- Overall Development Process Model
Recap Software Architecture
The software that runs in these devices is like any other system software, made up of a variety of components. For a recap, see my post on DTV architecture, in particular the sub-topic on STB architecture, which describes the following sub-systems: Hardware Abstraction Layer & Device Drivers, Middleware, EPG UI & Interactive applications.
Recall in preceding posts I described the Middleware sub-system as a collection of 100+ components. It was in this area of the stack that we focused our primary development on, having control of both the lower level hardware abstraction interfaces, as well as the upper layer, application interfaces. At first, we were contracted to just deliver the Middleware, but later on, also took responsibility for the UI application, and later took ownership of the overall STB System Integration. Our key goal was to implement a system that was essentially a Personal Video Recorder (PVR/DVR), with the ability to download content through an IP-connection, using Progressive Download (PDL), through a User Interface / Electronic Programme Guide (EPG) (Click here to see the actual system).
We chose to adopt a number of principles from Agile & Test Driven Development, essentially focusing on the following:
- Iterative Development, incremental functionality
- Continuous review of priorities, adjusting plans as appropriate
- Deliver incremental features and functionality to enable continuous testing, continuous integration & continuous delivery
- Always working code (with acceptable regression tolerances) demonstrable to customer, in regular intervals
- Endeavour to always deliver working software, to plan
- Concept of Sprints a.k.a. Iterations
- Iterations were fixed, with fixed start and end dates, giving a timetable for the entire programme; over 40 iterations in total were executed to get through to a release candidate stage
- Having fixed start and end dates for each iteration was useful because the teams were distributed; as this allowed for these distributed teams to synchronise their activities (their own private backlogs & resource planning) -- it provided structure to not only implement the processes, but also allowed us to monitor, review and implement process improvements
- Strict use of a Product Backlog
- This was a simple tool based on Excel that exposed the full scope of work that had to be done to complete the project
- We used the concept of Work Packages a.k.a. User Stories that uniquely identified a piece of work that had to be done
- Centralised ownership & control of the Product Backlog
- There was no real Product Owner per se - it was a collective effort between the Project Management Team and Architects
Work Packages come in all shapes and forms. The most common type of WP is what we called a "Middleware Development & Integration" WP. A Development/Integration Work Package is a unit of work that implements a well-defined scope of discrete functionality that can be developed and tested in relative isolation (via unit testing) and integrated into the wider system stack for further integration testing (ideally the work package should be one that can be developed and integrated within the course of one iteration timeline). We also had Integration-Only work packages for work that involved integration-only tasks, e.g. pull in the latest third-party component, upgrade to the latest drivers and test, or implement more automated testing for function XYZ.
There were also task-driven work packages that were directives from the programme/product management team: WP to reach zero Showstopper Defects, WP to clean up Branch Coverage requirements as per spec, etc. We also had cross-project work packages: for example, if a common WP for the shared product codebase impacted several customer projects, then the WP was cloned for every customer project to account for the project-specific test & integration scenarios. As the software stack matured and we took on more work, the types of WPs evolved to include the EPG UI development, End-to-End Integration & User Experience. This was a natural evolution of the process, and in keeping with the Agile philosophy of implementing features cross-functionally, across the full slice of the stack, which in the context of a digital TV system is challenging to get right the first time.
Like any software product, a STB Middleware/EPG is bound by the features it needs to implement. Features of course vary: some are simple, most are of medium complexity, whilst a good few are massively complicated. In the Agile world these features are translated into user stories and ranked by complexity, with story points assigned to each user story. Story points, of course, reflect an arbitrary measure of effort (not duration). In the same vein, our features were broken down into one or more work packages; each work package was rated by complexity and estimated at a high level (in man-days) for development/integration. Because we were a massively distributed team, and a large one at that, we required estimates in man-days to help us with resource forecasting down the line. Part of the project management team's responsibility was to assess the load across all development teams, planning ahead to prevent nasty situations later on, i.e. we couldn't afford to get blocked.
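To make the load assessment concrete, here is a minimal sketch of how man-day estimates could be rolled up per team and compared against capacity. It is not the spreadsheet logic we actually used; the function names, team names, work package numbers and capacity figures are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One estimate row: (work package id, team, estimated man-days).
Estimate = Tuple[str, str, float]


def team_load(estimates: Iterable[Estimate]) -> Dict[str, float]:
    """Sum the estimated man-days per team across all work packages."""
    load: Dict[str, float] = defaultdict(float)
    for _wp_id, team, man_days in estimates:
        load[team] += man_days
    return dict(load)


def overloaded(load: Dict[str, float], capacity: Dict[str, float]) -> Dict[str, float]:
    """Return the excess man-days for every team whose load exceeds its capacity.

    A team missing from `capacity` is treated as having no capacity at all.
    """
    return {
        team: days - capacity.get(team, 0.0)
        for team, days in load.items()
        if days > capacity.get(team, 0.0)
    }


# Hypothetical estimates for one iteration (illustrative numbers only).
example_estimates = [
    ("WP101", "tuner", 10),
    ("WP101", "recorder", 5),
    ("WP102", "tuner", 15),
]
example_capacity = {"tuner": 20, "recorder": 10}

example_load = team_load(example_estimates)
example_excess = overloaded(example_load, example_capacity)
```

Any team that shows up in the excess map is a candidate for re-planning before the iteration starts, which is the kind of blockage the forward planning was meant to prevent.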
Coming back to Features broken down into functional work packages: The aim was to break down a feature into functional units of work, called work packages (WPs) that allowed functionality to be incrementally implemented over time, implementing the overall feature. The work package was expected to be fully developed and integrated within the timeline of one iteration. Should the WP turn out to be a complex one that made it impossible to implement within the iteration, we were generally faced with two options: Break the WP down into smaller chunks, creating more work packages; OR spin-off development for that work package on a separate branch until it was completed and ready to deliver back into the main development tree.
The latter option went against the spirit of the agile philosophy that we had in mind, but in extreme cases where the feature set itself was massively complex we allowed for certain deviations. What we usually did, however, was settle for the former option of breaking the work packages down into manageable chunks of work that could fit easily into the iteration, which in most cases meant it took a couple more iterations to see the desired functionality working. So we settled for at least getting the code developed, unit tested, integrated and tested for regressions; as long as no regressions were introduced by the code-only delivery, it was safe. Taking a WP out on a separate stream of development also incurred more project management and coordination overhead, since we tried hard to maintain a centralized heartbeat with coherent and synchronized planning (more on this later).
The notion of having a centralized, prioritized backlog of work packages that could be shared by a development team of 200 people, and managed by a small product management team, can be achieved in reality, but it does come with a massive management overhead. At times it made sense to split the backlog itself into chunks, forking these backlogs off to the regions that would be better suited to self-manage the development/integration in relative isolation, while still maintaining delivery of code back to the main development codebase.
For example, as we were implementing an HDPVR with PDL (High Definition Personal Video Recorder with Progressive Download Video On Demand through Internet), we introduced the notion of a Functional Sub Track (FST), which allowed a feature to be implemented in relative isolation from the main project. The FST had its own set of work packages, which were added to the central product backlog for record keeping, but the FST itself was assigned an owner (the FST Owner) who was responsible and accountable for delivering the FST back to the project in the shortest time.
FSTs were not always approved, since this mode of planning fragmented the project team and incurred some integration costs (merge-back to the main codebase); but when faced with tight timelines, we had to bite a few bullets, accepting a few FSTs as part of the project. We also segmented the initial PDL development stream, merging back to the main project at a suitable point in time. These FSTs/branches were generally subjected to tight scrutiny before approval because they placed upon the common component development/integration teams the additional overhead of supporting what were essentially multiple project streams, run by different project managers and architects.
Remember that the components were common and the code was common, so sometimes the Work Packages would conflict, causing all sorts of inefficiencies. The classic one was in design, where architects would design a solution in isolation, providing conflicting interfaces.
This generated a lot of downtime, costing the project. Hence the philosophy was always to funnel the work packages through a centralized product management and architectural team to ensure consistency of requirements and design, by implementing the product backlog. So what was this Product Backlog then?
The product backlog was essentially the list of work packages for the project. The customer provided requirements in various formats: System Use Cases and Product Use Cases (PUCs). We also had access to existing implementation of the product that was used as the reference stack for comparison. In this company, architects are used to solicit requirements directly from the customer. Some companies employ Business Analysts for requirements gathering and writing up the technical specification. Our architects maintain the full role - solicit requirements from customer (written or spoken word) - convert those requirements into technical requirements that we use internally for the product - design the solution based on these requirements - handover to development team.
Starting with these use cases, which described the requirements at a high level from a user or system perspective, the work packages evolved naturally. Each PUC was mapped into one or more work packages. The PUC would be associated with a high-level feature (which we called a top-level feature); the feature and the PUC were prioritized by the customer, which in turn drove the priority of the associated work packages. The backlog thus emerged as a list of work packages with the attributes mentioned below. The backlog was central to project planning and was used in multiple ways:
- It offered the high level view and representation that was communicated to the customer for reporting purposes. At the start and end of each iteration we had to share the backlog with the customer
- Detailed planning: the backlog was used to capture component estimates at high level for each component impacted by the WP (fairly detailed)
- Gathered data for generating metrics. Measuring the classic velocity and burn down rates was achieved by measuring WPs completed per iteration
- Resource planning: It was a central place to lookup the work remaining for all components, teams and managers
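As a rough illustration of the metrics point above, the sketch below derives velocity (WPs completed per iteration) and a burn-down series from the iteration in which each WP was delivered. The function names and the sample data are hypothetical; the real numbers came out of the Excel backlog.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def velocity_per_iteration(delivered_in: Iterable[Optional[int]]) -> Counter:
    """Count completed WPs per iteration; None means the WP is not delivered yet."""
    return Counter(it for it in delivered_in if it is not None)


def burn_down(total_wps: int, velocity: Counter,
              first_iteration: int, last_iteration: int) -> List[Tuple[int, int]]:
    """Return (iteration, WPs remaining at the end of that iteration) pairs."""
    remaining = total_wps
    series = []
    for iteration in range(first_iteration, last_iteration + 1):
        remaining -= velocity.get(iteration, 0)
        series.append((iteration, remaining))
    return series


# Hypothetical backlog of seven WPs: the iteration each one was delivered in.
sample_delivered = [30, 30, 31, None, 32, 32, 32]

sample_velocity = velocity_per_iteration(sample_delivered)
sample_burn_down = burn_down(len(sample_delivered), sample_velocity, 30, 32)
```

Counting whole work packages keeps the measure simple, at the cost of treating an easy WP and a complex one as equal; weighting by the man-day estimates would be a natural refinement.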
The product backlog was built on Excel (as highlighted in previous post). Personally I think Excel is the best application ever invented, it is so much more than just a spreadsheet (I've used many applications before even the open source versions, nothing comes close to matching the power of Microsoft Excel)…Essentially this is a list of work packages with the following fields:
- WP number/id: Unique number identifying the WP, e.g. WP647
- WP Title: Simple one liner for the title, e.g. WP647 Implement 3D-TV signaling
- WP Description: 3-4 lines describing the nature of the WP: This WP will implement the 3D signaling requirements. 3D channels will be signaled in the broadcast stream via service type in service descriptor. Box should automatically configure the TV to switch to 3D TV mode, signaling the UI to switch to 3D presentation mode.
- Top Level Feature: e.g. 3D TV Support
- Feature Priority: High/Medium/Low
- Scheduling Priority: Urgent - Now / As Planned (slot in backlog as per process) / Future
- Complexity: Easy/Medium/Complex
- WP Type (Development / Integration): e.g. Development, Integration, E2E Testing
- WP Dependencies: Does this WP depend on anything e.g. 3D requires broadcast headend to be setup, or recorded streams
- WPO (Work Package Owner) - Technical Owner responsible for managing and driving this WP to completion
- WPA (Work Package Architect) - Architect assigned to design and scope this WP
- WPLI (Work Package Lead Integrator) - Middleware Integrator responsible for integrating impacted components and proving the WP is ready for System Integration
- WPSI (Work Package System Integrator) - System Integration engineer responsible for end-to-end system testing
- Components Impacted - List of all software components impacted
- Planned Iteration: Which iteration the WP was originally planned for (iterations were numbered from 1 and incremented, e.g. IT32 is the 32nd iteration since the start of the project)
- Actual Iteration Delivered: Tracks which iteration we actually ended up completing the WP (was useful in measuring planned versus actual)
- UXO (User experience Owner) - Person responsible for generating the User Experience requirements/design for the WP
- UI Architect - Architect responsible for the UI component design/implementation
- WP Status - Is the WP approved or not (WP requests come from anywhere, no restrictions - but WPs in the end are subject to review and approval).
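For readers who think in code rather than spreadsheets, a subset of the fields above could be modelled as a simple record. This is only a sketch: the backlog itself lived in Excel, the class and attribute names are my own, and the iteration numbers in the example are made up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FeaturePriority(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


class Complexity(Enum):
    EASY = "Easy"
    MEDIUM = "Medium"
    COMPLEX = "Complex"


@dataclass
class WorkPackage:
    """One row of the product backlog (a subset of the fields listed above)."""
    wp_id: str
    title: str
    top_level_feature: str
    feature_priority: FeaturePriority
    complexity: Complexity
    wp_type: str
    components_impacted: List[str] = field(default_factory=list)
    planned_iteration: Optional[int] = None
    actual_iteration: Optional[int] = None
    approved: bool = False

    def slippage(self) -> Optional[int]:
        """Iterations late (positive) or early (negative); None until delivered."""
        if self.planned_iteration is None or self.actual_iteration is None:
            return None
        return self.actual_iteration - self.planned_iteration


# The WP647 example from the field list; the iteration numbers are hypothetical.
wp647 = WorkPackage(
    wp_id="WP647",
    title="Implement 3D-TV signaling",
    top_level_feature="3D TV Support",
    feature_priority=FeaturePriority.HIGH,
    complexity=Complexity.MEDIUM,
    wp_type="Development",
    planned_iteration=32,
    actual_iteration=33,
    approved=True,
)
```

Keeping both the planned and the actual iteration on the record is what makes the planned-versus-actual comparison a one-line calculation.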
Recall in a previous post I mentioned using SharePoint: We implemented a Backlog Requests workflow that gave anyone a platform for submitting new work package requests to be added to the product backlog. These WP requests were reviewed and either accepted or rejected. Accepted WPs would make it to the product backlog, and would then follow the WP review and planning process as part of the regular backlog management review meetings.
Although the software components were self managed by the respective component owners, component owners had to expose the internal work or improvements being done to the product management team, especially during critical execution stage of the project - component owners did indeed have autonomy to self manage their components as long as the risks were understood and acceptable by the management team. The project drew a hard line between Integration and Development, essentially, the project was owned and controlled by the Integration team.
A typical work package was a development/integration usually instigated by a STB architect. The STB architect would raise a backlog request on SharePoint, providing Title, Description & background into the WP request, along with a list of suspected components impacted by the WP. An architect might be proposed as well as the desired iteration for this functionality to be implemented. The product management team receives this request, adds it to the backlog.
During the backlog review process we found out more about the said WP and, if it was approved, assigned names to the various fields mentioned above, along with a tentative iteration number. We always tried to allow enough time for a WP to be assessed and taken through the motions (more on this later). As part of the backlog review and pre-planning, the backlog for the next iteration would be reviewed at least 8 weeks in advance of the start of that iteration. At this point we generated a draft backlog for the next iteration with the list of WPs and the associated owners; primarily, the WP Architect and the WP Owner assigned to each WP were made clearly visible. A number of pre-planning activities had to take place for each WP before the start of the next iteration. The fundamentals were:
- Exact requirements for the WP had to be solicited and clarified by the architect
- Architect produces a Work Package Design Specification. This is a design document detailing the requirements and component design (at interface level) for all impacted components. The WPDS also highlights the testing impact from architecture perspective
- Architect submits the WPDS for a design review with development component owners and integrators
- Component owners review and feedback comments to architect
- WPDS is reviewed again for finalization with component owners & Integrators
- Work Package owner gets estimates for each components impacted and the test/integration effort
- WPO gets planning dates taking into account the timetable for the iterations (would the work fit into one iteration or not)
- Integrators produce associated test plan and Infrastructure requirements specification for the WP
- WPO feeds back initial draft plan to product management team as input into overall iteration planning
- Component Owners draft APIs required, update component requirements and component design documents from WPDS
Once the work package has been specified, reviewed and planned, it enters the iteration where the following is achieved:
- Component owners publish the APIs within the first week of the iteration to allow dependent components and integrators to start coding against the new APIs
- Integrators write test cases and develop test code in parallel with component development
- WPO coordinates the various development streams - once all components are development complete, the WP enters the integration phase (this is all within the same iteration)
- Integrators run the WP tests and feed defects back to component teams for immediate resolution
- When integrators are happy the WP behaves as specified, the code is delivered and WP is marked as complete, product backlog is then updated as WP is done
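The pre-planning and in-iteration steps above amount to a strictly ordered lifecycle. A minimal sketch of that ordering follows; the stage names are my own shorthand for the steps described, not terms from the actual process documentation.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Ordered work package stages; the names are illustrative shorthand."""
    REQUESTED = 0             # backlog request raised
    APPROVED = 1              # accepted at backlog review, owners assigned
    DESIGN_REVIEWED = 2       # WPDS reviewed with component owners and integrators
    PLANNED = 3               # estimates gathered, plan fed into iteration planning
    APIS_PUBLISHED = 4        # component APIs published early in the iteration
    DEVELOPMENT_COMPLETE = 5  # all impacted components are development complete
    INTEGRATION_TESTED = 6    # integrators confirm the WP behaves as specified
    DELIVERED = 7             # code delivered, backlog updated as done


def advance(current: Stage) -> Stage:
    """Move a work package to the next stage; stages cannot be skipped."""
    if current is Stage.DELIVERED:
        raise ValueError("work package is already delivered")
    return Stage(current + 1)


# Walk one work package through the whole lifecycle.
stage = Stage.REQUESTED
history = [stage]
while stage is not Stage.DELIVERED:
    stage = advance(stage)
    history.append(stage)
```

Using an ordered enum makes the rule explicit that a WP cannot, say, enter integration before its design has been reviewed.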
I left out the details of the actual development/integration methodology, which is a topic for an entirely separate post.
Concept of a Work Package Owner (WPO)
The product backlog consisted of more than 180 work packages (WPs). Each WP could impact between 2 and 14 components. Components must be developed, integrated, tested and delivered back to the main codebase as part of the WP development & delivery lifecycle. So each WP then requires some significant management, more so around managing the technical teams. It called for taking ownership of the WP from inception right through to delivery, pretty much like a mini project in itself as a WP goes through these stages: Requirements, Scope, Planning, Controlling/Tracking, Managing Risks/Issues, Delivery, Closure.
We could not afford to have too many project managers, and WPOs had to be strong technical people who understood the technology and requirements of the WP. So the WPO role was generally assigned to Development Managers, who reported to the Project/Product Management team.
The following list highlights some of the responsibilities placed on a WPO:
- Regularly Review Backlog updates noting down WPs assigned and targeted Iteration delivery
- Track Architectural commitments (Architects for your WPs must complete the WPDS on time)
- Assist PM team in managing architects to produce WPDS
- Manage WPDS Reviews with component and integration teams.
- Raise issues to the PM, identify WP hardware, stream, CDI and environments requirements if not already known
- Manage Component API changes with Component Owners, timeous communications of API changes
- Create draft plans for IPR with inputs from all impacted teams
- For the IRR, review WP Plan, baseline & commit to your WP plans (all open issues closed)
- Ensure Configuration Management processes are followed during WP development
- Manage WP dependencies and co-ordinate WP delivery timetable with component teams, PM & Integration
- Synchronize the early delivery of dependent components to enable other components to continue development
- Provide weekly updates to the PM team on WP progress and raise daily issues if needed
- Work closely with integration team in managing resulting issues, resource bottlenecks, etc.
- Actively monitor and prevent slippage of WP, make recovery decisions/proposals to PM team
- Maintain project's criteria for quality - minimum regression introduced by WP
- Feedback process improvements to PMs, contribute to retrospectives & lessons learnt
When I explain this process to people who have just started experimenting with Agile, their initial reaction is that our methodology was heavy-handed, waterfall and bordered on being dictatorial which flies in the face of what Agile promotes as lightweight, flexible & adaptable, with autonomy. IMHO when it comes to executing a large-scale software development project, with a team of 200 people geographically distributed across the world, you try your best to maintain the agile spirit; but you have to take a more structured and controlled approach.
We were flexible and adaptable: We had regular backlog review sessions where we constantly re-prioritized based on the latest feedback from the customer. Autonomy was controlled through a strong management team. Individual component teams were free to adopt whatever agile tools they chose, and they could also have their own localized component backlogs, as long as the code they introduced did not cause regressions and any non-project work was approved by senior management. Projects were of course given preference because R&D cannot happen without a steady stream of revenue: focus on delivery first, then time will be available for follow-on R&D.
I have written previously about the misconceptions of Agile (see here and here). This mentality comes from lack of maturity in applying agile; as well as lack of experience in the field of software engineering. Agile is not an excuse to cut all corners. Agreed there is a preference for light documentation, preference for working code, etc. but nowhere does Agile mention not doing requirements, not doing design, not doing continuous integration using code-level unit testing, etc.
The process we implemented ensured there was continuous flow within the iteration: in parallel with development/integration in the current iteration, the teams were forward planning the next iteration weeks in advance. At the start of the iteration, teams would have all the information they needed to hit the ground running. Two weeks before the start of the next iteration, the design specifications for all the work packages would be nearing completion, and the teams would have reviewed and estimated them. In the second week prior to the start of the next iteration, we held a pre-planning review with the entire team (represented through the work package owners and component team leads) that lasted 3-4 hours, where we'd review the original backlog and assess how far we had progressed. If during this pre-planning session it emerged that the workload was too high, we'd agree to drop a few work packages to align ourselves more realistically with the existing velocity at the time. The week before the start of the next iteration, we'd hold another 3-4 hour planning review session where we'd baseline the plan going into the next iteration.
Baselining the plan was essential, and is a classic project management method to ensure commitment and buy-in, locking in the plan. The baselined plan is tracked daily throughout the iteration period, just like the standard Scrum practice of daily stand-up, except that we were much more demanding in that no slippages should occur (because of our commitment to the customer and the timelines for delivery were critical). No new work packages were added to the current iteration unless the customer really shouted (customer played role of Agile Product Owner) which was a rarity in itself. At the end of every iteration we'd produce a build that was demonstrable as functionally complete with no major regression, deliver a working test harness to execute the functionality as well as a demo application showing the new features exposed. As the project matured, we evolved to include the full stack functionality with a real EPG application.
We were doing incremental development. We did continuous integration. We relied on automation heavily. We implemented fixed iteration duration or time boxes. We owned a product backlog. We had pre-planning sessions. We continuously reviewed and improved our processes…So how can this not be labelled as Agile?? We did not have Scrum Masters, we used Development Managers as Work Package Owners, Project Managers for the planning and tracking the baseline iteration plans. We used the RAG status to convey current progress of iteration. We implemented risk management daily during the iteration. We produced Gantt looking charts and plans to better visualize the programme and iteration timeline. We controlled changes with a strict Change Request system. So…weren't we just implementing strong project governance as well?
In a nutshell my message is that for a large-scale software development project with a team of 200 people geographically distributed throughout the world, Agile & sound Project Governance can be implemented simultaneously, they are not mutually exclusive. Project Governance is often incorrectly associated with the Waterfall development model…what people forget though is that within every Agile sprint, there must be a bit of waterfall implemented.
To coordinate and manage the processes with a large distributed team, we had to implement some regular meetings that were core to maintaining the stability of the processes. In this section I'll briefly highlight the meetings of relevance - in the end the entire team were so tuned to the processes that they sometimes reminded the project management team about upcoming meetings. As we proved the processes on this first project, later projects adopted the same - so much so that the product management team have created a toolbox for project managers to adopt when implementing a new customer project:
- Backlog Reviews: Done regularly, at least twice a month
- Iteration Planning Review (IPR): Two weeks before the start of the next iteration, a planning sync point to review the progress of new WPs to date, aligning the backlog for the next iteration, tabling any actions or open issues to be resolved.
- Iteration Readiness Review (IRR): One week before the start of the next iteration, this is the final planning session to baseline the iteration plan for the following iteration, closing actions/issues raised during the IPR.
- Iteration Progress & Tracking Update: Weekly meeting with WPOs and tech leads to track the current iteration's progress against the baseline plan. Also an opportunity to review risks and issues.
- Daily Top Issues Call: Similar to daily stand-up but at the macro-level where we focus on the top issues across the board for the project including top blocking defects, work package progress and top regressions
- Iteration Regressions Review: Review and track the progress of recent regressions introduced with latest code changes. Assign owners to issues and plan for fixes such that regressions are cleared up before the end of the iteration
- Daily Defect Scrub & Reviews: Different to regression review, this was a daily meeting with the customer to review and assign new showstopper defects that had to be addressed in next release (this was during the integration/test phase when development was largely complete)
- Iteration Defect Review: Two weeks before the start of next iteration, initial list of defects to be addressed in next release are reviewed with the customer (transition from development to integration/launch phase). Some people refer to this as "Bill of Materials" or "BoM"
- Fortnightly Product Steering Group Review: The powers-that-be held this session to review progress to date on the product, including customer projects. Processes were reviewed here. Big changes or enhancements the team proposed had to be vetted and approved by the steering group, the only platform to introduce change in product management processes (e.g. the FST process as described earlier, branching philosophy, project priority and resource conflicts, etc.)
- Weekly Code Quality Review: Code quality was taken seriously - this was a weekly review to catch any regressions against the agreed coding standards (e.g. MISRA warnings, incomplete unit tests). Again, results were fed back into the current iteration for component teams to act on
- Retrospectives / Lessons Learnt: Were not done as regularly as they should have been, but on a needs and case-by-case basis. Because of the size of the team and nature of project, retrospectives highlighted the issues, management reviewed and addressed the burning ones ("low hanging fruit") but generally took in requests for process changes with caution (since processes take time to mature it's not recommended to introduce further change at the first sign of trouble)
- Work Package Design Specification Review & Estimates: Architect holds technical review of WPDS with impacted component tech leads; WP owner manages the planning/estimation of the WP. Aim is that all WPDS reviews are completed two weeks in advance of next iteration
- Monthly Integration Forums: The integration team was split across multiple regions as well, all executing the same integration/test/delivery model. Regular comms allowed ideas to be exchanged, issues raised, etc - feeding back to overall improvement processes
Iteration Planning Template
At first, creating this template for each iteration was rather time consuming, as it was hand-drawn in Visio. This was later automated, still using Visio: enter the start date of the iteration, and all milestone dates are worked out automatically.
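The same idea, working every milestone out from the iteration start date, can be sketched in a few lines of code. This is not the Visio automation itself: the milestone offsets are the ones described in this post (draft backlog review 8 weeks ahead, IPR and WPDS reviews two weeks ahead, IRR one week ahead, APIs published within the first week), while the function name, the dictionary keys and the iteration length used in the example are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Dict


def iteration_milestones(start: date, length_weeks: int) -> Dict[str, date]:
    """Derive the planning milestones for one iteration from its start date.

    `length_weeks` is an input because the iteration length is not stated
    in this post; pass whatever fixed length the programme uses.
    """
    week = timedelta(weeks=1)
    return {
        "draft_backlog_review": start - 8 * week,
        "wpds_reviews_complete": start - 2 * week,
        "iteration_planning_review": start - 2 * week,
        "iteration_readiness_review": start - 1 * week,
        "iteration_start": start,
        "component_apis_published_by": start + 1 * week,
        "iteration_end": start + length_weeks * week,
    }


# Hypothetical iteration starting on Monday 7 March 2011, assumed 4 weeks long.
milestones = iteration_milestones(date(2011, 3, 7), length_weeks=4)
for name, when in milestones.items():
    print(f"{name:30s} {when.isoformat()}")
```

Because every date is an offset from a single input, shifting the iteration start moves the whole timetable consistently, which is what made the template cheap to regenerate.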
In this section I'll attempt to illustrate visually the various streams of activity that happened to make the iterations successful. By now I hope you get the idea that this project was pretty much like coordinating a symphony orchestra: synchronous, in phase, everything integrating timeously. This is still TODO; it will take me some time to draw this visually, so I've released this post as a beta for now. Stay tuned... Read Part 4: Development/Integration Processes...