Wednesday 14 June 2023

A blast from the past: my experience building a large-scale tech platform

From 2003 to 2011, I worked for a pure technology service provider, NDS (acquired by Cisco in 2012, later becoming Synamedia), considered at the time the world leader in end-to-end digital TV software systems. As an engineer, I was fortunate to experience every major area of platform development for this complex ecosystem. Later, as a software manager, I owned the software delivery of a core piece of the stack known as "middleware" for NDS's primary anchor customer, BSkyB/Sky (the Darwin programme), and later still owned the full-stack delivery of NDS's flagship Mediahighway Fusion/Unity product. This experience marked my entry into very complex, large-scale technology delivery initiatives. Even today, thirteen years later, as I work with the world's largest cloud provider, Amazon AWS, building out its enterprise cloud support systems (AWS Support Center / technical contact systems), Fusion still takes the prize for the most intense professional experience: learning and growth, technical complexity, risk and high-stakes projects. So I find myself digging deep into memory to recall this work, because it's funny that 13 years on I'm encountering the same engineering management topics even though it's supposed to be a different domain. It turns out "software is just software"!

NDS had captured almost every top-tier PayTV operator around the globe at the time: Sky, DirecTV, UPC, Sky Italia, Sky Deutschland, Foxtel, Sky LA, Yes, Bharti, etc. NDS was best known for its conditional access product, a video content protection system called NDS Videoguard. However, NDS offered more than just security: customers could buy into a fully vertically integrated ecosystem (think the "Apple" ecosystem, for PayTV operators). Whilst digital TV was built on open standards and interoperability, most customers limited their integration points. So when they opted for NDS as their security provider, they also had the option of integrating all other services - from broadcast backend services in the headend, to consumer device hardware development, to software service integration with chipset vendors. The consumer device software was known as TV middleware; at the time, the main players were NDS Mediahighway, OpenTV and TiVo. NDS was known for convincing customers to migrate to NDS Mediahighway, and its technology migration programs were demanding, complex and executed flawlessly. As an engineer, I contributed software that replaced TiVo - an overnight win for 40 million devices. Later, as software delivery manager for the Sky Darwin migration project, we replaced OpenTV software, almost obliterating its presence from Sky save for a few ancient, ageing hardware profiles.

NDS, with an increasing number of customers using its security, middleware and application services, couldn't afford to scale out with dedicated engineering teams for each custom build. A platform strategy was needed: consolidate the best software from across the globe (US, UK, India, Israel, France) into a new shared technology stack offering flexible customisation and tailoring for any customer profile (from Tier-1 customers like Sky, with advanced applications, to Tier-3/4 customers in territories just starting out with basic digital TV), using a shared engineering resource pool and an extensible configuration engine for producing tailored custom releases. So was born NDS Mediahighway Fusion.

The flagship customer for Fusion was Sky, which went live in 2010, replacing up to ten variants of its consumer device software with new Fusion components and Sky's own custom-developed consumer application, the "EPG" - known then as the "Orchid EPG". Fusion provided an SDK/API for customers to develop their own primary applications, along with an interactive HTML engine that allowed PayTV operators to add mini apps to their devices, like games and weather apps. With Sky as the anchor customer, Fusion had proved itself in the market and was ready to onboard new customers like Sky Italia, UPC, Foxtel and Yes. Post the Darwin launch, I took the lead for building the new platform vision, called the Fusion Snowflake EPG, through project Sunrise - birthing the platform that would create tailorable configurations for any customer, maximising reuse and minimising customisation while still allowing for a selection of custom user experiences.

Why am I claiming Fusion as large-scale (even in 2023, 13 years later)?

I write this in 2023, after spending 2.5 years with Amazon AWS. I am part of the group that builds AWS Support Center and related contact center services. We are a team of under 100 people, deemed large-scale and building complex systems. Yet, if I have to be brutally honest with myself, I'm only mildly impressed by my exposure to date, because my current work pales in comparison to my work on Fusion, 13 years ago. Yes, I know it's a different domain, with the different paradigm and culture of Amazon's two-pizza team model for software product ownership (which I actually find quite cool). Still, I'm finding it hard to rationalise my move to AWS almost 2.5 years on. Have I gone too far backwards? Am I living too much in the past and not ready to view things from a new perspective? What am I not seeing? (Topics for another post.) So whilst I've definitely adapted my mental models since joining Amazon, I really can't ignore some software engineering truths, which is the reason for bringing up the past now.

In 2012, I wrote the first story about Fusion, introducing the term LSSDP, which I coined to mean Large-Scale Software Development Project. I also dived deep, writing lengthy white papers about the product and engineering management processes.
Fast forward to 2023: now using my Amazon AWS experience as a lens for defining a large-scale initiative, and indirectly checking engineering manager role guidelines for large-scale work:
  • Business Impact - Fusion started with a $75 million investment and later a joint venture with the flagship customer, Sky. The entire company pivoted to focus on Fusion as its next-generation software platform, with up to 3000 engineers worldwide working on multiple streams, some strategic foundational streams kicked off at least 2 years before the mainstream program. In my role as software delivery owner for the Sky Darwin project, it was critical that the project delivered successfully and flawlessly, as it involved seamlessly migrating software in 10 million people's homes (their living room TVs) with no rollback. The end customer (the person sitting at home watching TV) would notice very little change to their experience. Overall, Fusion software components were delivered into multiple middleware stacks; by the time I departed NDS in 2011, our software was running daily in excess of 60 million homes globally.
  • Scope and Size - Fusion introduced a new paradigm for the TV software ecosystem, end-to-end, including broadcast headend components as well as the embedded software architecture. The stack was open, based on Linux/POSIX, and a complete departure from the first decade of TV software operating systems - this was before the advent of Android TV or fully open source middleware. Fusion's product backlog captured over 2000 epics in the form of work packages, cutting across multiple customer needs in parallel. The scope included all layers of the device software stack: chipset drivers, hardware abstraction layer, Linux kernel, Linux abstraction, middleware services, application SDK/APIs, and multiple frontend application engine proxies for C / C++ / Java / HTML / Flash applications. Take a look at the software architecture diagram - it is multi-layered, spanning multiple service teams. Another point on scope: we managed initiatives or epics in the form of work packages (WPs), and a single WP could impact up to 25 service teams, see here.
  • Team Size & Geographical Distribution - Fusion was a globally distributed initiative, with development sites for core software services in the UK (Southampton & London), France (Paris), Israel (Jerusalem) and India (Bangalore), and hardware / CDI / chipset low-level driver teams out of Korea. We also had sales and account teams in the US, Denmark and Australia. Overall, around 2500 engineers were associated with the Fusion initiative. On my Darwin program, we had around 350 people split across the sites for software, with the customer, Sky, contributing about 200 more. On the management side, we had about 50 managers, principal engineers and solution architects, and we reported into at least 5 senior executives from the C-suite. I ran a top-issues call daily for 2 straight years, going through risks, issues and delivery timelines with all the senior management stakeholders. I was only 32 years old (and a father of 3) at the time - truly grateful for the experience. With 161+ services, Fusion had 3 chief architects reporting to the CEO and 20 global architects supporting the stack: France owned 35 services under 5 software managers; India owned 23 services under 6 software managers; Israel owned 33 services under 12 software managers; the UK owned 58 services under 7 software managers. My role as Software Delivery Owner meant controlling all parts of the delivery cycle: from backlog management to release planning, from architecture design assurance to integration testing, from customer defect triaging to field-trialling software. I ran everything - and as such, I had a deep technical understanding of all the software and services in the stack. Back then I was deeply technical (strange, after 13 years I'm now resisting the urge to get back into the weeds).
  • Duration - Fusion took five years to make its first major customer delivery, Darwin, with Sky coming on board as the flagship customer 2.5 years into the strategic program. I joined Darwin in 2008 and delivered the first release - the pivotal migration that replaced software on 10 million consumer devices, across at least 8 different hardware profiles - in June 2010. Shortly after Sky UK, we launched Sky Italia, then won a bid for UPC/Horizon (the pan-European deployment for Liberty Global's next-gen box) on a 9-month timeline!
  • Dependencies - There were many moving parts in the new architecture because it was a big departure from the past (refer to the software architecture). On the customer program side, the picture is not so different to the one shown for DStv Explora here. In the case of Sky, we had to replace software seamlessly, with zero rollback, on a number of different devices that were up to 10 years old, ran different software and operating systems, and had different driver behaviours and low-level bootloader code (firmware). NDS created the CDI (Common Driver Interface), the next-gen version of the Mediahighway HDK (analogous to OpenTV's HPK or Irdeto Middleware's HAL). Not only did bootloader stages need modifications, but device manufacturers had to upgrade to CDI for compliance - hardware readiness was a big dependency. On the software side, Sky had 50+ OpenTV applications written by a number of independent 3rd-party app developers that would have to be ported only as a last resort, with first prize being a seamless replacement of the OpenTV engine (which we accomplished) without changes to the apps. We had multiple vendors doing systems integration as well. Middleware and application features could not be tested without the underlying services being ready - so, as per the software stack, dependencies increased the higher a team sat in the stack. The customer also owned its own application development team, which often made last-minute change requests for new features that were never part of the original scope.
  • Risk - Technical risks abounded, starting with the obvious: is it even possible to migrate millions of customers seamlessly without rolling back? The timeline was compressed and the pressure was immense. Requirements changed as the project progressed. The technology risk was, again, a new platform and new design, not proven in the field; there would be performance and stability issues to address. Lots of money was invested, with reputational risk on top. Engineers were working overtime, long hours. There were lots of parallel workstreams to coordinate. There was also politics at play, because Fusion was going to end other middleware project pipelines - we had to ensure other projects weren't unnecessarily extending "legacy" middlewares, creating widening gaps for Fusion to close on parity.
  • Budget - Both internally and externally, Fusion and Darwin became the flagship, CEO-driven project until it delivered. I suspect up to $500m must have been spent, if not more.
  • Stakeholders - The stakeholders were diverse. Thankfully at the time I didn't have direct access to the end customer; although I ran the issues/risks call, sent emails and communicated progress reports, my stakeholders were the CEO, CTO, SVPs, VPs and Directors. The Fusion leadership team, though, faced C-suite pressure from a number of customers, because Fusion was delaying some of their plans.
  • Integration Points & Complexity - Too many to list: according to the software architecture, every major layer has an integration point. With 100+ core services, each exposing an API that any upstream or sidestream service could consume, the integration burden was heavy. This is why, in 2010, we had invested heavily in full CI/CD with robust software quality checks based on MISRA. We also had to build a simulation environment to work ahead of the hardware: the Fusion OS / Linux kernel, along with the platform drivers, were ported to Windows and Linux desktop environments so that engineers could work completely independently of hardware. The simulator itself became a target device for CI/CD builds! Even in 2023 terms, the focus on engineering excellence was bar-raising, as Amazon would say.
  • Change Management - Fusion resulted in a complete overhaul of the global engineering team's org design, and the customer engagement model was entirely new (the customer had access to the full source code, documentation, test harness, etc.). Our release process included delivery of the entire environment, such that the customer could recreate the build, run tests, validate our test results, compare releases, and so on. Who even does this today? This change management and audit process took escrow to another level. In addition, we implemented strict change control mechanisms to manage scope, adding a cost element: charging for every new change request before committing development resources.
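The simulator-as-CI-target idea under Integration Points can be sketched in a few lines. This is a minimal illustrative sketch, not the actual Fusion tooling: the target names, the `plan_pipeline` function and the hardware-availability flag are all hypothetical, but they capture the point that desktop simulator targets let engineers (and pipelines) keep building even when hardware isn't available.

```python
# Illustrative sketch only: a CI build matrix that treats the desktop
# simulator as a first-class target alongside real hardware profiles.
# All names here are hypothetical, not the actual Fusion tooling.

BUILD_TARGETS = [
    {"name": "stb-hd-chipA", "kind": "hardware"},
    {"name": "stb-sd-chipB", "kind": "hardware"},
    {"name": "simulator-linux-x86", "kind": "simulator"},    # no board needed
    {"name": "simulator-windows-x86", "kind": "simulator"},  # no board needed
]

def plan_pipeline(targets, hardware_available):
    """Return the targets a CI run can build right now.

    Simulator targets always run, so work continues ahead of the
    hardware; hardware targets run only when rigs are available.
    """
    return [
        t["name"]
        for t in targets
        if t["kind"] == "simulator" or hardware_available
    ]

# With no boards on the rack, the simulator targets still build:
print(plan_pipeline(BUILD_TARGETS, hardware_available=False))
```

The design choice this illustrates: by making the simulator just another row in the build matrix, the same pipeline definition covers both desktop and device builds, and hardware scarcity stops being a blocker for most of the stack.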

Story-telling through slides, 13 years later...

As I reflect on this beautifully large-scale project, I marvel at how advanced we were at that time, in 2010: building a highly complex software stack, with a geographically distributed team across 5 countries, using mechanisms like full CI/CD that even today, in 2023, companies struggle with. In my world of AWS, we build distributed services for multiple AWS regions. A simple mental model takes me back to TV software - yes, we too built multi-region releases. Our software ran on multiple hardware device configurations, some less advanced, some more - different chipset vendors meant releases differed according to platform constraints. This isn't vastly different to deploying cloud applications in constrained regions. We ran full CI/CD pipelines, deploying to multiple geographic regions under the satellite footprints of broadcasters. Our tests ran daily, overnight and weekly - stress, performance and load testing, continuously, in-region. As soon as tests failed or regressed, depending on the level of testing, alarms would go off, stopping work-in-progress to fix the build - deployments frozen until the main pipelines were green again! We did this back in 2010. Today, in 2023, my teams in AWS do the same, albeit with more advanced automation, logging and tooling infrastructure. But the essence is the same: software quality is software quality, and software engineering principles remain consistent no matter the domain. As for advanced project, product and platform management - the mechanisms that embodied the Fusion platform require a substantial lift and discipline. In my world of AWS today (building bespoke enterprise tooling for AWS Support businesses), such a software product engineering factory mentality might not work - unless there is serious intent on adopting more formal methods of product management, like we did for Fusion.
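The "alarms go off, deployments frozen until the pipelines are green" rule above is simple enough to state as code. A minimal sketch, with hypothetical names (this is not any real NDS or AWS tooling, just the logic of the gate):

```python
# Illustrative sketch of the "freeze until green" rule: if any stage of
# the main pipeline regresses, deployments stop and the first job is
# fixing the build. Names are hypothetical.

from dataclasses import dataclass

@dataclass
class StageResult:
    name: str
    passed: bool

def pipeline_is_green(results):
    """The main pipeline is green only if every stage passed."""
    return all(r.passed for r in results)

def may_deploy(results):
    """Deployments are frozen whenever the pipeline is red."""
    if not pipeline_is_green(results):
        failed = [r.name for r in results if not r.passed]
        print(f"ALARM: pipeline red, deployments frozen. Fix first: {failed}")
        return False
    return True

# A nightly run with a regression in the overnight stress stage:
nightly = [
    StageResult("unit", True),
    StageResult("integration", True),
    StageResult("overnight-stress", False),  # regression detected
]
may_deploy(nightly)  # alarms, returns False - nothing ships
```

Whether it's a 2010 set-top-box stack or a 2023 multi-region cloud service, the gate is the same shape; only the automation around it has improved.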
