Mo Khan's Outlet!: On Cloud Transformation, CTO reflections on scaling tech & people - Part 1

In a series of posts this year, I plan to write about how I led the transformation of a technology platform and engineering team, and delivered results: scaling to 10X+ growth on KPIs such as user-and-device growth and user engagement; enhancing personalisation & content discovery; reducing platform instability by increasing availability from 97% to 99%; creating a 20X+ reduction in core operating costs (saving R100m+); and simultaneously building a scalable leadership team to take over. All in 3.5 years.

This draws on my professional work experience from March 2017 - October 2020, when I spent my time as CTO (Chief Technology Officer) of an OVP (Online Video Platform) for Africa's largest VE (Video Entertainment) provider. In this short period my team delivered end-to-end transformation (not just software development) that set up IT/Engineering to scale for future growth. I also left a scalable leadership pipeline in place, along with a technology roadmap delivery plan & sustainable processes for at least another 2 years, which allowed me to comfortably transition to my next role outside of video systems. Since my departure, I have remained in contact with the team, who continue to thank me not only for the roadmap but also for the opportunities I helped create to grow their own careers as leaders who are themselves on track to become CTOs & CIOs.

Context

The business grew increasingly concerned about its online video platform's ability to scale to the increased demand forecast as more customers switched from traditional broadcast satellite-TV viewing to on-demand, on-the-go viewing through streaming video over the internet. With incumbents like Netflix & Amazon Prime Video and others entering the African territory, we also needed a reliable internet-ready TV product of the kind customers have come to take for granted. Until then, the video platform, built largely in-house by a local engineering team, had an active user base consisting mainly of early adopters; it was an internal start-up project, fledgling at best.

This platform and product were still in their infancy, not yet ready for exponential internet growth, which could happen at any time. The team had operated on the shoestring budget of a constrained start-up for a number of years. The engineering team was also spread quite thin, working on multiple, incoherent projects, products and services not specifically focused on internet video. Rather, we were the "online people" who did everything: hosting websites and various content management systems, writing media apps and running operations for digital marketing sites across the African continent. In short, I owned a digital IT shop that was a multi-armed, multi-headed hydra that needed taming. Such constraints, whilst bred out of necessity, unfortunately meant the bigger picture was ignored: the long-term strategic investments the platform needed to scale for future growth were not made. The platform barely supported its early adopters, and consistent availability was never guaranteed. Customer satisfaction scores were very low, below 4 (<40%). Net Promoter Scores (NPS) were almost non-existent. Outages due to platform instability were the norm, with the on-call load on support engineers increasing as the number of users began to grow.

The business, unaware of the true state of the platform, had nonetheless planned increased marketing and awareness campaigns for its internet streaming product. Large events like the FIFA Soccer World Cup (2018), Olympics (2018), Cricket World Cup (2019) and Rugby World Cup (2019) sparked much concern about the platform's ability to scale for increased load. Other events and popular video content, like UEFA, the Premier League and Game of Thrones, were expected to bring increased traffic to the platform. Apart from primary content drivers, the fear of not making a noise on the streaming side was high - we needed to take this fledgling product & platform and make it mainstream. Marketing increased. Along came a decent technology budget, assigned to me to help turn around the platform & deliver a recovery. I could not pass up this opportunity to test my skills in technology, engineering, strategy, delivery and leadership...what a journey it was!

The ask: build and scale a video streaming platform (live broadcast and video-on-demand, like Amazon Prime Video or Hulu) to work across 50+ countries on the African continent, with localisation. Build, stabilise, replace, buy, partner - do what is necessary, but we don't have time to wait a year for new R&D or migration, as we're going to make a noise in marketing, so the platform had better be available. At the same time, build complementary services for "internet connected set top boxes" in addition to the pure online play.

The current reality, i.e. the existing state of affairs in April 2017 when I took over, did not bode well for the team or the technical platform. In addition, the organisation was half-way through implementing a business-wide transformation program, "Future Fit", radically changing its operating model end-to-end. This led to changes in structure and organisational operating models: competency and functional area "consolidation" that led to layoffs due to optimisations. Managing big people changes whilst planning to turn around and deliver a technical strategy was no easy feat. Eighteen months into the operating model transformation, our team was again reorganised and reconsolidated as the business decided to IPO, rearranging its operating model yet again. On top of that, I had to wade through four changes in executive leadership, retain stability in people and still deliver results. Changing executive leaders four times meant I had to deal with 4 new bosses, each CEO having their own view of technology strategy (more on this later)...

Our Tech Stack that we evolved to

...and a sneak peek under the hood at the infrastructure/platform services:

Technical & Leadership Challenges

These were the challenges & obstacles I had to overcome as CTO:

How do I bring structure into a chaotic engineering team, stabilise the platform and simultaneously scale to 10X++ MAUs, double user engagement, with a small team and still deliver an experience comparable to Netflix?
How do I do this without micromanaging my team, while building their self-confidence in their engineering ability, despite their being considered not “world-class” by sceptics (underdogs - I hate it when (international) folks doubt local African engineering talent - rant for another day)?
How do I change slow corporate processes to innovate fast and take chances on migrating to cloud - without having all the answers yet or ticking all the due-diligence boxes?
How do I maintain customer excellence when the stakes are high and, when the chips are down, gracefully deal with a crisis (almost every weekend)?
How do I build credibility back when the team’s reputation has taken a hit?
How do I maintain some level of platform stability where known bottlenecks exist that will cause an outage under high load or peak demand, like a Black Friday event?
How do I decide whether the existing platform needs a full rewrite, should be replaced by COTS (commercial off-the-shelf) software, or is safe to continue with through technical-debt re-engineering?
How do you transform a "POC" prototype into an industrial-strength video platform that can stand up to the giants?
How do I transform the mindset of an engineering team and change the culture to a high-performing, consciously customer-focused one with an operations/security (DevSecOps) focus?
How do I transition the org to a DevOps model by changing mindsets, introducing habits and behaviours, then tooling and ultimately transforming team process and composition?
How do I create a technology vision & long term view that's credible & a practical roadmap for delivery?
How do I repair dysfunctional relationships between product and engineering teams where trust is low, cynicism is high and the partnership is unhealthy?
How do I manage business and marketing operations, building sincere relationships to work around the constraints and limits of the platform (shared context)?
How do I communicate the true state-of-the-platform to stakeholders without causing panic and have a credible action plan that instils confidence?
How do I build a leadership team when headcount is frozen and we still need to deliver? How do I remove doubt from senior stakeholders about engineering skillset and competency of managers?
How do I convince people about taking a chance on diverse individuals who may not be experienced or "qualified enough"?
How do I handle political conflict, fears and uncertainty when merging two tech platforms, create a vision for the future and manage communications with non-technical stakeholders about "why it won't work" scenarios?
How do I leave a leadership succession pipeline in place, given that my goal was to leave after 3 years anyway?

Topics Covered

In a series of posts starting August 2021, my goal is to share my strategy, tactics and lessons learnt on leading this turnaround as well as core topics on cloud transformation. As I post each article, I will update these bullet points with links:

Take time to understand current reality & publish to-be aspirations - get an independent opinion by way of a system audit & architecture review
How I designed the technical org with future organic transformations in mind. There's the ideal and then there's "play the cards you're dealt", but never lose sight of the bigger picture.
Being bold with data centre optimisations - down with data centres, embrace peering!
First things first, stabilise your infrastructure and networking, build in redundant routing
Sometimes you need to take a step backwards to move two steps forwards - enhance platform software for multiple datacentres. First principles matter, a lot.
Stabilize hosting infrastructure and step change your virtualisation model - move to containers
Embrace cloud content delivery networking (CDNs) - don't build your own, partner. Introduce multi-CDN strategy, experiment, iterate.
Have the leadership will to experiment, fail and re-invent - leverage experimentation with partners. Change the paradigm (cache-at-edge, build safe modes, borrow metaphors from other domains).
Choose your cloud partners wisely but don't analyse to death. Move fast, choose an appropriate cloud platform that aligns to your culture.
Factors I consider important for partnerships
Going cloud native - out with the old, in with the new, skip lift-and-shift altogether - content discovery & personalisation
Keeping the bean counters happy - introduce econometrics - how I introduced a platform economics model for Total Cost of Ownership, cost per user and cost of networking per user to prove the logic of transformation. Speak the language of finance to finance people. A CTO without financial intelligence is limited.
How to improve technical operations, introducing habits, discipline, accountability & ownership.
Nurturing a culture of automated testing, continuous testing on production, load & performance testing, chaos testing, failover & disaster recovery testing. Testing is not a job just for the QA department.
Managing large-scale events, scaling for peak usage & crisis management when things go wrong
Leveraging off-the-shelf capabilities - don't reinvent the wheel unless you have no other option.
Willingness to learn from mistakes, taking responsibility for outages & setting direction on way forward - leadership.
Know your metrics - how to go from zero to 100+ (and perhaps overkill) in creating a command centre for monitoring issues in real-time. Why every technology platform needs its own Platform Intelligence Portal outside of Business Intelligence reports or a Data Warehouse team. Engineering teams must own their platform metrics, don't pass responsibility over-the-wall.
How to build a succession pipeline such that the CTO position you leave behind is ready to be filled by your number one.
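
The platform economics idea in the list above (Total Cost of Ownership, cost per user, cost of networking per user) comes down to simple arithmetic, sketched below. This is a minimal illustration and not the actual model used on the platform: the class, the cost buckets, the function name and every figure are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class MonthlyPlatformCosts:
    """Hypothetical monthly cost buckets, all in the same currency."""
    hosting: float      # data centre and/or cloud compute
    networking: float   # CDN, peering and transit
    licences: float     # third-party software and services
    people: float       # engineering and operations staff

    def total_cost_of_ownership(self) -> float:
        """Sum of all cost buckets for the month."""
        return self.hosting + self.networking + self.licences + self.people


def cost_per_user(cost: float, monthly_active_users: int) -> float:
    """Spread a cost evenly over the monthly active users (MAUs)."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return cost / monthly_active_users


# Placeholder numbers, for illustration only (not real platform data).
before = MonthlyPlatformCosts(hosting=500_000, networking=300_000,
                              licences=100_000, people=600_000)
after = MonthlyPlatformCosts(hosting=200_000, networking=150_000,
                             licences=150_000, people=600_000)

for label, costs, maus in (("before", before, 100_000), ("after", after, 1_000_000)):
    tco = costs.total_cost_of_ownership()
    print(f"{label}: TCO per user = {cost_per_user(tco, maus):.2f}, "
          f"networking per user = {cost_per_user(costs.networking, maus):.2f}")
```

Tracking the per-user figures over time, as opposed to the absolute spend, is what lets a growing platform show finance that costs are falling even while the total bill and the user base both change.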

Results Delivered

Here's a dump of the results I delivered in my tenure as CTO - that I use as talking points in my CV, interview loops & coaching other IT managers:

Inherited technology stack with weak foundational architecture, massive technical debt causing instability
Successfully turned around a previously distressed, dysfunctional group, bringing clarity of focus & team cohesion
Redesigned the organisational structure & technical platform strategy aligned to future business growth
Introduced new leadership roles, set the vision, mission & objectives for the division thus bringing order to chaos
Improved relationships and working agreements with core customers by agreeing SLAs & metrics for performance
Improved platform availability from 97-98% (roughly 7-11 days of outage a year) to 99.8-99.9% (roughly 8.76-17.5 hours)
Scaled the platform to 10X+ on monthly active users, total active users/devices
Increased engagement at least 2X year-on-year
Increased network throughput 10X breaking internet streaming records for Africa, up to 800Gbps on one event
Expanded application device footprint 4X (introduced smart TVs, game consoles, new set top boxes, IP-only STB)
Enhanced project and release management processes and ways-of-working, resulting in on-time completion of projects
Transitioned teams from textbook agile scrum methods to more fluid, generalised project execution tracks improving flexibility
Introduced Google’s Software Engineer in Testing capability, driving automation & completely removing manual testing
Instilled a sense-of-ownership and accountability for monitoring operations through upskilling & training programs
Celebrated people by winning group-worldwide recognition awards: Innovation in AI/ML, Test automation & App development
Improved overall people engagement in the division, scoring the highest management metric as measured by the OfficeVibe tool
Created architectural platform vision for the new technology stack thus managing expectations of journey timelines
Delivered on key project & product roadmap items meeting business objectives for all fiscal years to date
Introduced new operational behaviours & mindset changes, dress rehearsals, large event preparation management processes
Introduced new SOPs, albeit manual, that improved platform stability & uptime and reduced the number of incidents raised
Transformed software engineering processes to full stack, paving the way for DevSecOps, removing silos across teams
Enabled development of in-house platform monitoring and intelligent dashboard tools improving NOC, 24/7 & command centre
Enhanced testing coverage driving load, performance & scalability testing, automated on production during off-peak times
Closed security gaps in architecture/implementation that reduced risk of revenue leakage & subscription management
Catalysed change in thinking in managing total cost of ownership: buy versus build, partner more, improving focus
Steered the group's transformation to cloud services, improving stability & uptime, e.g. Microsoft Azure Media Services delivery, AWS S3/CloudFront, Lambda, AWS Shield, Route 53, etc.
Reduced costs by up to 70% on core infrastructure, saving group in the order of R100m over three years
Resolved legacy technology components and services by offloading, de-supporting and transferring to other divisions
Demonstrated cloud innovation in streaming by enabling Africa’s first 4K/UHD streams over Azure Media Services for FIFA’18
Scaled platform capability doubling YoY to 10X+ MAUs with 50%+ increased engagement meeting KPIs
Introduced cloud transformation journey roadmap of at least 3 years: multi-DC app hosting, microservices, containers to AWS
Managed a large budget in excess of $30m, operating largely within budget as well as delivering on cost-saving KPIs
Created technical & financial model for forecasting platform costs, introducing cost per user economics never seen before
Introduced multi-CDN partnering capabilities for redundancy, cost & improving risk management for disaster recovery scenarios
Reduced total cost of ownership by moving away from purpose-built in-house systems to partnerships & leveraging cloud
Instigated enterprise technology transformation by optimising and transferring people/skills to other departments
Made difficult technology decisions that negatively impacted team morale but nevertheless maintained best business interests
Deftly & tactfully negotiated contracts resulting in better relationships, support plans & drove additional cost savings
Grew & improved relationships with vendors to be partner focused, thus enjoying more symbiotic relationships than before
Turned around governance for both internal and external audits, resulting in a green report within 12 months of a red report
Improved monitoring & execution of antipiracy initiatives by developing tools that enabled quick takedowns of pirate streams
Shaped the strategic direction of technical platform consolidation, at the risk of disrupting the status quo
Key contributor to enterprise-wide governance & steering councils: Risk, Governance, Architecture, Procurement & Cloud
Drove cross-group consolidation of competencies, tools & technologies promoting reuse and improving synergies
Drove modern application development methods like feature flagging, A/B testing & application logging, as well as a common cross-platform application development framework on React.JS
Set the org on a DevOps path, introducing key concepts of "Mission Control", operational excellence, dashboarding, tooling, large event planning, crisis management runbooks etc.
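
The availability percentages in the list above translate into yearly downtime with simple arithmetic. A minimal sketch, assuming a 365-day year and availability measured across the whole year; the function name is illustrative and not taken from any platform tooling:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def yearly_downtime_hours(availability_pct: float) -> float:
    """Return the hours of downtime per year implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR


for pct in (97.0, 98.0, 99.8, 99.9):
    hours = yearly_downtime_hours(pct)
    print(f"{pct}% available -> {hours:.2f} hours "
          f"({hours / 24:.1f} days) of downtime per year")
```

Under these assumptions, 97-98% availability corresponds to roughly 7 to 11 days of downtime a year, while 99.8-99.9% corresponds to roughly 9 to 17.5 hours.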

Tuesday, 17 August 2021