Last year we launched The Cloud Native Attitude, a short book that provides an easy-to-read grounding in modern infrastructure tools like Docker and Kubernetes and includes 3 case studies on real-life Cloud Native enterprises (The Financial Times, Skyscanner and ASOS). We’re about to re-release the book with 2 more case studies: the challenger bank Starling and ITV. In the run-up to KubeCon Copenhagen, where we’ll unveil the new book, we’ll be releasing the new chapters plus some bonus materials!
2 weeks ago we talked about how Starling built a bank in a year using AWS. Last week we talked about what you could learn about eventual consistency from Starling’s architecture. This week I’ll introduce the other new case study: ITV.
A Gripping Tech Drama
ITV PLC was created at the start of the 21st century through the merger of several large independent British TV franchises that had existed since the 1950s. It is now a major international presence with annual external revenues of ~£3.0B. Online, pay & interactive services make up a growing share of their business. Their online viewing is expected to be up by 34% in 2017 with 20m registered users of ITV Hub, ~151m hours of ITV programmes consumed online and 665m long form video requests.
Tech-wise, for the past four years ITV have been involved in an ambitious legacy migration story, from mainframes to cloud microservices, whilst recovering from the 2008 recession. Basically, their cloud migration is a nail-biting thriller.
Let’s start our tale at what looked like the end for ITV. For a company reliant on advertising revenues, the financial crisis was a disaster. Their share price dropped to 18 pence (it’s now around ten times that) and their survival depended on a radical new strategy. They hired a CEO with a vision to quickly restructure ITV. That included completely rethinking their tech.
Their first strategic move was a quick application of cost cutting and consolidation. They transferred the support and ongoing maintenance of their existing (legacy) code and data centres to a multinational outsourcing company. At this point, their traditional Waterfall processes helped them. They made it easier to offshore their tech and the move was a clear success for the company; they achieved their target cost reductions and better predictability. However, the strategy came with longer-term challenges.
In this new offshore world, ITV could make modest changes to their systems with a reliable 4 week turnaround, which was great for backend systems in relative stasis. However, as the economy recovered their vision turned to consumer-facing tech. To compete in that world, slow but predictable Waterfall would not be enough. As Tom Clark, their energetic Head of Common Platform, puts it, they decided to try a “2-speed system”:
- Continue with cheap and predictable offshore maintenance of internal, legacy systems that seldom changed using Waterfall methods
- Use a faster, Agile process for new online and consumer products like their ITV Hub.
As Tom says, “we believed ‘one size fits all’ didn’t fit anyone well”.
However, they hit problems. According to Tom, “The new development was as iterative and fast as we intended. The problem was operations.” In their new model the old, Waterfall-based operations teams still delivered the VMs for testing and launching the new online products. Waiting 4 weeks for this vital infrastructure rapidly became their new bottleneck.
ITV decided their next strategic step in increasing their feature velocity was to try using DevOps for their online products. They experimented with allowing the development teams to provision their own machines for test and production on Amazon Web Services (AWS) through a series of proof of concept (PoC) deployments. These PoCs were a huge success. So, they decided to step back and re-assess their tech strategy once again.
It was clear the combination of Cloud and DevOps could have a significant, positive impact on the speed of development of their consumer-facing products and it was obvious that ITV should embrace these for online. Great. However, it was also becoming clear that ITV’s legacy internal systems like talent payment, content delivery and rights management were falling behind those of the rest of the industry and of new entrants. A Waterfall approach was keeping those applications stable but that wasn’t enough.
They decided the Agile, Cloud and DevOps strategy they had trialled for consumer products needed to be extended to the legacy services their business had relied on for decades. They chose to apply what they’d learned from online to their back office systems.
At this point, ITV’s perspective as a company with a lengthy history behind them and hopefully a similar future ahead of them paid off. They took the long view.
They had used ITV Hub (their streaming video service) to learn and build expertise in the cloud. Now they needed to extend this new infrastructure across their organisation. They followed a three-step process:
- Identify legacy migrations with potentially strong ROI.
- Experiment using MVPs and assess ease and risk.
- Move the applications (or parts of applications) to the cloud IF there was good bang for their buck (payoff).
Eventually, they would migrate much of their legacy code but they needed to pick a place to start. This was a classic mix of technical, cultural and business strategic decision making.
Following this process caused ITV to rapidly realise they needed a “Common Platform” built on AWS for product-based dev, test and deployment. Like many early adopters, back in 2014 they had to build their own.
Their Common Platform V1 was primarily technology, but a common team structure also evolved organically around it, unifying agile development, infrastructure operations and autonomous ownership:
- 2+ developers
- 1 Scrum master
- 1 Product owner
- 1 Platform engineer (DevOps)
The platform engineer played a crucial role in every team, handling:
- Initially On Call (first responder).
- Team efficiency (automation and scripting).
- Quality (operationally, what’s going to go wrong?).
Tech-wise, ITV’s Common Platform V1 was based on application isolation via AWS instances, CentOS, and automation using Terraform, Puppet, Jenkins and home-grown scripts. Initially, there was no use of containers or off-the-shelf orchestrators. As platform providers to the ITV development teams, they didn’t mandate architecture (although microservice architectures are common amongst the platform users).
The Common Platform offers services to the development teams, which are recommended but not mandated. They strongly advise the dev teams to use Postgres (provided via RDS) and they have their own managed RabbitMQ clusters, for example.
They chose to self-manage Rabbit over AWS’ own managed queue service (SQS) because of one of their initial guiding principles: there must be a fallback. They used PaaS offerings only where the underlying tech was Open Source. They therefore always had a potential escape plan of operating it themselves. For that reason, ITV’s Common Platform does not expose AWS’ SQS, Aurora or DynamoDB managed services to the dev teams.
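The fallback rule can be read as a simple filter over candidate services. The following is a minimal sketch of that reading, not ITV’s actual tooling: the `Service` class, the `has_fallback` function and the `open_source_core` flag are hypothetical names, and the classification of each service simply follows the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical catalogue entry; field names are illustrative assumptions."""
    name: str
    managed_by_vendor: bool  # True if consumed as a cloud-managed offering
    open_source_core: bool   # True if the underlying tech could be self-hosted


def has_fallback(service: Service) -> bool:
    """A service passes the rule if the team could operate the tech themselves."""
    return service.open_source_core


# Classification follows the article's description, not an official AWS statement.
CANDIDATES = [
    Service("Postgres on RDS", managed_by_vendor=True, open_source_core=True),
    Service("RabbitMQ (self-managed)", managed_by_vendor=False, open_source_core=True),
    Service("SQS", managed_by_vendor=True, open_source_core=False),
    Service("Aurora", managed_by_vendor=True, open_source_core=False),
    Service("DynamoDB", managed_by_vendor=True, open_source_core=False),
]

# Services the platform would offer to dev teams under this rule.
exposed = [s.name for s in CANDIDATES if has_fallback(s)]
# Services held back because there is no self-hosting escape plan.
withheld = [s.name for s in CANDIDATES if not has_fallback(s)]

print("exposed:", exposed)
print("withheld:", withheld)
```

Note that under this rule a vendor-managed service is acceptable (Postgres on RDS passes) as long as the underlying technology could be run elsewhere.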
As well as services, the platform provides diagnostics: alerting, monitoring, logging and observability.
- Sensu for operational monitoring (alongside PagerDuty and Slack).
- Telegraf, Influx and Grafana for metrics.
- The ELK stack (Elasticsearch, Logstash, Kibana) for log aggregation.
They have found it incredibly useful to maintain a strict separation of dev alerts from prod ones. The teams never get alerts for stuff they cannot (or should not) fix.
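One way to picture this separation is as a routing function keyed on the owning team and the environment. This is only an illustrative sketch: the `Alert` class, the `route` function, the channel naming scheme and the choice of paging for prod versus chat for dev are all assumptions, not a description of ITV’s Sensu configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; the fields are assumptions for illustration."""
    team: str         # team whose stack raised the alert
    environment: str  # "dev" or "prod"
    message: str


def route(alert: Alert) -> str:
    """Return an illustrative destination so dev and prod alerts never mix.

    Assumption: prod alerts page the owning team, while dev alerts go to
    that team's own chat channel. No alert is ever sent to another team.
    """
    if alert.environment == "prod":
        return f"pagerduty:{alert.team}"
    return f"slack:#{alert.team}-dev-alerts"


# "hub" is a made-up team name used only for this example.
print(route(Alert(team="hub", environment="prod", message="queue depth high")))
print(route(Alert(team="hub", environment="dev", message="test VM disk full")))
```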
The Common Platform V1 was a success and ITV are now thinking about V2. The aim of V2 is to adopt off-the-shelf technology wherever possible to replace the homegrown, i.e. the V2 strategy is to move as much as possible of the Common Platform to commodity tech. They intend to embrace containers and an open source orchestrator alongside carefully considered constraints on service behaviour.
There are many concepts from the Common Platform V1 that have been very successful and that ITV will maintain with the move to V2, including:
- A concept of “blast radius reduction”, where every team’s stack is completely isolated from every other team’s, so issues with one service cannot impact the running of another service.
- This is true even for monitoring, alerting and diagnostics. There are no common instances; instead, each team has its own duplicates.
There are pros and cons to this choice of duplication over commonality. The downsides are increased costs in hosting and management for these service duplicates. However, in Clark’s experience those downsides are outweighed by the benefits of decoupling on stability and on speed of development. A major “side channel” benefit of the duplication is a reduction in monitoring noise. In alerts their teams only see events generated by their own systems; they never need to worry about issues generated by other teams.
Looking back, is there anything they would have done differently? With 20/20 hindsight, they realise their fear of vendor lock-in did hold them back. The overhead of remaining completely cloud-agnostic was high. In future, they may decide to just use a vendor service by default and pay to move if necessary later. Of course, a few years ago we had no idea that the vendor services were going to remain relatively inexpensive and develop at the rate they have.
Overall, ITV’s migration from extreme legacy (fifty years!) to the cloud has been a fascinating story arc with a happy ending. I’ll be very interested to see what happens in season two.