Last year we launched The Cloud Native Attitude, a short book describing modern infrastructure tools like Docker and Kubernetes, which included 3 case studies on real-life Cloud Native enterprises (The Financial Times, Skyscanner and ASOS). We’re about to re-release the book with 2 new chapters and 2 more case studies: the challenger bank Starling and ITV. In the run-up to KubeCon, where we’ll unveil the new book, we’ll be releasing the new chapters plus some bonus materials! We’re starting with the story of Starling Bank. How on earth did they build a bank in a year?!
Who Are Starling Bank and What Do They Do?
Starling Bank was founded in 2014. Based in London, it has been licensed and operating since July 2016. The bank is a successful part of the British Fintech scene, which is a spin-off from the UK’s strong financial services sector.
Starling are a mobile-only challenger bank who describe themselves as a “tech business with a banking licence”. They provide a full-service current account accessed solely from Android and iOS mobile devices.
They received $70m of investment in early 2016.
Starling’s tech comprises a cloud-hosted backend system that talks to apps on users’ mobile phones and to third-party services.
As well as a full current account, the bank provides MasterCard debit cards (customers spend money on their Starling debit card and the authorisations and debits arrive at Starling servers through third-party systems). They also support direct debits, standing orders and faster payments, which are again provided by backend integrations with other third-party systems.
Starting in 2016, Starling created their core infrastructure on Amazon Web Services (AWS) in just 12 months. Their highly articulate CTO Greg Hawkins likes to say, “we built a bank in a year”.
In common with everyone I’ve interviewed for this series of case studies, Starling use a microservices architecture of independent services interacting via clearly defined APIs. As of March 2018 they have ~20 Java microservices. That number will increase.
Many companies architect their services for division by team. This is known as a “Conway’s Law” approach, where each team looks after one or more dedicated microservices. However, like the FT, Starling have chosen not to do this for now. Instead, they have divided services by functional responsibility rather than by team, and every service can be worked on by multiple teams. They operate this way because they can. As Hawkins puts it, “we’re taking advantage of the flexibility we get from our small size – we can reconfigure ourselves very quickly.” He recognises that they will lose some of that flexibility as they continue to grow (“it won’t last forever”), at which point they will adopt smaller microservices and a more Conway-like model.
Deployment and Operations
Whilst services can be deployed individually, for convenience Starling usually use a simultaneous deployment approach where all services in the backend are deployed at once. This is a trade-off that has evolved between minimising the small amount of overhead around releases and keeping release frequency up. They built a rudimentary orchestrator themselves to drive rolling deploys based on version number changes (scale up AWS, create new services on the new instances, expose those new services instead of the old ones, turn off the old ones and scale down their AWS instances).
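To make that sequence concrete, here is a minimal sketch of how a version-driven rolling deploy might be planned. It is not Starling’s orchestrator: the class name, method names, service names and step descriptions are all hypothetical, and the steps are returned as plain strings rather than real AWS calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a version-driven rolling deploy plan (not Starling's code). */
class RollingDeploySketch {

    /** Returns the ordered steps to move one service to the desired version. */
    static List<String> plan(String service, String running, String desired) {
        List<String> steps = new ArrayList<>();
        if (desired.equals(running)) {
            return steps; // version number unchanged, so nothing to deploy
        }
        steps.add("scale up instances for " + service);
        steps.add("start " + service + " " + desired + " on the new instances");
        steps.add("expose " + service + " " + desired + " instead of the old version");
        if (running != null) { // a brand new service has nothing to turn off
            steps.add("turn off " + service + " " + running);
            steps.add("scale down the old instances for " + service);
        }
        return steps;
    }

    /** Plans a whole-estate deploy; only services whose version changed get steps. */
    static List<String> planAll(Map<String, String> running, Map<String, String> desired) {
        List<String> steps = new ArrayList<>();
        for (Map.Entry<String, String> e : desired.entrySet()) {
            steps.addAll(plan(e.getKey(), running.get(e.getKey()), e.getValue()));
        }
        return steps;
    }

    public static void main(String[] args) {
        // Assumed example data: only "payments" has a new version number.
        Map<String, String> running = Map.of("payments", "1.4.0", "cards", "2.0.1");
        Map<String, String> desired = Map.of("payments", "1.5.0", "cards", "2.0.1");
        planAll(running, desired).forEach(System.out::println);
    }
}
```

Driving everything from version numbers means a simultaneous deploy and an individual deploy are the same operation: services whose version has not changed simply produce no steps.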
Starling generally re-deploy their whole estate 4-5 times per day to production. So, new functionality reaches prod rapidly, and it’s business-as-usual to apply security patches fast when necessary.
As always, API management is a tough challenge for frequent deployments. You could argue (naively) that simultaneous deployment makes this easier because you are always re-deploying both sides of your API at once, but this isn’t really true for several reasons:
- Starling don’t mandate simultaneous deployment and retain the ability to deploy services individually. Simultaneous deploy is a convenience that will change as the organisation grows.
- During the minutes a deployment takes to roll across all the servers, services are inevitably at different versions.
- Any individual service may fail to deploy, leaving mismatching versions in production.
The system must handle all of this safely, which means clients and services must incorporate backwards API compatibility. To ensure this, part of their release process is validation that there are no breaking API changes (this is straightforward to check using the Swagger tool combined with the fact that their client-side calls are in isolated “connector” libraries). As their system size has increased, they’ve also started introducing Pact to help.
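As a rough illustration of what “no breaking API changes” means in practice, the sketch below compares an old and a new description of an API and flags anything that was removed. It is a simplified stand-in, not how Swagger or Pact work: the API is modelled as a hypothetical map from operation name to response field names, and only removals are treated as breaking.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a backwards-compatibility check between two API versions. */
class ApiCompatibilitySketch {

    /**
     * Returns a description of every breaking change: an operation or response
     * field that exists in the old API but is missing from the new one.
     * Additions are allowed because existing clients can ignore them.
     */
    static Set<String> breakingChanges(Map<String, Set<String>> oldApi,
                                       Map<String, Set<String>> newApi) {
        Set<String> problems = new TreeSet<>();
        for (Map.Entry<String, Set<String>> op : oldApi.entrySet()) {
            Set<String> newFields = newApi.get(op.getKey());
            if (newFields == null) {
                problems.add("removed operation: " + op.getKey());
                continue;
            }
            for (String field : op.getValue()) {
                if (!newFields.contains(field)) {
                    problems.add("removed field: " + op.getKey() + " -> " + field);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Assumed example API: one operation whose "balance" field is dropped.
        Map<String, Set<String>> oldApi = Map.of("GET /accounts/{id}", Set.of("id", "balance"));
        Map<String, Set<String>> newApi = Map.of("GET /accounts/{id}", Set.of("id"));
        breakingChanges(oldApi, newApi).forEach(System.out::println);
    }
}
```

A release gate built on this idea would fail the build whenever the returned set is non-empty, so a mixed-version estate mid-rollout never sees a client calling something that no longer exists.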
From the start, the bank used Docker containers as a packaging format, and EC2 instances as their “units of isolation” (i.e. to separate one running service from another). They do not yet use containers as their primary form of isolation, although they do use them to isolate some specific processes such as components of their monitoring application Prometheus. They also don’t use a container orchestrator. However, they are looking closely at the popular open source orchestrator Kubernetes (K8s). Specifically, they are interested in:
- the abstraction it provides to machines and applications, which helps with portability (going cross-cloud)
- the cost savings and improved performance they would get from using containers as their units of application isolation and running on larger VM instances
- the sophisticated additional deployment options that Kubernetes provides.
Starling have made a strategic decision not to take on the operational overhead of managing Kubernetes themselves on AWS, but they are closely watching the progress of AWS’s managed K8s service (EKS) and are likely to use that in the future if it reaches their required level of functionality and stability. That’s not a crazy decision: there is significant ops work involved in managing Kubernetes on AWS yourself.
Cloud-wise, Starling’s infrastructure is entirely hosted on Amazon and they are happy there. However, regulatory requirements and commercial considerations mean they’ll need to diversify into cross-cloud in the future. They’re therefore beginning to work with Google Cloud too, but that has a few interesting challenges. Google Cloud is more advanced than AWS in some areas but way behind in others.
- Google’s managed Kubernetes service, GKE, is currently much better than EKS.
- Starling have built a lot of custom advanced security features like temporary privilege raising on top of AWS’ strong APIs that will need to be re-implemented for Google.
Like any company choosing to use multiple cloud vendors, Starling will need to balance the value of consistent operations against the desire to get the best out of both clouds.
Stack-wise, Starling are a Java house.
- They deploy their ~20 Java services with an embedded web server inside Docker containers.
- They configure their estate using CloudFormation plus homegrown scripting.
- They make heavy use of the Nginx load balancer and ELBs.
- They use an ELK (Elasticsearch, Logstash, Kibana) stack for logging and Grafana and Prometheus for monitoring.
- On the client-side (mobile applications), they use Java for Android and Swift for iOS.
Architecturally, Starling have an interesting set of needs:
- security and data integrity are their highest priorities
- they have some room for manoeuvre on performance.
Security-wise, they sensibly make extensive use of service isolation at the network level (aka “microsegmentation”) for which they use separate VPCs as well as subnets. They encrypt all data in transit and at rest, and their inter-service communication is via encrypted RESTful interfaces. They also have a strong focus on user-device security. Specifically:
- They guarantee that they are always talking to the correct, original device (achieved through private keys).
- They are careful to ensure the device has not been compromised. This is why, for example, you cannot run their apps on jailbroken devices.
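The article only says that device identity is “achieved through private keys”, so the mechanics below are an assumption: a common way to do this is a challenge-response, where the server sends a random challenge, the device signs it with a private key that never leaves the device, and the server verifies the signature against the public key registered for that device. The sketch uses the standard `java.security` APIs; the class and method names are hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import java.security.Signature;

/** Hypothetical challenge-response sketch; not Starling's actual protocol. */
class DeviceChallengeSketch {

    /** Device side, at registration: create a key pair; only the public half is sent to the server. */
    static KeyPair newDeviceKeyPair() throws GeneralSecurityException {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("EC");
        generator.initialize(256);
        return generator.generateKeyPair();
    }

    /** Server side: a fresh random challenge, so an old signature cannot be replayed. */
    static byte[] newChallenge() {
        byte[] challenge = new byte[32];
        new SecureRandom().nextBytes(challenge);
        return challenge;
    }

    /** Device side: prove possession of the private key by signing the challenge. */
    static byte[] sign(PrivateKey devicePrivateKey, byte[] challenge) throws GeneralSecurityException {
        Signature signer = Signature.getInstance("SHA256withECDSA");
        signer.initSign(devicePrivateKey);
        signer.update(challenge);
        return signer.sign();
    }

    /** Server side: check the signature against the public key registered for the device. */
    static boolean verify(PublicKey registeredKey, byte[] challenge, byte[] signature)
            throws GeneralSecurityException {
        Signature verifier = Signature.getInstance("SHA256withECDSA");
        verifier.initVerify(registeredKey);
        verifier.update(challenge);
        return verifier.verify(signature);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        KeyPair device = newDeviceKeyPair();
        byte[] challenge = newChallenge();
        byte[] signature = sign(device.getPrivate(), challenge);
        System.out.println("original device accepted: " + verify(device.getPublic(), challenge, signature));
    }
}
```

On a real phone the private key would typically live in hardware-backed storage rather than in application memory, which is one reason a compromised (jailbroken) device undermines the guarantee.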
Starling offer user-facing services that are latency-sensitive. However, their own backend is seldom the performance bottleneck for those services. Card transactions, for example, pass through several layers of third-party systems before reaching the bank’s servers. Starling’s systems generally introduce <5% of the latency on these performance-sensitive operations.
So, Starling’s backend perf would have to be very severely impacted before it was noticeable to end users: if the backend contributes under 5% of the total, even doubling its latency adds less than 5% end to end. They can therefore afford to optimise their architecture for robustness, simplicity, auditing, and data integrity rather than super-speed.
This high need for resilience and auditing and the slightly lower requirement for operational performance influenced their decision to use asynchronous APIs between their services. Each service has its own in-built asynchronous inbound command bus (a kind of queue) backed by a database. This architecture provides reliable message passing, rigorous decoupling, resilience, auditability and replayability as well as better understandability for the system. Given their operational priorities, asynchronous APIs were a sensible choice.
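To show why a database-backed command bus gives you those properties, here is a deliberately tiny sketch. It is not Starling’s implementation: an in-memory map stands in for the database table, and the class, method and status names are hypothetical. The key ideas are that a command is persisted before anything acts on it, duplicates are rejected by ID, and a failed command stays pending so it can be retried.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch of a persistent inbound command bus (a map stands in for the database). */
class CommandBusSketch {
    enum Status { PENDING, DONE }

    static final class Command {
        final String id;
        final String payload;
        Status status = Status.PENDING;

        Command(String id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    // Stand-in for a database table keyed by command id, kept in arrival order.
    private final Map<String, Command> table = new LinkedHashMap<>();

    /** Persists the command before any processing; returns false for a duplicate id. */
    boolean accept(String id, String payload) {
        if (table.containsKey(id)) {
            return false; // already recorded, so a resend is harmless
        }
        table.put(id, new Command(id, payload));
        return true;
    }

    /** Runs the handler on every pending command; returns how many completed. */
    int processPending(Consumer<String> handler) {
        int completed = 0;
        for (Command command : table.values()) {
            if (command.status != Status.PENDING) {
                continue;
            }
            try {
                handler.accept(command.payload);
                command.status = Status.DONE;
                completed++;
            } catch (RuntimeException e) {
                // Leave the command PENDING so a later pass can retry it.
            }
        }
        return completed;
    }

    /** Every command ever accepted, with its current status: the audit trail. */
    List<String> auditLog() {
        List<String> lines = new ArrayList<>();
        for (Command command : table.values()) {
            lines.add(command.id + " " + command.status + " " + command.payload);
        }
        return lines;
    }
}
```

Because the sender only needs the command to be recorded, not executed, the receiving service can be slow or briefly unavailable without losing work, and the stored commands double as an audit trail that can be inspected later.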
From a testing perspective, Starling Bank embraced Chaos from a very early stage. Chaos engineering is an approach to testing critical systems that was pioneered by Netflix. The idea is that testing to destruction happens in production. Sounds crazy? Actually it makes good sense. It tests not only your functionality but also whether your production system can quickly identify and recover from issues. It’s like meta-testing. It’s painful at first, so it’s one to start gradually!
Starling are very happy with the architectural choices they have made so far. Again, they demonstrate that not everyone needs to or should make identical decisions.
They have chosen not to orchestrate in production yet. They judge that this costs them money in hosting but currently makes their operations simpler. Once AWS release a stable EKS, Starling are likely to move to managed Kubernetes on AWS. That is a perfectly sensible approach that works well for them.
They have chosen async over synchronous inter-service comms because they prioritise auditability and reliability over hyper-performance. Again, they have weighed the trade-offs and made a decision based on a good understanding of their own current situation and needs.
Their original motivation for hosting in the Cloud was that Hawkins anticipated it would help Starling move faster. He felt that Infrastructure as a Service, by supporting DevOps and an iterative approach, would help him create an innovation culture in his tech teams. That very much appears to have paid off.
Overall, Starling Bank seems to be an excellent example of the need to consider context when making architectural choices. They seem a sensible bunch. I’m very tempted to move my current account there!