Scaling Cloud Infrastructure

Christophe Popov
4 min read · Sep 29, 2022
(Image credit: Snap Inc.)

I recently read an interview with Jerry Hunter, SVP of Engineering at Snap, about Snapchat’s scaling journey. I have previously written about startup scaling, but I found Snapchat’s journey particularly interesting because very few startups reach that level of scale.

Snap’s journey: from serverless to multi-cloud

Jerry Hunter describes how, in 2016, the original Snapchat back end (the company has since been renamed Snap Inc.) was a monolith deployed on Google App Engine (GAE):

“Snap, which launched in 2011, was built on GAE — FSN (Feelin-So-Nice) was the name for the original back-end system — and the majority of Snapchat’s core functionality was running within a monolithic application on it. While the architecture initially was effective, Snap started encountering issues when it became too big for GAE to handle”

Even at that time they were facing scaling issues, especially during traffic peaks. Snap was Google’s biggest deployment on GAE, and Google engineers were working around the clock to support it: every other customer ran at a smaller scale, so the platform had never been exposed to that kind of load before.

The company has since transitioned to a multi-cloud, microservices-based architecture:

“Today, less than 1.5% of Snap’s infrastructure sits on GAE, a serverless platform for developing and hosting web applications, after the company broke apart its back end into microservices backed by other services inside of Google Cloud Platform (GCP) and added AWS as its second cloud computing provider. Snap now picks and chooses which workloads to place on AWS or GCP under its multicloud model, playing the competitive edge between them.”

They split their services into three broad layers (a minimal sketch follows the list):

  • a cloud-agnostic layer, with services that can be deployed on any compute platform
  • services that depend on a specific provider but are easy to swap out
  • cloud-specific services, such as AWS DynamoDB
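To make the layering concrete, here is a minimal Python sketch of the idea. It is illustrative only and not Snap’s actual code: the `KeyValueStore` interface, the class names, and the `pk` key attribute are all hypothetical, and only the bottom class depends on a provider SDK (boto3 for DynamoDB).

```python
# Hypothetical sketch of the three-layer split; not Snap's code.
from abc import ABC, abstractmethod


class KeyValueStore(ABC):
    """Cloud-agnostic layer: the application only ever talks to this interface."""

    @abstractmethod
    def get(self, key: str) -> dict | None: ...

    @abstractmethod
    def put(self, key: str, value: dict) -> None: ...


class InMemoryStore(KeyValueStore):
    """Provider-independent implementation (tests, portable workloads)."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, key: str) -> dict | None:
        return self._data.get(key)

    def put(self, key: str, value: dict) -> None:
        self._data[key] = value


class DynamoDBStore(KeyValueStore):
    """Cloud-specific layer: pinned to AWS DynamoDB."""

    def __init__(self, table_name: str) -> None:
        import boto3  # AWS SDK; only this class knows about it

        self._table = boto3.resource("dynamodb").Table(table_name)

    def get(self, key: str) -> dict | None:
        return self._table.get_item(Key={"pk": key}).get("Item")

    def put(self, key: str, value: dict) -> None:
        self._table.put_item(Item={"pk": key, **value})
```

With this kind of split, moving a workload to another provider means adding another implementation of the same interface rather than touching application code.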

They also make great use of regional cloud data centres and CDNs in order to reduce latency worldwide.

Currently, the focus is on cost optimisation, both on the front-end side (the app now refreshes only the data that actually needs to be refreshed) and on the back-end side. At that scale, the savings can be massive.
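As a rough illustration of the front-end idea, here is a hypothetical “refresh only what changed” sketch; nothing here is Snap’s actual client code, and `fetch_updates(since=...)` is an assumed backend endpoint.

```python
import time


class FeedClient:
    """Hypothetical client that pulls only the delta since the last refresh."""

    def __init__(self, api) -> None:
        self._api = api        # assumed backend client exposing fetch_updates()
        self._last_sync = 0.0  # epoch seconds of the last successful refresh

    def refresh(self) -> list:
        # Ask only for items updated since the last sync instead of the full feed.
        updates = self._api.fetch_updates(since=self._last_sync)
        self._last_sync = time.time()
        return updates
```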

Snap uses open-source software to create its infrastructure, including Kubernetes for service deployment, Spinnaker for its application team to deploy software, Spark for data processing and memcached/KeyDB for caching. “We have a process for looking at open source and making sure we’re comfortable that it’s safe and that it’s not something that we wouldn’t want to deploy in our infrastructure,” Hunter said.
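As a small illustration of the caching piece, below is the common cache-aside pattern against a plain memcached server, using the pymemcache library. It is a generic sketch rather than Snap’s setup, and `load_from_db` is a hypothetical loader supplied by the caller.

```python
import json

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))  # assumes a local memcached server


def get_user_profile(user_id: str, load_from_db) -> dict:
    """Cache-aside: try the cache, fall back to the database, then repopulate."""
    cached = cache.get(user_id)
    if cached is not None:
        return json.loads(cached)
    profile = load_from_db(user_id)                      # hypothetical DB loader
    cache.set(user_id, json.dumps(profile), expire=300)  # keep for 5 minutes
    return profile
```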

Snap also uses Envoy, an edge and service proxy and universal data plane designed for large microservice service mesh architectures.

“I actually feel like … the way of the future is using a service mesh on top of your cloud to basically deploy all your security protocols and make sure that you’ve got the right logins and that people aren’t getting access to it that shouldn’t. I’m happy with the Envoy implementations giving us a great way of managing load when we’re moving between clouds.” — Hunter said

(Figure: Snap’s service mesh architecture)

AWS Well-Architected Framework

The AWS Well-Architected Framework was developed by Amazon to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. It is built around six pillars: operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability. I have previously summarised its key ideas. How does the framework apply to Snap’s cloud architecture?

Operational Excellence: Snap’s business demands round-the-clock, real-time service across the globe. The team leverages mature open-source tooling (Spinnaker, Kubernetes, Envoy) to deploy workloads continuously and to monitor them, scale them, and keep them running.

Security: the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies. Snap relies on Envoy and on the shared responsibility model (AWS and GCP look after the physical infrastructure) to secure its services.

Reliability: the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues. This aspect is handled by Kubernetes and by the microservices architecture itself.

Performance Efficiency: the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve. This is ensured by the elastic scaling of a microservices architecture deployed on the cloud, as in the sketch below.
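For a sense of what that elasticity looks like in practice, here is a minimal sketch that registers a HorizontalPodAutoscaler through the official Kubernetes Python client. The deployment name, replica bounds, and CPU target are hypothetical; this shows a generic Kubernetes pattern, not Snap’s configuration.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes local kubeconfig access to the cluster

# Scale the (hypothetical) "snap-like-service" deployment between 2 and 20
# replicas, targeting 70% average CPU utilisation.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="snap-like-service"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="snap-like-service"
        ),
        min_replicas=2,
        max_replicas=20,
        target_cpu_utilization_percentage=70,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```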

Cost Optimization: Snap is putting effort into optimising for cost, but Jerry Hunter also warns against premature optimisation.

Conclusion

Serverless deployments and monolithic architectures can be great in the early stages of a startup: they allow a team to go to market quickly and at low cost. At present, however, serverless is not ideal for very large-scale deployments; there, a microservices-based architecture deployed on Kubernetes and organised in a service mesh is the norm.

References

How Snap rebuilt the infrastructure that now supports 347 million daily users

AWS Well-Architected Framework https://aws.amazon.com/architecture/well-architected/

Spinnaker https://spinnaker.io/

Kubernetes https://kubernetes.io/

Envoy Proxy https://www.envoyproxy.io/

Service mesh data plane vs. control plane
