Feb 19, 2021

How Guiding Principles Help Us Scale Engineering Operations

Loom experienced two periods of hyper-growth in 2020, ending the year with our core business metrics having grown several times over. With that growth came scaling challenges, especially on our primary database. During this period we learned that as our platform grew in complexity, query access patterns that had performed fine were becoming bottlenecks.

The growth phase that began in August 2020 in particular forced a number of optimizations in our data layer, sooner than we had planned for. To address these challenges, the Infrastructure team leveraged our two guiding principles:

  • Do less, well. 🚀 

  • Buy first, build later. ⚡️

In this post, I’ll share how we applied these guiding principles to navigate a high-growth period and work through engineering complexity while moving quickly and continuing to deliver value to our customers (our main goal) as Loom grows.

Struggling to keep the lights on during high growth in 2020

In August 2020, we noticed that the application was experiencing severe latency during our peak usage hours. We were able to tie these latencies back to the provisioned IOPS (PIOPS) on our Postgres RDS instance being fully exhausted.

HTTP status codes screenshot
The figure above shows the increase in traffic at our ingress layer, grouped by HTTP status code.

We had done upgrades in the past to secure dedicated resources in the form of provisioned IOPS, which were generous for our scale at the time. However, as we dug deeper into the root causes of the latency and its contributing factors, we came across many areas for improvement in our data layer, and we decided to be more strategic in this area.

As the latency issue unfolded, Loom engineers from various teams found some quick wins and optimizations that bought us breathing room.

  • What we found: Once a JSONB value pushes a row past Postgres's TOAST threshold (roughly 2 KB by default), the value is compressed, chunked, and stored out of line in a TOAST table 🍞. Reads and writes to these columns incur heavy PIOPS.

  • The solution: We migrated large JSONB columns into their own table(s) or moved them to various cache clusters, wherever it made sense (see the sketch after this list).

  • The impact: To avoid any further impact on our application from fully exhausted PIOPS during peak usage hours, we introduced read replicas for our primary RDS instance. Introducing the read replicas was an immediate, big win, since we could now distribute the PIOPS load for reads across multiple instances. 🎉
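
To make the column split concrete, here is a minimal sketch using node-postgres. The `videos` table and its `metadata` JSONB column are hypothetical stand-ins, not our actual schema, and a real migration would batch the backfill.

```typescript
// Hypothetical sketch: splitting a wide JSONB column out of a hot table so
// TOAST reads/writes stop competing for PIOPS on the main table.
// Table and column names are illustrative only.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function splitOutJsonbColumn(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // 1. New side table keyed by the parent row's id.
    await client.query(`
      CREATE TABLE IF NOT EXISTS video_metadata (
        video_id uuid PRIMARY KEY REFERENCES videos (id),
        payload  jsonb NOT NULL
      )
    `);

    // 2. Backfill existing values (batched in a real migration).
    await client.query(`
      INSERT INTO video_metadata (video_id, payload)
      SELECT id, metadata FROM videos WHERE metadata IS NOT NULL
      ON CONFLICT (video_id) DO NOTHING
    `);

    // 3. Drop the wide column so the hot table's rows stay small.
    await client.query("ALTER TABLE videos DROP COLUMN metadata");

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```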

Now that we had bought ourselves some time, we took the lessons learned from our August 2020 growth and decided to be proactive about how we would handle the next phase of growth.

Loom Co-founder and CTO Vinay Hiremath created a dedicated project called Project DB Scale. Engineers across the org rotated in and out of this project over time. The goals of this time-boxed project (Q3 and Q4 2020) were to identify the bottlenecks in our data layers, ensure our data stores were properly provisioned, improve instrumentation and monitoring, and finally address the issues we were seeing from the replication lag.

How we implemented a two-pronged approach to reduce replication lag and build a unified data access layer

After we introduced read replicas to our Postgres RDS instance, replication lag became a growing concern for us. Occasionally we noticed odd behavior in our product experience and errors from reading stale row versions. At the time, the replication lag was breaching our already permissive SLO (service level objective) of 700ms over a 5-minute window a few times a week.

We decided to address the replication lag with a two-pronged approach:

  • Provide a standardized data access layer that leverages a read-through and write-through cache (something we refer to as flat model caching) and has built-in strategies for handling inevitable replication lag (see the sketch after this list).

  • Start looking into the fundamental design of the underlying data store. While we made our application tolerant of these lags, we found that any SLO over 100ms would result in a suboptimal customer experience, so an SLO of 700ms (over 5 minutes!) was unacceptable to us.
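
As a rough illustration of the first prong, here is a minimal sketch of a read-through/write-through helper that pins reads to the primary for a short window after a write, so a lagging replica cannot serve a stale row. The model, the clients, and the pin window are assumptions for the example, not our production implementation.

```typescript
// Minimal read-through/write-through sketch with a simple replication-lag
// strategy: reads for a recently written key go to the primary instead of a
// possibly lagging replica. Names and the pin window are illustrative only.
import Redis from "ioredis";
import { Pool } from "pg";

const cache = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379");
const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL });

const PIN_TO_PRIMARY_MS = 5_000; // assumed window; tune to observed lag
const recentWrites = new Map<string, number>(); // key -> last write timestamp

export async function writeUser(id: string, name: string): Promise<void> {
  await primary.query("UPDATE users SET name = $2 WHERE id = $1", [id, name]);
  // Write-through: keep the cached object in sync with what we just wrote.
  await cache.set(`user:${id}`, JSON.stringify({ id, name }), "EX", 60);
  recentWrites.set(id, Date.now());
}

export async function readUser(id: string): Promise<unknown> {
  // Read-through: serve from cache when possible.
  const cached = await cache.get(`user:${id}`);
  if (cached) return JSON.parse(cached);

  // Recently written rows are read from the primary to dodge replica lag.
  const lastWrite = recentWrites.get(id) ?? 0;
  const pool = Date.now() - lastWrite < PIN_TO_PRIMARY_MS ? primary : replica;

  const { rows } = await pool.query(
    "SELECT id, name FROM users WHERE id = $1",
    [id],
  );
  if (rows[0]) await cache.set(`user:${id}`, JSON.stringify(rows[0]), "EX", 60);
  return rows[0] ?? null;
}
```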

Exploring solutions, leading with guiding principles

Keeping our guiding principles in mind, we started exploring alternative replica architectures, and Amazon Aurora emerged as a top contender. Amazon Aurora is wire-compatible with Postgres, and its replicas share the same storage volume as the primary node; these were some of the main reasons we wanted to move quickly in trying it out.

We started by writing an engineering RFC (request for comments), a technical document that outlines the problem, goal, potential solutions, pros and cons, and alternatives when evaluating solutions. The RFC made a compelling case for why Amazon Aurora was worth the effort: the service could provide significant performance advantages while keeping our core technology stack simple, at a low implementation cost, as Loom prepared for the next phase of growth.

While replication lag was the original motivation, we were also excited to tap into the other benefits Amazon Aurora would bring us, such as higher availability and durability. We then began the work to try Amazon Aurora in our environment.

Rolling out Amazon Aurora in staging

After the RFC was approved, we immediately decided to roll out Aurora to our staging environment. We also used this as a dry run for our production rollout. This involved:

  • Building and testing new tooling for taking the application into partial downtime. ⚙️

  • Writing new Terraform modules to bring up a new Aurora cluster from a snapshot, with automated monitors. 🧩

  • Re-deploying our PgBouncer setup (sticky connection poolers) without any impact to real-time query workloads. ✅

  • Running warm-up worker jobs for the new database (to populate the buffer cache and load the indexes); a rough sketch of one approach follows below. 🔋
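
As one possible shape for such a warm-up job, here is a minimal sketch using Postgres's pg_prewarm contrib extension, assuming the extension is available on the cluster; the relation names are placeholders, not our actual schema.

```typescript
// Minimal warm-up sketch: pull hot tables and indexes into the buffer cache
// before cutting traffic over. Relation names are placeholders, and this
// assumes the pg_prewarm extension is available on the cluster.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

const HOT_RELATIONS = ["videos", "videos_pkey", "users", "users_pkey"];

async function warmUp(): Promise<void> {
  await pool.query("CREATE EXTENSION IF NOT EXISTS pg_prewarm");

  for (const relation of HOT_RELATIONS) {
    // pg_prewarm loads the relation's blocks into shared buffers and returns
    // the number of blocks read.
    const { rows } = await pool.query(
      "SELECT pg_prewarm($1::regclass) AS blocks",
      [relation],
    );
    console.log(`warmed ${relation}: ${rows[0].blocks} blocks`);
  }

  await pool.end();
}

warmUp().catch((err) => {
  console.error("warm-up failed", err);
  process.exit(1);
});
```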

Our due diligence in applying our guiding principles paid off. The migration to Amazon Aurora in our staging environment was smooth, and we came away with a list of improvements for our tooling and a launch-day checklist to keep cognitive load to a minimum. We were aiming for a very boring and uneventful launch day. 🥳

Gaining confidence before shipping through load testing and query replay

At this point, we hadn't come across a single issue on staging, which left our team feeling nervous. We knew the volume on our production application was much higher than in our staging environment, and relying only on signals from staging wouldn't be wise.

We were also nervous about shortcomings we might only see later, despite Aurora being wire-compatible with Postgres. We needed a way to gain confidence, prior to the production rollout, that the main motivator for this migration, driving down our replication lag, would yield a significant and observable improvement.

Amazon publishes standard load-test figures for Aurora that show significant gains over stock PostgreSQL performance, and while they look impressive, we wanted to know how Aurora would perform against our application's specific read/write patterns. We fell back on our guiding principle of "Do less, well" and discarded ideas that involved any significant application rewrites or spinning up a new project. Avoiding major application rewrites or refactoring critical parts of the application was a goal in the engineering RFC.

We explored various options, such as traffic mirroring and dual writing, but each came with its own added complexity within the application, additional engineering effort, or limited reproducibility guarantees. We then came across pgreplay, which reads a PostgreSQL log file (not a WAL file), extracts the SQL statements, and executes them in the same order and with the original timing against a PostgreSQL database. The pgreplay option seemed attractive to us, and we tried it out. We were then able to compare metrics like CPU utilization, replication lag, IOPS, commit throughput, and much more side by side for the same time frame against our live database. 🎉
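
To give a flavor of the side-by-side comparison, here is a minimal sketch that pulls the same RDS CloudWatch metric for the live instance and the replay target using the AWS SDK. The instance identifiers, region, and time windows are placeholders; this is not our actual tooling.

```typescript
// Minimal sketch: fetch the same CloudWatch metric for the live database and
// the Aurora replay target so the two series can be lined up side by side.
// Instance identifiers, region, and time windows are placeholders.
import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({ region: "us-east-1" });

async function fetchCpu(instanceId: string, start: Date, end: Date) {
  const result = await cloudwatch.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/RDS",
      MetricName: "CPUUtilization",
      Dimensions: [{ Name: "DBInstanceIdentifier", Value: instanceId }],
      StartTime: start,
      EndTime: end,
      Period: 60, // one datapoint per minute
      Statistics: ["Average"],
    }),
  );
  return (result.Datapoints ?? []).sort(
    (a, b) => (a.Timestamp?.getTime() ?? 0) - (b.Timestamp?.getTime() ?? 0),
  );
}

async function compare(): Promise<void> {
  // The live window covers the captured peak-traffic period; the replay window
  // covers whenever the captured queries were replayed against Aurora.
  const live = await fetchCpu(
    "live-postgres-instance",
    new Date("2020-11-01T16:00:00Z"),
    new Date("2020-11-01T18:00:00Z"),
  );
  const replay = await fetchCpu(
    "aurora-replay-instance",
    new Date("2020-11-02T16:00:00Z"),
    new Date("2020-11-02T18:00:00Z"),
  );

  live.forEach((point, i) => {
    console.log(point.Timestamp?.toISOString(), point.Average, replay[i]?.Average);
  });
}

compare().catch(console.error);
```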

Replica lag screenshot
We captured and replayed two hours' worth of production database queries from peak hours against an Amazon Aurora cluster spun up from a snapshot.

The production database comparison metrics gave us the further confidence we needed. We locked the dates for launch day and started the prep work. 🛠

Rolling out Amazon Aurora in production

At Loom, we take downtime and maintenance windows seriously, so we came up with a way to keep essential product flows, like viewing looms, uninterrupted during the entire downtime period. We decided to go into a partial downtime for a few hours early on a Saturday morning, when our traffic volumes are relatively low. We achieved this by putting the database into read-only mode and having the application fail open on errors from any writes the database blocked, which made us feel more comfortable about the quality of the end-user downtime experience as we went into the maintenance window.
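
To illustrate the fail-open behavior, here is a minimal sketch that treats Postgres's read_only_sql_transaction error (SQLSTATE 25006) as a signal to degrade gracefully instead of failing the request. The wrapped write and the degradation path are hypothetical.

```typescript
// Minimal fail-open sketch for a partial-downtime window: writes that the
// read-only database rejects are swallowed instead of failing the request.
// The wrapped write is hypothetical.
import { Pool, DatabaseError } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// SQLSTATE raised by Postgres when a write hits a read-only transaction.
const READ_ONLY_SQL_TRANSACTION = "25006";

export async function recordView(videoId: string, viewerId: string): Promise<void> {
  try {
    await pool.query(
      "INSERT INTO video_views (video_id, viewer_id) VALUES ($1, $2)",
      [videoId, viewerId],
    );
  } catch (err) {
    if (err instanceof DatabaseError && err.code === READ_ONLY_SQL_TRANSACTION) {
      // Fail open: viewing looms keeps working and the non-essential write is
      // dropped (or could be queued for later) during the maintenance window.
      console.warn("write skipped during read-only maintenance window");
      return;
    }
    throw err;
  }
}
```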

The rollout here didn't look much different from the one on staging. It was quite boring (just as we intended), but fellow Loomates in the Zoom hangout, including our CTO, made it a rather memorable event as we cracked jokes and traded war stories. 😁

Early numbers after we rolled out Amazon Aurora in production.

Next steps: Long-term application strategies for our data access layer as we enter a new chapter of growth

Although Aurora has greatly contributed to the decrease in our replication lag, we continue to invest in our underlying data stores as we expect a continued and significant increase in our usage throughout 2021. 

We are exploring and implementing various data layer and caching strategies. For instance:

  • We have built a way to direct individual queries, natively and on demand, to the primary instance in the database cluster (rather than to replica instances) to sidestep cases where replication lag breaches the defined SLO (see the sketch after this list).

  • To scale reads on our highest-volume endpoints, we are experimenting with flat model caching, where we read from dedicated cache clusters that store the objects from various models (aka tables).
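
As a small illustration of the first item, here is a hypothetical sketch of a shared query helper with an on-demand primary override; the pool wiring and the usePrimary option name are assumptions, not our actual data access layer.

```typescript
// Minimal sketch of an on-demand primary override in a shared query helper.
// Callers that cannot tolerate replica lag pass { usePrimary: true }.
// Pool wiring and the option name are illustrative only.
import { Pool, QueryResult } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL });

interface QueryOptions {
  usePrimary?: boolean; // escape hatch for lag-sensitive reads
}

export function query(
  text: string,
  params: unknown[] = [],
  options: QueryOptions = {},
): Promise<QueryResult> {
  const pool = options.usePrimary ? primary : replica;
  return pool.query(text, params);
}

// Example: a read issued right after a write opts into the primary.
// const { rows } = await query("SELECT * FROM videos WHERE id = $1", [id], { usePrimary: true });
```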

The entire project was a collaborative effort from several talented engineers across Loom. Thanks to our guiding principles, things are working well so far, and we are quite happy with the outcome and the approach we took.

However, this is only the beginning. We continue to invest further into our data access and caching layers, ensuring these utilities are built on a strong foundation with established patterns for reads and writes as we enter this new chapter of growth.

We’re hiring! Check out our current Engineering positions.