Move fast to unbreak things
TL;DR: We had a major outage. We wanted to increase stability. The obvious way is to increase process & testing, but a better way is to ship more frequently.
The debate: To write tests or not to write tests?
At Remotion, whether or not to write tests is left up to the individual engineer’s discretion. Basically: “Will tests help you ship this feature faster?” There are some obvious places where tests help (e.g. server functions) and obvious places where they’re hard (e.g. integrations). But there’s a wide in-between space, and the team frequently debates the right level of testing.
This resurfaced recently when we had a major user-facing outage. Our unexpected conclusion: instead of increasing stability by writing more tests, increase stability by shipping more frequently.
Our mistake: Shipping multiple changes to a complex system, all at once
We recently prepared for a product launch with large and risky changes. In our rush to release, commits were coming in fast and the main branch was never quite stable enough to deploy. We ended up deploying all the changes at once, under time pressure near the end of our release window.
During testing in our staging environment, we’d noticed some issues: certain app interactions felt slow, and our Slack integration fired some duplicate notifications. After investigation, the issues seemed unrelated to our changes. We chalked them up to temporary Google Cloud Platform (GCP) or Slack server issues, and deployed.
The next morning, we got user reports of “5+ repeated slack messages” and app slowdowns: the same issues we saw but discounted when testing. Our first response was to mitigate the most critical user-facing issue, the repeated Slack notifications. We went for simple and just disabled the feature. To our surprise, disabling Slack notifications also fixed our servers’ slow response times!
Root cause: Adding a retry
Turns out adding a retry to a commonly called function was the root cause. When a user joins a “room” in Remotion, we need to both access and mutate rooms in our database. This often happens in bursts, such as when many people simultaneously join a room for standup. The retry was an attempt to work around the resulting contention issues.
The problem is that this code also calls a Slack API mid-transaction. Every retry repeated that call, and Slack quickly responded with exponential backoff rate limiting. That resulted in failed, slow transactions, which resulted in even more retries. And loop.
Ultimately, the issue was a combination of unexpected behaviors from multiple systems interacting with each other. The underlying architecture was flawed, but it took a small, seemingly unrelated change to break it.
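To make the feedback loop concrete, here is a minimal Python sketch. It is not Remotion's actual code: `FakeDB`, `ContentionError`, `with_retry`, and both `join_room_*` functions are hypothetical names invented for illustration, and the "database" is an in-memory stand-in that simply fails a fixed number of times. The first version calls the notifier inside the retried transaction body, so every failed attempt sends another message. The second shows one common way to avoid that, assuming the notification can wait until the write has committed.

```python
class ContentionError(Exception):
    """Hypothetical stand-in for a database write conflict."""


class FakeDB:
    """In-memory stand-in: the first `conflicts` commits fail, then one succeeds."""

    def __init__(self, conflicts):
        self.conflicts = conflicts
        self.members = []

    def commit_join(self, user):
        if self.conflicts > 0:
            self.conflicts -= 1
            raise ContentionError("room was modified concurrently")
        self.members.append(user)


def with_retry(fn, attempts=5):
    """Re-run fn on contention, giving up after `attempts` tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except ContentionError:
            if attempt == attempts - 1:
                raise


def join_room_flawed(db, user, notify):
    def txn():
        # External side effect inside the transaction body:
        # it is repeated on every retry.
        notify(f"{user} joined")
        db.commit_join(user)

    with_retry(txn)


def join_room_safer(db, user, notify):
    def txn():
        # Only database work is retried.
        db.commit_join(user)

    with_retry(txn)
    # The side effect runs once, after the write has committed.
    notify(f"{user} joined")


sent_flawed = []
join_room_flawed(FakeDB(conflicts=3), "ada", sent_flawed.append)

sent_safer = []
join_room_safer(FakeDB(conflicts=3), "ada", sent_safer.append)

print(len(sent_flawed), len(sent_safer))  # prints "4 1"
```

With three simulated conflicts, the flawed version sends four notifications for a single join while the safer one sends exactly one. In a real system each of those extra calls also counts against the external API's rate limit, which is what turns a handful of retries into a loop.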
Preventing repeat issues without slowing down
In our retro, we discussed what changes we needed to make to prevent this from happening again. The obvious reaction was to write new tests and add more rigor to processes like code review. For this specific issue, though, the tests we’d need would be complex mocks of external systems, which are expensive and difficult to build accurately.
More generally, we weren’t excited about increased testing and code review requirements. Startups win by moving fast, and these options push us away from “Speed” in the classic engineering triangle.
Instead, we aligned on a radical alternative approach: improve reliability by speeding up shipping to users.
Our solution: Ship more to unbreak quickly
1. Make it easy to deploy quickly with automation
Deployments can easily involve painful manual steps, especially if you ship native code on Apple platforms like us. We’ve found investing in automation and simplification to be well worth it. The easier it is to ship, the more it happens.
2. Make it easy to deploy quickly by creating a culture of trust and followup
In the face of mistakes, process builds up like scar tissue. Most of that process is unnecessary—in fact it’s probably demotivating to your strongest performers. Instead of adding process, celebrate mistakes. Make it an opportunity to reinforce the level of trust across the team. And build a culture of following up on releases rapidly in response to metrics or feedback.
3. Ship small pieces instead of large blocks
Shipping a giant project all at once is harder than shipping smaller milestones. We all know it, but projects frequently become monolithic releases despite our best efforts. It happens to us at Remotion all the time! We don’t have any silver bullets for this, but it’s useful to remind ourselves. Plus, frequent deploys make shipping milestones much more rewarding.
4. Write tests when they speed up development
I always tell my team: “Tests are not process. They are a developer tool.”
Thanks for reading
Although we write more tests than this post may lead you to believe, the recent outage was a great opportunity to reaffirm the culture of trust and ownership that we’re building at Remotion. Building it is both a learning process and a work in progress.
I’d love your thoughts and feedback. Just email me at charley at remotion dot com.
The case for virtual coworking: build a connected remote culture.
Regularly coworking with your hybrid or remote team can help you build the social cohesion that makes work feel less like work.
Here are the biggest reasons we think coworking is an effective way to create a close-knit remote culture:
1. It fosters casual conversations.
Building a connected remote culture is all about fostering 1:1 or small group organic conversations. Virtual coworking makes space for those conversations. When you spend time together outside of agenda-driven meetings, spontaneous chats naturally occur, as they would in an office.
2. It's more inclusive than scheduled social events.
It can be draining for introverts to have to participate in scheduled, purely social conversations. Coworking allows the team to spend time together and occasionally chat without having to constantly be "on," making it more inclusive for introverts and extroverts alike.
3. It's easy to say yes to.
Purely social events are important, but if your team is busy or on a tight deadline, it's tough to find the time for social chats without it feeling like an obligation. Coworking is much easier to get your team on board with because it doesn't take time away from getting work done.
4. It improves remote collaboration.
Coworking can lead to unblocking and shorter feedback loops. Quick questions get answered easily and in the moment, without having to schedule a meeting or go back-and-forth in messages.
5. It's scalable.
Coworking works for teams of all sizes and is a great way to scale your remote culture as your team grows. It's helpful to create opportunities for teammates from different functions to get to know one another.
6. It creates shared momentum.
The feeling of togetherness is motivating!
Get started with virtual coworking: choose the type most aligned with your priorities.
It takes intentionality to make virtual coworking feel natural and energizing enough to stick—it's not as simple as leaving a Zoom call open all day.
Here are a few of the ways we've set coworking up for our team. We recommend choosing one to start with. If it works, make it routine and experiment with other types from there.
Try independent coworking.
Try project-based coworking.
Best practices for virtual coworking.
Keep group sizes small.
Limit your coworking sessions to 4-6 people to keep things from getting distracting and help make introverted teammates comfortable chatting.
Signal boost coworking.
Set a norm of letting the entire team know when you're hopping into a coworking room or session.
Make it routine.
Once you've figured out what kind of coworking works for your team, make it a regular, opt-in event. Set up a recurring calendar event to do it at the same time each week to maximize the impact.
Set expectations ahead of time.
When you're first introducing coworking to your team, share what you're imagining in your calendar invite and at the top of each session to get everyone on the same page. For example:
Let's try virtual coworking! We'll work independently on our own projects with our cameras off, but we'll share space and listen to music together — like we might work side-by-side at the office.
Listen to music together.
Play music while you work to create a shared environment and add a little bit of personality to your coworking session.
Set up Coworking Rooms in Remotion.
Most of the above is doable with any video chat app, but much easier with Remotion—which we designed with a lightweight, smooth coworking experience in mind. Easily set up Remotion rooms that your teammates can hop into for different styles of coworking.