May 2024

Migrating Millions of Requests Without Anyone Noticing

Distributed Systems, Migrations, Engineering

Large-scale migrations are terrifying. You're essentially rebuilding the plane while it's flying, except the plane handles millions of requests per day and if you break it, real people can't book their vacations. No pressure.

Over the past year and a half, I've been working on migrating our lodging content service from a legacy monolith to a modern distributed architecture. This post is about the playbook we used to do it without anyone noticing. By "anyone" I mean customers. Our ops team definitely noticed. They were very involved.

The Setup

Our legacy service, let's call it LSCS (Lodging Content Service), had been running for years. It worked. It was battle-tested. It also had about a decade of accumulated technical debt, was hard to change, and didn't fit well with our newer architecture.

The new system, Product Entity, was built around modern principles: smaller services, better separation of concerns, gRPC for internal communication. The problem was getting there without breaking anything.

Here's what made this migration particularly tricky:

Millions of daily requests across Expedia.com, Hotels.com, and Vrbo
Multiple downstream services depending on our data
Complex business logic that had evolved over years
Data that needed to stay consistent during the transition

The Playbook

Phase 1: Build the bridge first

Before we could migrate anything, we needed a way to run both systems simultaneously. We implemented what we called "shadow mode": the new system received the same requests as the old one and processed them, but its results were never returned to customers. Instead, we compared its responses with the legacy responses and logged the differences.

This sounds simple but took real effort. You need infrastructure for:

Duplicating traffic without adding latency to the main path
Comparing responses at scale (millions of comparisons per day)
Categorizing differences (bug vs. expected difference vs. data issue)
Dashboards to track progress

We caught a lot of bugs this way. The new system was returning slightly different data in edge cases. Some of these were bugs in the new code. Some were actually bugs in the old code that we'd been living with for years. We fixed both.
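To make the shadow-mode idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the production implementation: `ShadowDispatcher`, its `diffs` list (a stand-in for a real diff log), and the one-second timeout are all hypothetical. The point it shows is that the customer response comes from the legacy handler only, while the call to the new system runs as a background task and can never fail or slow down the main path.

```python
import asyncio
from typing import Any, Awaitable, Callable

Handler = Callable[[dict], Awaitable[Any]]


class ShadowDispatcher:
    """Serve from the primary handler and mirror each request to a shadow handler."""

    def __init__(self, primary: Handler, shadow: Handler, timeout_s: float = 1.0):
        self.primary = primary
        self.shadow = shadow
        self.timeout_s = timeout_s          # assumed budget for the shadow call
        self.diffs: list = []               # stand-in for a real diff log
        self._tasks: set = set()            # keeps background tasks alive

    async def handle(self, request: dict) -> Any:
        response = await self.primary(request)
        # Fire and forget: the caller never waits on the shadow call.
        task = asyncio.create_task(self._compare(request, response))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return response

    async def _compare(self, request: dict, expected: Any) -> None:
        try:
            # Pass a copy so the shadow handler cannot mutate the live request.
            actual = await asyncio.wait_for(self.shadow(dict(request)), self.timeout_s)
        except Exception as exc:
            # Shadow failures are recorded, never propagated to the customer.
            self.diffs.append({"request": request, "error": repr(exc)})
            return
        if actual != expected:
            self.diffs.append({"request": request, "old": expected, "new": actual})

    async def drain(self) -> None:
        """Wait for in-flight comparisons (useful in tests and on shutdown)."""
        if self._tasks:
            await asyncio.gather(*list(self._tasks))
```

In a real deployment the mirroring might live in a proxy or service mesh rather than in application code; the in-process version is just the shortest way to show the shape of it.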

Phase 2: Feature gates everywhere

Every piece of new functionality was behind a feature gate. This let us:

Enable features for specific properties first (start with low-traffic ones)
Roll back instantly if something went wrong
Run A/B tests between old and new implementations
Enable features gradually across different brands

Feature gates added complexity to the code, but they were absolutely worth it. When you're dealing with millions of requests, "we'll just roll back the deploy" isn't good enough. You need finer control.
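A feature gate with that kind of control can be surprisingly small. The sketch below is hypothetical (the `FeatureGate` class, its field names, and the hashing scheme are assumptions for illustration, not the gating system we used), but it covers the behaviors listed above: a kill switch for instant rollback, an allowlist for specific properties, and a per-brand percentage.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FeatureGate:
    name: str
    enabled: bool = True                                    # global kill switch
    property_allowlist: set = field(default_factory=set)    # hand-picked properties
    brand_percent: dict = field(default_factory=dict)       # brand -> 0..100

    def _bucket(self, property_id: str) -> int:
        # Stable bucket in 0..99, so a property stays on the same side of the gate.
        digest = hashlib.sha256(f"{self.name}:{property_id}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % 100

    def is_enabled(self, property_id: str, brand: str) -> bool:
        if not self.enabled:
            return False                                    # rollback without a deploy
        if property_id in self.property_allowlist:
            return True
        # Brands that are not configured default to 0%, i.e. the old path.
        return self._bucket(property_id) < self.brand_percent.get(brand, 0)
```

Including the gate name in the hash means different features get different buckets for the same property, so the same low-numbered properties are not always the first to receive every change.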

Phase 3: Traffic ramp-up

Once shadow mode showed the new system was producing correct results, we started actually using it. But not all at once.

Our ramp-up schedule looked roughly like this:

1% of traffic to new system, with automatic rollback if error rate spikes
5%, then 10%, monitoring for latency regression
25%, then 50%, watching business metrics (conversion rates, etc.)
75%, then 100%

At each stage, we waited until metrics stabilized before moving forward. Some stages took days. Some took weeks. You can't rush this part.
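Here is a rough sketch of how a ramp like this can be wired up. The stage percentages come from the schedule above; everything else (the `RampController` name, the 1% error threshold, the window and minimum sample sizes) is an assumption chosen for illustration. Routing is deterministic per request key, and a spike in the observed error rate drops the new system back to 0% without waiting for a human.

```python
import hashlib
from collections import deque

RAMP_STAGES = [1, 5, 10, 25, 50, 75, 100]   # percentages from the schedule above


def routes_to_new(request_key: str, percent: int) -> bool:
    """Deterministically send `percent`% of keys to the new system."""
    digest = hashlib.sha256(request_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


class RampController:
    def __init__(self, max_error_rate: float = 0.01, window: int = 1000,
                 min_samples: int = 100):
        self.percent = RAMP_STAGES[0]
        self.max_error_rate = max_error_rate      # assumed rollback threshold
        self.min_samples = min_samples            # avoid reacting to a tiny sample
        self.outcomes = deque(maxlen=window)      # True means the request failed

    def record(self, is_error: bool) -> None:
        """Record one new-system outcome and roll back if errors spike."""
        self.outcomes.append(is_error)
        if len(self.outcomes) >= self.min_samples:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate > self.max_error_rate:
                self.rollback()

    def rollback(self) -> None:
        self.percent = 0                          # all traffic back to the old system
        self.outcomes.clear()

    def advance(self) -> int:
        """Move to the next stage; called by a person once metrics have stabilized."""
        later = [p for p in RAMP_STAGES if p > self.percent]
        if later:
            self.percent = later[0]
            self.outcomes.clear()
        return self.percent
```

Because the bucket for a key never changes, anything routed to the new system at 10% is still routed there at 25%, which keeps the experience consistent for a given key as the ramp progresses. Advancing is deliberately manual in this sketch, since the decision to move forward depended on business metrics as much as on error rates.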

Phase 4: The chicken-and-egg problem

Here's where things got interesting. The legacy system had some data that the new system needed, but we couldn't get that data without the legacy system. Classic chicken-and-egg.

Specifically, we needed "controlled offer IDs" that LSCS generated. These IDs were used by downstream services. If we deprecated LSCS, we'd lose the ability to generate them. But we needed to deprecate LSCS to complete the migration.

The solution was finding another service (LCD, ListDescendantIdentifierMappings) that could provide the same IDs through a different path. This involved:

Discovering the service existed (not obvious in a large organization)
Understanding its API and data model
Building a gRPC integration
Validating the data matched what LSCS provided

Fun fact: during validation, we discovered that LCD's data was actually more accurate than LSCS in some cases. We found 24 vacation rental properties with incorrect inventory IDs in the legacy system. The migration made things better, not just equivalent.
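The validation step boils down to comparing two ID mappings and sorting the results into buckets. The following sketch assumes both sources can be exported as a simple property-ID-to-offer-ID dictionary; that shape, along with the `MappingReport` and `compare_id_mappings` names, is hypothetical and not the actual LSCS or LCD data model.

```python
from dataclasses import dataclass, field


@dataclass
class MappingReport:
    matched: int = 0
    mismatched: dict = field(default_factory=dict)      # property -> (old_id, new_id)
    missing_in_new: list = field(default_factory=list)
    missing_in_old: list = field(default_factory=list)


def compare_id_mappings(legacy: dict, replacement: dict) -> MappingReport:
    """Compare property -> ID mappings from the old and new sources."""
    report = MappingReport()
    for property_id in sorted(legacy.keys() | replacement.keys()):
        old_id = legacy.get(property_id)
        new_id = replacement.get(property_id)
        if old_id is None:
            report.missing_in_old.append(property_id)
        elif new_id is None:
            report.missing_in_new.append(property_id)
        elif old_id == new_id:
            report.matched += 1
        else:
            # A mismatch does not say which side is right; it needs investigation.
            report.mismatched[property_id] = (old_id, new_id)
    return report
```

Note that the report is neutral about which system is correct. That matters, because some mismatches turned out to be errors in the legacy data rather than in the replacement.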

What We Monitored

Monitoring was critical. Here's what we watched:

Technical metrics

Latency (p50, p95, p99) for both systems
Error rates by error type
Request volume (make sure traffic is actually flowing)
Cache hit rates
Downstream service health
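For the latency comparison, a sketch like the one below is enough to show the idea. It assumes raw latency samples are available in memory and uses the nearest-rank method for percentiles; production metrics systems typically compute these from histograms instead. The function names and the 10% tolerance are assumptions for illustration.

```python
def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def latency_summary(samples_ms: list) -> dict:
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


def regressions(old: dict, new: dict, tolerance: float = 1.10) -> list:
    """Names of percentiles where the new system is more than 10% slower."""
    return [name for name in old if new[name] > old[name] * tolerance]
```

Comparing the two summaries side by side, rather than looking at the new system in isolation, is what makes a regression at a single percentile stand out.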

Business metrics

Conversion rates (are people still booking?)
Search result quality
Page load times
Customer support tickets related to content

Diff metrics

Response differences between old and new systems
Categories of differences (known vs. unknown)
Trend over time (differences should decrease as we fix bugs)
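A diff pipeline along these lines can be sketched in a few functions: find the field paths where two responses differ, then label each path using a table of differences that are already understood. The `KNOWN_DIFFS` entries and category names below are made-up examples, and the sketch assumes JSON-like responses with string keys.

```python
from collections import Counter
from typing import Any

# Hypothetical rules: field paths whose differences are already understood.
KNOWN_DIFFS = {
    "updated_at": "expected",            # e.g. a timestamp generated per system
    "amenities.legacy_code": "data_issue",
}


def diff_fields(old: Any, new: Any, path: str = "") -> list:
    """Return dotted paths at which two JSON-like values differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        paths = []
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}" if path else key
            if key not in old or key not in new:
                paths.append(child)      # field present on one side only
            else:
                paths.extend(diff_fields(old[key], new[key], child))
        return paths
    return [] if old == new else [path or "<root>"]


def categorize(paths: list) -> Counter:
    """Count differing paths per category; anything unlisted is 'unknown'."""
    return Counter(KNOWN_DIFFS.get(p, "unknown") for p in paths)
```

Charting the "unknown" count over time gives the trend line described above: it should shrink as bugs get fixed and as understood differences are added to the table.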

Things That Went Wrong

Not everything was smooth. Some notable incidents:

The latency spike: When we hit 25% traffic, we saw latency increase. Turned out the new service was making more downstream calls than expected. We had to optimize the batch endpoint to fetch property and host data in parallel.
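The shape of that fix looks roughly like this. The two `fetch_*` functions are stand-ins that sleep instead of calling real downstream services, and the function names and concurrency limit are assumptions. What the sketch shows is the pattern: issue the property and host calls concurrently for each item, and cap how many items are in flight so the batch endpoint does not overwhelm its dependencies.

```python
import asyncio


async def fetch_property(property_id: str) -> dict:
    await asyncio.sleep(0.05)            # stand-in for a downstream call
    return {"property_id": property_id, "name": f"Property {property_id}"}


async def fetch_host(property_id: str) -> dict:
    await asyncio.sleep(0.05)            # stand-in for a downstream call
    return {"property_id": property_id, "host": f"Host of {property_id}"}


async def fetch_content(property_id: str) -> dict:
    # Both calls run concurrently, so the wait is the slower call, not the sum.
    prop, host = await asyncio.gather(
        fetch_property(property_id), fetch_host(property_id)
    )
    return {**prop, **host}


async def fetch_batch(property_ids: list, max_concurrency: int = 10) -> list:
    semaphore = asyncio.Semaphore(max_concurrency)   # assumed limit

    async def bounded(property_id: str) -> dict:
        async with semaphore:
            return await fetch_content(property_id)

    # gather preserves input order, so results line up with property_ids.
    return await asyncio.gather(*(bounded(p) for p in property_ids))
```

Calling `asyncio.run(fetch_batch(["a", "b", "c"]))` returns one merged dictionary per property, in the same order as the input.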

The logging bill: Our shadow mode logging was more verbose than anticipated. We were logging full responses for comparison, which added up fast. Eventually we had to be smarter about what we logged, which led to a broader effort to optimize our Splunk costs. (That's a whole other story, and that effort ended up saving $114k annually.)
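One way to be smarter about comparison logging, sketched here as an illustration rather than a description of what we shipped: log nothing when the responses match, log only short fingerprints when they differ, and keep full payloads for a small deterministic sample. The function names and the 1% sample rate are assumptions.

```python
import hashlib
import json
from typing import Any, Optional


def fingerprint(payload: Any) -> str:
    """Short, stable hash of a JSON-serializable payload (key order ignored)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def should_keep_payload(request_id: str, sample_percent: int) -> bool:
    """Deterministically pick `sample_percent`% of requests for full logging."""
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100 < sample_percent


def build_log_record(request_id: str, old: Any, new: Any,
                     sample_percent: int = 1) -> Optional[dict]:
    old_fp, new_fp = fingerprint(old), fingerprint(new)
    if old_fp == new_fp:
        return None                      # identical: bump a counter, log nothing
    record = {"request_id": request_id, "old_fp": old_fp, "new_fp": new_fp}
    if should_keep_payload(request_id, sample_percent):
        record["old"], record["new"] = old, new      # full bodies for the sample
    return record
```

The fingerprints still let you group identical differences together, so you can see how widespread a problem is without paying to store every response body.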

The timezone bug: Dates were being interpreted differently between the two systems. This only showed up for properties in certain timezones and only for bookings near midnight. Classic edge case that you only find at scale.
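To illustrate this class of bug (not the exact defect, whose details aren't covered here), consider one system reading the calendar date in UTC while the other reads it in the property's local timezone. The function names and the Honolulu example are hypothetical; a fixed UTC-10 offset is used because Hawaii does not observe daylight saving time.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo

UTC = timezone.utc
HONOLULU = timezone(timedelta(hours=-10))    # fixed offset, no DST in Hawaii


def utc_date(ts: datetime) -> date:
    """Buggy variant: takes the calendar date in UTC, ignoring the property."""
    return ts.astimezone(UTC).date()


def local_date(ts: datetime, property_tz: tzinfo) -> date:
    """Takes the calendar date where the property actually is."""
    return ts.astimezone(property_tz).date()


# A booking made at 23:30 local time, half an hour before midnight.
booking = datetime(2024, 5, 1, 23, 30, tzinfo=HONOLULU)

print(utc_date(booking))                 # 2024-05-02
print(local_date(booking, HONOLULU))     # 2024-05-01
```

The same instant lands on two different dates, but only for timestamps within the offset's distance of midnight and only for properties far from UTC. That is why a bug like this can sit unnoticed until enough traffic hits the narrow window.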

Lessons Learned

Start with observability

Before you migrate anything, make sure you can see what's happening. We spent significant effort on dashboards and alerting before writing any migration code. This paid off many times over.

Small batches beat big bang

Every time we tried to do too much at once, we regretted it. Small, incremental changes are easier to debug, easier to roll back, and less scary for everyone involved.

Business metrics matter more than technical metrics

At the end of the day, it doesn't matter if your p99 latency is 5ms faster if conversions are down. Keep business metrics in view throughout the migration.

Documentation is not optional

We documented every decision, every rollback, every bug we found. This was invaluable when onboarding new team members mid-migration and when we needed to explain decisions to stakeholders.

Celebrate milestones

Migrations are long. It's easy to lose morale when you're months in and still not done. We made a point of acknowledging when we hit 25%, 50%, and 75%. Small celebrations keep the team motivated.

The Result

We completed the migration. The old service is deprecated. The new architecture is running in production, handling millions of requests daily. Customers didn't notice anything changed, which was exactly the goal.

The new system is easier to maintain, easier to extend, and fits better with our overall architecture. Was it worth the effort? I think so. But ask me again in a year when we've had time to actually build new features on the improved foundation.

If you're about to embark on a large migration: good luck. It's hard, but it's doable. Take it slow, measure everything, and don't skip the shadow mode phase. Your future self will thank you.