Migrating Millions of Requests Without Anyone Noticing

Large-scale migrations are terrifying. You're essentially rebuilding the plane while it's flying, except the plane handles millions of requests per day and if you break it, real people can't book their vacations. No pressure.

Over the past year and a half, I've been working on migrating our lodging content service from a legacy monolith to a modern distributed architecture. This post is about the playbook we used to do it without anyone noticing. By "anyone" I mean customers. Our ops team definitely noticed. They were very involved.

The Setup

Our legacy service, let's call it LSCS (Lodging Content Service), had been running for years. It worked. It was battle-tested. It also had about a decade of accumulated technical debt, was hard to change, and didn't fit well with our newer architecture.

The new system, Product Entity, was built around modern principles: smaller services, better separation of concerns, gRPC for internal communication. The problem was getting there without breaking anything.

Here's what made this migration particularly tricky:

The Playbook

Phase 1: Build the bridge first

Before we could migrate anything, we needed a way to run both systems simultaneously. We implemented what we called "shadow mode": the new system would receive the same requests as the old one and process them, but its results were never returned to customers. Instead, we compared its responses against the legacy responses and logged the differences.

This sounds simple but took real effort. You need infrastructure for:

We caught a lot of bugs this way. The new system was returning slightly different data in edge cases. Some of these were bugs in the new code. Some were actually bugs in the old code that we'd been living with for years. We fixed both.
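To make the shadow mode idea concrete, here's a minimal sketch of the request path. It's illustrative, not our production code: `handle_with_shadow`, `diff_fields`, and the dict-shaped responses are names and shapes I made up for this post, and a real setup would run the shadow call off the request path rather than inline.

```python
import logging
from typing import Callable

logger = logging.getLogger("shadow")

Handler = Callable[[dict], dict]


def diff_fields(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every field that differs."""
    keys = old.keys() | new.keys()
    return {k: (old.get(k), new.get(k)) for k in sorted(keys) if old.get(k) != new.get(k)}


def handle_with_shadow(request: dict, legacy_handler: Handler, new_handler: Handler) -> dict:
    """Serve the legacy response; run the new system only to compare results."""
    legacy_response = legacy_handler(request)
    try:
        shadow_response = new_handler(request)
        differences = diff_fields(legacy_response, shadow_response)
        if differences:
            logger.warning("shadow diff for %s: %s", request.get("property_id"), differences)
    except Exception:
        # A failure in the new system must never affect the customer.
        logger.exception("shadow call failed for %s", request.get("property_id"))
    return legacy_response
```

The important property is the last line: whatever the new system does, including throwing, the customer only ever sees the legacy response.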

Phase 2: Feature gates everywhere

Every piece of new functionality was behind a feature gate. This let us:

Feature gates added complexity to the code, but they were absolutely worth it. When you're dealing with millions of requests, "we'll just roll back the deploy" isn't good enough. You need finer control.
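For a sense of what "finer control" can look like, here's a rough sketch of a percentage-based gate. This is an assumption about one common way to build such a gate, not a description of our actual gating system; `bucket`, `gate_enabled`, and the gate name are hypothetical.

```python
import hashlib


def bucket(key: str, gate_name: str) -> int:
    """Map a key to a stable bucket in [0, 100) for a given gate."""
    digest = hashlib.sha256(f"{gate_name}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def gate_enabled(gate_name: str, key: str, rollout_percent: int) -> bool:
    """True if this key falls inside the gate's current rollout percentage."""
    return bucket(key, gate_name) < rollout_percent


# Example: decide per property whether to use the new code path.
use_new_path = gate_enabled("new_content_path", "property-123", rollout_percent=10)
```

Hashing the key instead of rolling a random number means the same property always lands in the same bucket, so raising the percentage only adds traffic to the new path and setting it to 0 turns the path off without a deploy.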

Phase 3: Traffic ramp-up

Once shadow mode showed the new system was producing correct results, we started actually using it. But not all at once.

Our ramp-up schedule looked roughly like this:

  1. 1% of traffic to new system, with automatic rollback if error rate spikes
  2. 5%, then 10%, monitoring for latency regression
  3. 25%, then 50%, watching business metrics (conversion rates, etc.)
  4. 75%, then 100%

At each stage, we waited until metrics stabilized before moving forward. Some stages took days. Some took weeks. You can't rush this part.
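The schedule above can be sketched as a tiny state machine. The stage percentages come from the list; everything else is an assumption for illustration, including the `RampController` name, the 1% error-rate threshold, and the choice to roll all the way back to 0% on a spike.

```python
from dataclasses import dataclass

# Percent of traffic sent to the new system at each stage.
STAGES = [1, 5, 10, 25, 50, 75, 100]


@dataclass
class RampController:
    max_error_rate: float = 0.01  # assumed threshold, tune for your service
    stage_index: int = -1         # -1 means 0% of traffic on the new system

    @property
    def percent(self) -> int:
        return 0 if self.stage_index < 0 else STAGES[self.stage_index]

    def observe(self, error_rate: float, metrics_stable: bool) -> int:
        """Roll back on an error spike, advance only when metrics are stable."""
        if error_rate > self.max_error_rate:
            self.stage_index = -1
        elif metrics_stable and self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
        return self.percent
```

In practice `metrics_stable` hides most of the work: that's where the latency and business-metric checks for each stage would live, and a human usually makes that call for the later stages.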

Phase 4: The chicken-and-egg problem

Here's where things got interesting. The legacy system had some data that the new system needed, but we couldn't get that data without the legacy system. Classic chicken-and-egg.

Specifically, we needed "controlled offer IDs" that LSCS generated. These IDs were used by downstream services. If we deprecated LSCS, we'd lose the ability to generate them. But we needed to deprecate LSCS to complete the migration.

The solution was finding another service (LCD, ListDescendantIdentifierMappings) that could provide the same IDs through a different path. This involved:

Fun fact: during validation, we discovered that LCD's data was actually more accurate than LSCS's in some cases. We found 24 vacation rental properties with incorrect inventory IDs in the legacy system. The migration made things better, not just equivalent.
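The validation step boils down to comparing two ID mappings. Here's a minimal sketch, assuming each source can be dumped into a dict from property ID to offer ID; the function name, the dict shape, and the sample IDs are hypothetical.

```python
from typing import Dict, List


def compare_id_mappings(legacy: Dict[str, str], candidate: Dict[str, str]) -> Dict[str, List[str]]:
    """Bucket every property ID by how the two sources disagree about it."""
    report: Dict[str, List[str]] = {
        "missing_in_candidate": [],
        "missing_in_legacy": [],
        "mismatched": [],
    }
    for property_id in sorted(legacy.keys() | candidate.keys()):
        if property_id not in candidate:
            report["missing_in_candidate"].append(property_id)
        elif property_id not in legacy:
            report["missing_in_legacy"].append(property_id)
        elif legacy[property_id] != candidate[property_id]:
            report["mismatched"].append(property_id)
    return report


report = compare_id_mappings(
    {"p1": "offer-1", "p2": "offer-2", "p3": "offer-3"},
    {"p1": "offer-1", "p2": "offer-9", "p4": "offer-4"},
)
```

Note that a mismatch only tells you the sources disagree, not which one is right. Every entry in `mismatched` still needs someone to check it against ground truth, which is how the incorrect legacy IDs turned up.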

What We Monitored

Monitoring was critical. Here's what we watched:

Technical metrics

Business metrics

Diff metrics

Things That Went Wrong

Not everything was smooth. Some notable incidents:

The latency spike: When we hit 25% traffic, we saw latency increase. Turned out the new service was making more downstream calls than expected. We had to optimize the batch endpoint to fetch property and host data in parallel.
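The shape of that fix, roughly, using `asyncio`. This is a sketch under assumptions: `fetch_property` and `fetch_host` are stand-ins for the real downstream calls (simulated here with a short sleep), and the response fields are invented.

```python
import asyncio
from typing import List


async def fetch_property(property_id: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for a downstream network call
    return {"property_id": property_id, "name": f"Property {property_id}"}


async def fetch_host(property_id: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for a downstream network call
    return {"host": f"Host of {property_id}"}


async def fetch_listing(property_id: str) -> dict:
    # Issue both downstream calls at once instead of one after the other.
    prop, host = await asyncio.gather(fetch_property(property_id), fetch_host(property_id))
    return {**prop, **host}


async def fetch_batch(property_ids: List[str]) -> List[dict]:
    # Fetch every listing in the batch concurrently; results keep input order.
    return list(await asyncio.gather(*(fetch_listing(pid) for pid in property_ids)))


listings = asyncio.run(fetch_batch(["p1", "p2"]))
```

With sequential calls the latency of a listing is the sum of the two downstream calls; with `gather` it's roughly the slower of the two. A real batch endpoint would also want a concurrency limit so a large batch doesn't flood the downstream services.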

The logging bill: Our shadow mode logging was more verbose than anticipated. We were logging full responses for comparison, which added up fast. Eventually we had to be smarter about what we logged, which led to a broader effort to optimize our Splunk costs. (That's a whole other story, one that ended up saving $114k annually.)
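One way to be smarter about it, as a sketch: log only which fields differed plus a short digest of each response, and log nothing when the responses match. I'm not claiming this is exactly what we shipped; `summarize_diff` and `short_digest` are hypothetical names, and it assumes responses are JSON-serializable dicts.

```python
import hashlib
import json
from typing import Optional


def short_digest(payload: dict) -> str:
    """Stable 12-character fingerprint of a JSON-serializable response."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def summarize_diff(request_id: str, legacy: dict, shadow: dict) -> Optional[dict]:
    """Return a compact log record for a mismatch, or None if responses agree."""
    changed = sorted(k for k in legacy.keys() | shadow.keys() if legacy.get(k) != shadow.get(k))
    if not changed:
        return None
    return {
        "request_id": request_id,
        "changed_fields": changed,
        "legacy_digest": short_digest(legacy),
        "shadow_digest": short_digest(shadow),
    }
```

The record stays the same size no matter how large the responses are, and the field names are usually enough to group mismatches by cause. You can still log full payloads for a small sample when you need to dig into a specific diff.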

The timezone bug: Dates were being interpreted differently between the two systems. This only showed up for properties in certain timezones and only for bookings near midnight. Classic edge case that you only find at scale.
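Here's a small illustration of this class of bug, not a reproduction of ours: the same instant falls on different calendar dates depending on which timezone you read it in. The `local_date` helper, the sample timestamp, and the fixed UTC offsets are all made up for the example.

```python
from datetime import datetime, timedelta, timezone


def local_date(instant: datetime, utc_offset_hours: float) -> str:
    """Calendar date of a timezone-aware instant at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return instant.astimezone(tz).date().isoformat()


# A booking made half an hour before midnight UTC.
booking = datetime(2024, 3, 10, 23, 30, tzinfo=timezone.utc)

date_in_utc = local_date(booking, 0)        # "2024-03-10"
date_at_plus_two = local_date(booking, 2)   # "2024-03-11"
```

If one system derives the date in UTC and the other in the property's local time, they agree for most of the day and disagree only in the window around midnight, which is why this kind of mismatch is so rare in test data and so reliable at scale.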

Lessons Learned

Start with observability

Before you migrate anything, make sure you can see what's happening. We spent significant effort on dashboards and alerting before writing any migration code. This paid off many times over.

Small batches beat big bang

Every time we tried to do too much at once, we regretted it. Small, incremental changes are easier to debug, easier to roll back, and less scary for everyone involved.

Business metrics matter more than technical metrics

At the end of the day, a p99 latency that's 5ms faster doesn't matter if conversions are down. Keep business metrics in view throughout the migration.

Documentation is not optional

We documented every decision, every rollback, every bug we found. This was invaluable when onboarding new team members mid-migration and when we needed to explain decisions to stakeholders.

Celebrate milestones

Migrations are long. It's easy to lose morale when you're months in and still not done. We made a point of acknowledging when we hit 25%, 50%, and 75%. Small celebrations keep the team motivated.

The Result

We completed the migration. The old service is deprecated. The new architecture is running in production, handling millions of requests daily. Customers didn't notice that anything had changed, which was exactly the goal.

The new system is easier to maintain, easier to extend, and fits better with our overall architecture. Was it worth the effort? I think so. But ask me again in a year when we've had time to actually build new features on the improved foundation.

If you're about to embark on a large migration: good luck. It's hard, but it's doable. Take it slow, measure everything, and don't skip the shadow mode phase. Your future self will thank you.