No QA Environment? No Problem: How ClassPass Enables Testing on a Single Environment in ECS

January 19, 2026

Transcript

Po Linn Chia: My name is Po. Today I’ll be talking about No QA Environment, No Problem: How ClassPass has Enabled Testing on a Single Environment in ECS. ClassPass, Mindbody, I’m on the ClassPass side of the org. I think we’re now also called Playlist. ClassPass is a subscription-based platform. People subscribe to us, and we help members book fitness and wellness experiences by filling the excess inventory of many thousands, I think 70-plus thousand wellness and fitness partners.

If you want to go work out, you can download the app and try that out. Nobody cares about that, really. You care because you’re an engineer. We have 90-plus engineers. We’re not small, but we’re not huge. Of our 90-plus individual contributors, about a third are backend, about a third are mobile and frontend combined, and we have five whole platform ops engineers. Even though we’re small, we serve a lot of requests: 150,000 requests per minute, about 200 million to 300 million per day, depending on where we are in our traffic pattern. We regularly stair-step 20% more traffic every January, as everybody tries to lose weight on the 1st of January. Then we somehow convince them to keep trying to lose weight, or to go to spas, for the rest of the year. That’s us. You can get a sense of where we are.

I’m going to point out up front that no QA environment is actually quite a lot of problems. This is not a talk about the test pyramid, where we’re injecting different types of integration or unit tests. This is a talk about how we journeyed through the evolution of our test architecture, from a very small team to where we are now. Hopefully some of this is relatable, either because you are going through this yourself, or because you’ve gone through it before and you want to see some MS Paint diagrams of the bad times.

At the same time, you may not be on this scale of architecture: we sit on ECS, we don’t run a Kubernetes cluster. That’s what my talk is about. It’s a story about building both technology and teams, because, as a theme we’ve seen in lots of speakers’ talks, everything is sociotechnical. The social part of the equation is probably harder than the technical part.

Infrastructure

What does our infrastructure look like at ClassPass? We are microservices galore, with a single ECS development environment cluster. Anybody else here run on only a single test cluster? I’m glad I lured a lot of people with far more robust production and staging clusters into this talk. This is us. That’s our challenge. Maybe you aren’t on the infrastructure now, but if you’ve been in the industry a while, have any of you been in the situation where you press a button to deploy? Hopefully, many of us have pressed a button. Some of us have copied and pasted onto production servers. We press a button on ClassPass to deploy up to ECS. We’re not on CI/CD. Then inevitably at some point, this happens. We’ve got an unhappy dev environment, because multiple people are playing tug of war over one of our microservices. Our services control a lot of business logic, but we have a lot of shared business logic.

In our six-plus teams, you can have people contending over the same microservice, or one microservice impacting another microservice’s behavior. You end up with a lot of arguments about, I need this environment. At the end of the day, we have a contention problem. We tried to solve this problem in a variety of ways when it came to testing and making sure that our deployments were reliable and robust. First of all, and again, maybe this is something folks have gone through, we tried to solve it by having really robust integration tests that wouldn’t need the dev cluster. We raised up the entire world using this homegrown CLI program that had a fancy acronym.

When you joined the company, you’re like, what is that? It’s called FIT inside of a fitness company. What’s going on? It looked like this. We had these Jenkins agents that we were self-maintaining, and then the CLI tool that had a mysterious name. It pulled every one of our microservices into a single agent and used some magical homegrown orchestration built on top of Docker that wasn’t quite Docker Compose. It was a little nuts. It was really clever for its time, but by the time we had grown to the scale we are now, we were starting to see 15-to-30-minute runs just to get the framework stood up before you even ran your test.

It pulled every Docker image. Builds were failing just because we were downloading something like 800-plus megabytes every build. The maintainer had left the company. Have you had one of these situations? Nobody knew how to maintain it. We had host:port configurations. We were doing the networking ourselves. We had two PRs for every integration test, one for your service under test, and one where you wrote the service into this magical tool. Then you had to stitch it together. There were YAML files. It was nuts. Our tests were always flaky. We tried to say, let’s test using integration tests on the backend.

Theoretically, our backend developers would never have to deploy into development, and our mobile and frontend engineers would be happy. That was not the case. Our frontend engineers were always upset because, in order to test, we would just go to dev. Someone deploys a tag, and now my end-to-end tests are failing. Our mobile and manual QA engineers were even more upset. Even further down the chain, their web views were failing. Their everything was failing. There was nothing they could do about it, because we backend engineers essentially bullied people and pressed the red button.

Team Structure

That was a huge technical problem of contention. In the meantime, we had slid into this point where our team structure was also no longer viable for what we were trying to do. We had grown, and sometime before we had grown, we’d written this homegrown tool. It worked for a little while. Maintainer left, and now we were all so focused on shipping product. Again, we are growing 30% every year. Product is king, I’m sure. Everyone is familiar. We had no team. We see a lot of DevEx and ops folks talk about empathy and about team topologies with enablement teams that cross both ops and product teams that need a tool. We didn’t have any of those things.

Nobody owned the explosion living on Jenkins. We didn’t have a testing strategy, because the strategy had been written sometime in the year 500 BCE. That was it. There was this wall. How many people have tried to climb this wall, or just avoid the wall? Google Maps is like, just avoid. Ops will be mad at you, or if you’re on ops, you’ll be like, product engineers are speaking in tongues, I only speak Groovy, Python, and YAML. We just had a very standard division of responsibilities. Product people would be like, Docker image pulls took 14 minutes, and then the network timed out. The ops people were like, why are you pulling 1.2 gigabytes? Do you know we pay money for that? It was bad, and so was the contention. For every one team trying to test one thing, we blocked n-1 teams. As you scale, that does not scale: n just gets bigger, and that was bad.

Fixing Things, on the Cheap

What did we try to do? We tried to fix things cheaply, because we will always try to fix things cheaply, especially when your team is lean. This is the beginning of our Pokémon evolution. We’re not quite Charizard yet, we’re moving up. We made some assumptions. We assumed that if we could test the backend well, then, kind of like trickle-down economics, where the rising tide lifts all boats, mobile and frontend would be stable. We’ll always have main deployed, and we can do that by fixing the thing that was on fire. Maybe we can just tune this in-house framework. We already built it, and we did put a lot of effort into it. It had huge suites of integration tests that were valuable to us. We wrote them for a reason. We hoped to get closer to this psychological idea of CD. Maybe, since it was in-house and we knew it, we could do this with a really lean team in a short amount of time, what I call cheap and cheerful. This is what the team looked like. It was two people, three people, looking into this black box. Last commit four years ago.

The little Docker whale on the inside was begging to be unbeached, but its API had changed. Were we going to spend two people’s time rewriting a framework that we knew needed a lot of optimization to work well, and that we knew was flaky because we were spinning up databases? We were mocking data that wasn’t aligned with either production or development; it was going to be flaky anyway. We looked at this and we tried. We tried once. We tried twice. I think we tried three times with what we call pop-up squads. Cheap and cheerful. We’ll steal you from product for a while. The product managers won’t notice. We’ll hide you in the line items. That doesn’t work. In truth, we were also working against an assumption that ended up being totally wrong: engineers deployed to development as part of our process, and for better or for worse, we needed development data.

Data that had already drifted a little bit from production, but was still better than these 5-year-old mock test data generation objects that were being uplifted in our little black box of doom. We abandoned that. Literally, we just turned off integration tests, which is a terrifying thing to say. We said, let’s test it. Is our rollback rate on deploys any higher if we turn off this thing that is adding 30-plus minutes to our build time, clogging up all of our build agents, and reducing developer productivity? We didn’t see an enormous change, which was, I think, very sobering. We realized we can’t sink more cheap and cheerful, not really cheap, not really cheerful, efforts into our test infrastructure. We’ve got to change it, but we also can’t change the whole world at once. We’re not going to wake up and say, we’re CI/CD, everything is great, everything is canary deployed. We can’t change the whole world at once. We’ve got to work with what we had.

What did we do? Did we create an enablement team? No, we tried something else in the evolution, called: if we can’t test it, we’ll monitor it in production. Better to catch it after it’s deployed, on the system with the real data and the real supporting infrastructure, than to test this fake thing that didn’t give us anything. This is how every engineer trying to write tests for this felt. You were instructed to try and write read-only tests, but how did you know a test was going to be read-only? Our production systems were not ready to accept test data. I’m not going to say that you can’t test in production. Plenty of people do, and try to.

The system we would eventually try to adopt that I’ll be talking about is inspired by Uber, and they do things like that. They push code into production, but they silo it. They’re ready for it. We were not ready for it. It was very scary. Where we didn’t have tests, we tried to backfill with monitors. Monitors just tell you things are already on fire. It’s not testing. It just wasn’t viable. It didn’t fix things for backend engineers, frontend engineers, mobile engineers, or our QA engineers. Everything costs money in the end. We were trying to save money, but at the end of the day, it’s the classic Catch-22. Investing in testing infra is expensive, but not having tests is extremely expensive. As you can expect, we were not catching bugs early enough.

How To Serve Everybody

How do we serve everybody? This is, again, secretly a sociotechnical talk. Unfortunately, part of the solution for us engineers, especially as senior-plus engineers, is that you can’t just sit down and write code. We had to report up and come to an agreement. Eventually our leadership decided, you’re right, we need to shift left with testing, invest in automated systems, and catch things earlier, otherwise that classic cost-of-a-bug chart is going to come for us in our bottom line, and we need to prioritize tooling and coordination. What I’m going to go through is how we did this with a team that is still very small. We couldn’t quite get a huge amount of headcount, but there’s a difference between putting a couple of somewhat random people on a project and having the mindset of: we have an enablement team, and maybe they’re a little bit of a tiger team. They’re very small, but they’re important. We will clear the way for them to do the work that they need to do, and we’re going to think about what their work is going to look like six weeks from now, one year from now, two years from now.

All of the iteration that we did was expensive, but at least we got a little bit of return on investment, which is that, on the sociotechnical front, implementing solutions is easy. I’m going to show you some slides, and you all are going to be like, great, git pull Traefik, done. The hard part is figuring out the unspoken criteria for your organization, and that is what those iterations actually gave us. The real ROI was that we knew we had constraints on budget and headcount.

We knew that the system couldn’t be too complicated because we don’t have this dedicated DevEx enablement team to maintain it and keep it alive. We knew that we had to have that bridge between our platform ops team and our product teams because one of them writes the tests, and one of them keeps the infrastructure that the tests run on alive. We have to work with what we’ve got for the moment. Nature doesn’t move in jumps. We can’t force people into CD and magic testing. For now, if manual big red button deploys are the way, maybe we need to work with that in order to stabilize the tests we do have, like our end-to-end tests or our regression tests. This is not a talk about types of testing. The test pyramid has been written about a lot, and modern versions for 2025 are coming out.

Technical Solution

How did we solve this problem technically? One environment, but we want multiple versions on it, and we need to work with our constraints, which are that the ECS infra we have and the application load balancers we were using were not very conducive to us doing this. It’s always DNS. Who here uses DNS, not for service discovery, but just to access your services? I know a lot of us are like, yes, we’re totally going to install that service catalog and we’re going to talk to Consul. No, but the easiest thing is DNS, and then the hardest thing is DNS. For us, it was literally the case. We have hosts just hardcoded into every single one of our 80-plus microservices to talk to the 79 other microservices.

Then it hits this application load balancer, and then sometimes it just goes to an nginx for fun, and then it goes back to the application load balancer, and then it finally goes to the other service. Every container ECS service was sitting behind this DNS, and how do we deploy multiple versions of a service? It’s going to look crazy. It’s going to look like this, like Route 53 is going to look like Octopus 53. The load balancer doesn’t like having a million target groups.

In fact, on Amazon, you can only have 100. Since we have both HTTP and gRPC traffic that need separate rules and target groups, we were easily overloading our load balancer. In IaC, it’s not easy to stand up load balancers on demand, or worse, destroy them. There’s all this overload, and this is us fighting our product instead of our product enabling us. If you can’t beat DNS and you can’t beat the LB, what do we do? We just go around. First thing, we want to deploy multiple versions. How do we do that? We want to be able to deploy them via CI so that we can run tests against them, integration tests, all in code. We also want to start sneaking in this concept of CD, where stable main is always deployed somewhere, and for us, that somewhere is the shadow realm. We also want to make sure that it gets cleaned up so that FinOps doesn’t knock on our door and charge us money.

This is what our eventual end state was going to look like. For now, we were letting people keep their big red button so you can deploy whatever you want to development-SERVICE.classpass.com. When you hit that host, that’s where you go. Behind it, you would have a shadow main, which was always the current main version. Then somewhere behind that, if you want, you can stand up any feature branch you want in CI or even manually.

If you needed to test something really intense, you could spin up a separate ephemeral container and target that. How do we do that? We do that with what we call dynamic routing. Really, it’s simple. You install a reverse proxy. We use Traefik, but it could be anything. I’m sure for any K8s folks, you have ingress controllers built in a bit more by default, but we install Traefik, which is a reverse proxy. Traefik is capable of picking up routing rules from labels inside your task definition, which are very easy to append using IaC. We just take your CI deploy. We see a tag. We’re like, great, dump that on the Docker label and ship that up to ECS. Traefik has access to the ECS catalog and your task definition, and it says, you want a version? We’ll give you a version. We still have a bit of a tangle, but instead of this LB that could only take 100 routing rules, we now have Traefik, which, as far as the documentation says, and as we found, has unlimited routing rules and backends, and everything just flows through.

Here’s an example of what our Traefik dashboard might look like. We have a service called availability, and we put on these header rules which say, if there is a header called baggage, and if the baggage has content that says dynamic_route=feature-something, let’s go to that ephemeral container, feature-811. If it has a dynamic route of shadow, then send it to that shadow main. After all those rules are evaluated, at the very bottom, if it’s just the DNS that we’ve always been familiar with, no baggage sitting on the request, just go to wherever the big red button says to go. We are not disrupting the current developer experience, but we’re providing them with this shadow realm that will hopefully always be stable, and the ability to ephemerally spin up things in their PRs and on demand to do whatever testing they want.
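As a rough illustration of how rules like that can be expressed (this is a hypothetical sketch in Traefik v2-style label syntax, not ClassPass’s actual configuration; the service name, host, port, and priorities are made up), the Docker labels that IaC attaches to a deployed container might look something like this:

```
# Hypothetical labels on the shadow main's container definition.
traefik.enable=true
traefik.http.routers.availability-shadow.rule=Host(`development-availability.classpass.com`) && HeadersRegexp(`baggage`, `dynamic_route=shadow`)
traefik.http.routers.availability-shadow.priority=100
traefik.http.services.availability-shadow.loadbalancer.server.port=8080

# The big red button deploy carries only the plain Host rule at a lower
# priority, so requests with no baggage header fall through to it.
traefik.http.routers.availability-default.rule=Host(`development-availability.classpass.com`)
traefik.http.routers.availability-default.priority=1
```

An ephemeral feature container would carry the same kind of rule, just matching its own tag, for example dynamic_route=feature-811.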

What’s this baggage stuff? Who here uses OpenTelemetry? We used OpenTelemetry too. We came from this era of trying to do things before the era of OpenTelemetry, where you make correlation IDs and request IDs and propagate them, and it is lousy. OpenTelemetry is great and is very good at propagating. We just installed the OpenTelemetry agent and clients just have to set the baggage header on their request and everything works.
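What that looks like from a caller’s side, assuming the service runs with the OTel Java agent attached, is roughly the following Java sketch. The host name here is illustrative, and the snippet is not our actual helper code; the point is that the agent’s default W3C propagators turn the baggage entry into a `baggage` header on every outgoing instrumented HTTP or gRPC call.

```java
import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.context.Scope;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch, assuming the OTel Java agent is attached to the JVM
// (java -javaagent:opentelemetry-javaagent.jar ...). With the default
// propagators (tracecontext, baggage), requests made inside the scope carry
// a `baggage: dynamic_route=shadow` header that Traefik can match on.
public final class DynamicRouteExample {
    public static void main(String[] args) throws Exception {
        Baggage routed = Baggage.current().toBuilder()
                .put("dynamic_route", "shadow") // or e.g. "feature-888"
                .build();
        try (Scope ignored = routed.makeCurrent()) {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://development-availability.classpass.com/health"))
                    .build();
            // Routed to the shadow main instead of the red-button deploy.
            client.send(request, HttpResponse.BodyHandlers.ofString());
        }
    }
}
```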

As I said, if I’m a QA engineer and I want to talk to tag 888, I put that on my baggage. If I’m CI, running all of the integration tests against the tag in my PR, I can do that. Octocat is happy. If I’m a frontend developer and I just want the E2E tests to work, I’m going to hit the shadow main. What does this look like in practice? Super easy. We are primarily a Java stack, but OpenTelemetry has agents for nearly every language in the universe at this point. We just had to install a Java agent, do some tagging, and run our service. This is copy-pasted from OTel’s docs, and everything just works. How do we make life easy for our developers? This is where the enablement team was starting to happen. Platform was spinning up a reverse proxy for us.

The backend team was learning how to write those routing rules. The frontend team began to figure out how to set cookies to propagate that baggage. You could just open up, for us, develop.classpass.com, set a cookie and every request you make would target the shadow main if you desired. Mobile engineers were also starting to get empowered. They were like, we’ll just create this little debug menu in our development builds that lets you just type in what baggage you want and then we’ll propagate that too. Every layer was starting to be involved in our little pop-up team and we were beginning to build that bridge.

The Cost?

What was the cost at the end of the day? We like cheap. We like cheerful. We only spun up the containers. We didn’t duplicate our infrastructure, like databases and queues; that’s very expensive. Maybe we’ll do that in the future, but not right now. It’s relatively cheap because our development containers are small. We don’t need an additional cluster. We don’t need to expend time writing that IaC, and our EC2 instances will scale as and when needed. By and large, we run a very limited number of containers in dev. Our instances were already underutilized, so functionally we didn’t see an increase in cost.

On the sociotechnical side, because we spent time working on the strategy, we painted the elephant and said, how do we eat this with a very small, lean team if we’re allowed to spend a year, year-and-a-half eating the elephant instead of trying to do it in four weeks? The work is now broken down into chunks. We cycle engineers on and off because, again, we’re still very lean and we’re building this idea of an enablement team. There’s handoff. Instead of just one ops person knowing how Traefik works and one dev person knowing how to write routing rules, we cycle engineers through the project. At this point, I think we have had at least 50% of our ops team work on the product, and at least five, if not more, backend engineers, and two frontend and mobile engineers work on it, which is a huge number and way more than just, ops, the black box is broken. We intend to continue that trajectory over time.

Results

What did we eventually enable? We’re still adopting this. It’s not a magic bullet. Again, we are somewhere in the second third of sophistication. We’ve borrowed a lot of this from much larger companies. We’re presenting here because maybe you have the same scale issues that we do. What do we get out of it? Our integration tests live again. Every PR is soon going to be able to spin up its ephemeral containers and write integration tests. Everything is wired up to add the baggage for them and to target shadow mains for all other services. Because we do have a little bit of a spaghetti issue where our microservices call one another, it’s very important that we target stable versions of other services. We’ve wired up the architecture so that you can write the tests either in a shared repo or just inside of your own repo. They can move back and forth. We’re not asking you to make two PRs for every change.

If you find that your test is critical to the survival of our business, you can move it up to a shared repo, which can be enabled everywhere else. We’ve reduced contention, because now the big red button is only one of the ways that you can deploy things. Engineers can deploy on demand outside of CI. They’ve used this to do some really cool things already. We’ve had teams do giant upgrades of a React framework and just test that in the shadow realm for a long time, because we are not able to disrupt the development frontend for any number of reasons. We’ve had backend engineers test big framework changes. We’re Dropwizard framework users, and we were five years behind on versions. We spun those up and we were able to test that.

Now we’re starting to empower mobile and frontend engineers. We are on the way to stabilizing end-to-end and regression tests by having those shadow mains. Even as we roll that out, mobile engineers can already say, I want to route around the thing that our backend engineer put up. They can do it themselves with cookies and with their debug menus. They don’t have to call us and be like, how do we set the baggage? Like, what? There’s knowledge that’s flowing through the system.

Challenges

It hasn’t all been sunshine and rainbows. The hardest part is the sociotechnical battle that happened. Not a battle, but a coordination effort that’s taken a year-plus to do. The technical challenges have been real; it’s hard. We went from Jenkins, which we were managing ourselves, over to GitHub Actions runners, because platform doesn’t want to have to deal with patching. I agree. Now we have to VPN into our VPC. I don’t know if anybody uses WireGuard, but the dragon has not been very friendly to us. We’ve had to learn how to do networking, because you take away one problem and you add another. We’ve had to realize that our system, as it was, was very dependent on certain quirks of AWS in order to work. It turned out that our containers did not have health checks, because all of our health checks were written at the LB level.

The LB would ping some /health endpoint that would do robust health checks, but it wasn’t a container health check, which is what Traefik uses to determine whether a service is healthy and able to receive traffic. If you don’t have a health check, Traefik just assumes you’re healthy. We actually destabilized frontend end-to-end tests for a while, because as a service was spinning up, Traefik was like, yes, it’s ready, ship traffic over there. Our frontend engineers were like, what is going on? Things are failing, but only for two and a half minutes. We had to learn that. We had to untangle our own archeology.
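The natural remedy, sketched here with made-up values (the image, port, path, and timings are illustrative, not ClassPass’s), is to give the container definition its own health check so a task only reports healthy once the app inside it is actually answering:

```json
{
  "name": "availability",
  "image": "availability:feature-811",
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -sf http://localhost:8080/health || exit 1"],
    "interval": 15,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 120
  }
}
```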

Then, of course, as with the adoption of any new technology, it’s super hard to educate people in that first initial phase of adoption. The major problem sometimes is people asking if it’s pronounced Traefik versus traffic, and is traffic flowing through Traefik? Then you’re like, I don’t know what is going on and why did they put the e in? Now we have different setups. Traefik is not production ready for us. We might never do it because we don’t want to have that setup in production. Now our production infrastructure looks different from our development infrastructure. While it’s there for a good reason, if you’re on the outside looking in, or a new engineer, you might be like, why? Where did the LB go for internal development traffic? There’s a lot of learning. Even the most plugged-in engineer has to deal with endless cognitive load. That’s why we’re trying to develop these clearer smaller systems, because we’re already overwhelmed with all the things we have to know.

For the time being, until this is embedded in our engineering culture, we’re asking our developers to do one more bit of knowledge acquisition, like one more Confluence talk. That is a mountain we’re still trying to climb. We’re excited. We’ve seen good results. We’ve seen good feedback, the best feedback about testing that we’ve had in years. We are going to trick everyone into accepting CI/CD as a way of life. The upside down is going to be the right side up eventually. We’re going to harden the infrastructure and make sure it scales, because that was our mistake the first time. We’re going to try not to make it again. Eventually, integration tests will come back to life. We will get them out of the graveyard of our black box and get them fully installed again. Now we’re going to move on to the next arc in our test infrastructure journey, which is test data generation. What do we do about development data versus production data? How do we tackle all of those things, and graduate on to canary deploys and whatnot?

Key Takeaways

On the sociotechnical side: you can have a small team, but with strategy and political will, you can do great things. Don’t give up. It’s possible. Don’t try to work against the grain so much. You can go online and read an awesome Medium blog post and try to implement those things, but your org structure is not going to permit you to do that until you work to make it happen. It’s going to be slow work over time. Try to enable as many types of testing as possible, even if it doesn’t fit the perfect practical testing pyramid with all the nice linear boxes. You’ll get there, but you have to start small. On a technical level, it’s not that hard. It’s not that easy either. It’s possible with a little bit of open-source reverse proxying, and OTel is a great asset to our community at large. You too can make sure that it’s not always DNS.

OpenTelemetry

We have OTel agents installed on every service. OTel is extremely good at intercepting requests. One of the most fun parts of proof-of-concepting this was that we had tried OTel many years in the past and it was sort of ok. Then we installed it and we’re like, it’s going through our gRPC clients. It’s going through all of our HTTP client libraries in the JVM and Node, a little bit in Python. It just worked. I was not expecting that. Very rarely do you just plug it in and it plays. OTel has been great. It has plugged and it has played. I think there’s a reason why every vendor out there who’s talking about observability is going to ask you if you’re using OTel. Like I said, it’s a three-line installation. It’s the same thing on Node and on the JVM. It’s just this. It hijacks your requests, off you go. Then, practically speaking, in our integration tests, we’ve written helper scripts to make sure that for any code you write, we inject that baggage for you. If you’re testing branch 1, 2, 3, we set that baggage header for you. All you have to focus on is writing your integration test.
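The helper scripts themselves aren’t shown here, but their shape is simple. As a hypothetical sketch in Java, assuming an OkHttp-based test client (the class and method names are made up, not ClassPass’s actual helpers), a single interceptor can stamp the baggage header onto every request a test makes, so CI only has to pass in the branch tag:

```java
import java.io.IOException;

import okhttp3.Interceptor;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

// Hypothetical integration-test helper: every request sent through this
// client carries `baggage: dynamic_route=<target>`, so Traefik routes it
// to the version under test (an ephemeral tag, or "shadow").
public final class DynamicRouteInterceptor implements Interceptor {
    private final String targetVersion;

    public DynamicRouteInterceptor(String targetVersion) {
        this.targetVersion = targetVersion; // e.g. "feature-123" or "shadow"
    }

    @Override
    public Response intercept(Chain chain) throws IOException {
        Request routed = chain.request().newBuilder()
                .header("baggage", "dynamic_route=" + targetVersion)
                .build();
        return chain.proceed(routed);
    }

    // CI passes the tag in (for example from an env var); tests just use the client.
    public static OkHttpClient clientFor(String targetVersion) {
        return new OkHttpClient.Builder()
                .addInterceptor(new DynamicRouteInterceptor(targetVersion))
                .build();
    }
}
```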

Questions and Answers

Participant 1: Any challenges with environment promotions with the shadow realm in play?

Po Linn Chia: Yes. We’re not promoting through environments. Again, the shadow main is there as entry-level CI/CD, but because it is now our first CD ever, we are experiencing the adoption curve of CD. We’ve had the challenges of WireGuard not working as we needed it to access our VPC from the outside in order to do the deploys. We’ve begun to resolve that. I’m sure that problems will come with load, because we’re doubling our cluster. For now, it’s been suspiciously decent, because we’re not interrupting the current dev workflow. As long as the builds are working and we’re deploying, we can debug iteratively. Folks can keep working around us the way they always have. We are not going to upend their world until they get used to the fact that shadow mains are there and begin testing extensively against them. We’ve had some VPN configuration issues on the operations side.

On the social side, we’re easing people into it. It’s been exciting because we’re not forcing them, but we are showing them that it exists. We’ve had teams volunteer to use this technology, which I think is the greatest sign of you having done a decent DevEx job. Very few people want to use a new tool. That’s been our signal.

Participant 2: How do you handle testing anything that requires a data change? Like a migration to your database, changing the table schemas, anything like that in dev?

Po Linn Chia: That is one of our "we’re going around that for now" areas. We’re of the opinion, and this is part of the journey into CI/CD, I think, that if you are putting a migration into main, it should be safe to apply. This freaks developers out at the moment. When we deploy a shadow main, we do not run migrations and we do not run infrastructure creates. We continue to just push up a new ECS task definition. As this stabilizes, we’re going to have strong discussions about whether that’s going to change, and whether we apply your infrastructure changes for you automatically in a CD world.

Like I said, the hardest part of all of this is the social, cultural aspect of engineering. People are very used right now to bundling migrations in with code changes, which makes it very fragile if you need to roll it back or forward. I think that’s going to be the big challenge for us. For now, we are routing around it, but we’ve put it on the elephant roadmap. I suspect, going forward, we will try and get those things applied automatically as real, front of the house development becomes CI/CD.
