The Doomsday

It’s Monday morning. You grab your coffee, check the monitoring dashboard, and notice the application’s CPU hitting 100%. Your Spring Boot service that processes location data is clearly under pressure, AWS RDS costs are rising, and the team is working hard to keep everything running smoothly.

Sound familiar? That was me a month ago.

We had just taken over support for this application from a client. It processes millions of requests for geocoding, timezone lookups, and weather data. It runs on AWS ECS with Aurora MySQL, uses Redis for caching, and apparently worked fine when the original team built it. But as traffic grew, we uncovered a bottleneck that was holding back performance, and it became my responsibility to address it. In the process of removing this bottleneck, I enabled the system to support nearly twice as many clients, directly driving a significant revenue increase.

Finding the Root Causes (The Debugging Journey)

I spent a couple of days digging into this problem, and honestly, it was way messier than I initially thought. Here’s what I discovered:

The Redis Dependency Problem

The original team’s caching strategy was simple – and wrong. Whenever Redis had a cache miss (which happened a lot during restarts or high traffic), the code would completely skip the database and hammer expensive external APIs instead:

// This is what they were doing.
public LocationData getLocation(String address) {
    LocationData cached = redisCache.get(address);
    if (cached == null) {
        return googleMapsApi.geocode(address); // Straight to expensive Google Maps API.
    }
    return cached;
}

When Redis went down or entries got evicted, the system would re-geocode the same addresses that had already been geocoded months ago. We all know the Google Maps API isn’t cheap, and the client was essentially paying for the same lookups twice.

The Concurrent Duplicates Problem

Under heavy load, multiple threads would simultaneously request the same data. If 50 users wanted the weather for New York at the same time, the system would make 50 identical API calls to the weather service, all within milliseconds.

I discovered this when digging through the API logs and seeing clusters of identical requests with timestamps just milliseconds apart.

The Burst Traffic Issue

No rate limiting. No retry logic. When traffic spiked, the system would blast the external APIs until they started throttling. That made everything slower, which led to more retries, which made everything worse: a classic death spiral, served up on a plate.

The Timing Mismatch

This one was subtle but costly. The cache TTL was 20 minutes and the background job refreshed data every 20 minutes, but the freshness check was set to 17 minutes. So the system refreshed data 3 minutes early, every single time. Multiply that by millions of requests, and it adds up.
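
To make the mismatch concrete, here’s a minimal sketch of the kind of freshness check involved – the names and structure are illustrative, not the original code:

import java.time.Duration;
import java.time.Instant;

class FreshnessCheck {

    // The cache TTL and the refresh job were both set to 20 minutes,
    // but this threshold was 17 – so every cycle re-fetched data
    // 3 minutes before it actually went stale.
    private static final Duration FRESHNESS_THRESHOLD = Duration.ofMinutes(17);

    boolean needsRefresh(Instant lastRefreshed) {
        Duration age = Duration.between(lastRefreshed, Instant.now());
        return age.compareTo(FRESHNESS_THRESHOLD) >= 0;
    }
}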

The Fixes (What Actually Worked)

After a lot of trial and error, here’s what actually worked.

1. Smart Fallback Caching

I built a read-through cache that actually uses the database as a fallback:

public LocationData getLocation(String address) {
    // Try Redis first
    LocationData cached = redisCache.get(address);
    if (cached != null) {
        return cached;
    }
    
    // Redis miss? Check the database before hitting external APIs
    LocationData fromDb = database.findByAddress(address);
    if (fromDb != null) {
        // Put it back in Redis and return
        redisCache.put(address, fromDb);
        return fromDb;
    }
    
    // Only now do we hit the external API
    LocationData fresh = googleMapsApi.geocode(address);
    database.save(fresh);  // Save for next time
    redisCache.put(address, fresh);
    return fresh;
}

This simple change cut external API calls during application restarts. It turned out most of this data already existed in the database – it just wasn’t being used.

2. Request Deduplication

I implemented a simple deduplication pattern using ConcurrentHashMap and CompletableFuture:

private final ConcurrentMap<String, CompletableFuture<LocationData>> pendingRequests = 
    new ConcurrentHashMap<>();

public LocationData getLocation(String address) {
    // Check cache first (same as before)
    LocationData cached = getFromCache(address);
    if (cached != null) {
        return cached;
    }
    
    // If someone else is already fetching this, wait for their result
    CompletableFuture<LocationData> future = pendingRequests.computeIfAbsent(address, key -> {
        return CompletableFuture.supplyAsync(() -> {
            return fetchLocationData(address);
        });
    });
    
    try {
        // join() blocks until the shared fetch completes and avoids get()'s checked exceptions
        return future.join();
    } finally {
        // Clean up when done so a later cache miss can trigger a fresh fetch
        pendingRequests.remove(address);
    }
}

Now, when 50 requests come in for the same location, only one actually calls the external API. The other 49 just wait for the result. Simple but effective.

3. Rate Limiting

I added proper rate limiting to the external API clients:

GeoApiContext context = new GeoApiContext.Builder()
    .apiKey(apiKey)
    .queryRateLimit(10)  // Max 10 requests per second
    .maxRetries(3)
    .retryTimeout(10_000, TimeUnit.MILLISECONDS)
    .build();

This smoothed out the traffic patterns and eliminated the death spirals that were happening before.
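
The Google Maps client has this built in, but the other external calls – the weather API, for example – needed the same guardrails. Here’s a minimal sketch of how that could look with Resilience4j’s RateLimiter; the WeatherApi interface and WeatherData record are placeholders, not the client’s real code:

import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;
import io.github.resilience4j.ratelimiter.RateLimiterRegistry;

import java.time.Duration;
import java.util.function.Supplier;

public class RateLimitedWeatherClient {

    interface WeatherApi { WeatherData fetchWeather(String city); }   // placeholder for the real client
    record WeatherData(String summary) {}                             // placeholder payload

    private final WeatherApi weatherApi;
    private final RateLimiter rateLimiter;

    public RateLimitedWeatherClient(WeatherApi weatherApi) {
        this.weatherApi = weatherApi;

        RateLimiterConfig config = RateLimiterConfig.custom()
            .limitForPeriod(10)                          // at most 10 calls...
            .limitRefreshPeriod(Duration.ofSeconds(1))   // ...per second
            .timeoutDuration(Duration.ofSeconds(5))      // callers wait up to 5s for a permit
            .build();

        this.rateLimiter = RateLimiterRegistry.of(config).rateLimiter("weatherApi");
    }

    public WeatherData getWeather(String city) {
        // Wrap the outbound call so it can never exceed the configured rate
        Supplier<WeatherData> limited =
            RateLimiter.decorateSupplier(rateLimiter, () -> weatherApi.fetchWeather(city));
        return limited.get();
    }
}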

4. Getting The Timing Right

I synchronized all the timing intervals:

// Before: 20-minute TTL and refresh job, but a 17-minute freshness check
private static final int FRESHNESS_CHECK_MINUTES = 17;
private static final int CACHE_TTL_MINUTES = 20;

// After: everything aligned
private static final int FRESHNESS_CHECK_MINUTES = 20; // Matches the refresh cycle and TTL
private static final int CACHE_TTL_MINUTES = 20;       // Matches business requirement

Small change, but it eliminated thousands of unnecessary refreshes per day.

The Results

The transformation was honestly pretty dramatic:

  • CPU usage dropped from 100% to around 40% – we could finally scale down the instances
  • Response times improved – fewer external calls meant faster responses
  • Reliability went up – the app now gracefully handles Redis outages and API throttling

What I Learned

Don’t Trust Cache-Only Strategies

If your cache is the only place you look before calling an expensive API, it’s not really a cache – it’s a single point of failure. Have a fallback plan.

Concurrency is Sneaky

I could have spent months optimizing algorithms and database queries before realizing that the system was making the same expensive calls multiple times simultaneously. In this case, AWS CloudWatch Logs were the most reliable source of truth.
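
For the curious, this is roughly the kind of Logs Insights query (run here through the AWS SDK v2) that exposes those clusters of duplicates – the log group name and message pattern are assumptions, not the real ones:

import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;
import software.amazon.awssdk.services.cloudwatchlogs.model.GetQueryResultsRequest;
import software.amazon.awssdk.services.cloudwatchlogs.model.GetQueryResultsResponse;
import software.amazon.awssdk.services.cloudwatchlogs.model.QueryStatus;
import software.amazon.awssdk.services.cloudwatchlogs.model.StartQueryRequest;

import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class DuplicateCallFinder {

    public static void main(String[] args) throws InterruptedException {
        CloudWatchLogsClient logs = CloudWatchLogsClient.create();

        // Group identical request messages into one-second buckets and count them;
        // anything with a count above 1 is a concurrent duplicate
        String query = String.join("\n",
            "fields @timestamp, @message",
            "| filter @message like /geocode/",
            "| stats count(*) as calls by bin(1s), @message",
            "| sort calls desc",
            "| limit 20");

        String queryId = logs.startQuery(StartQueryRequest.builder()
            .logGroupName("/ecs/location-service")   // assumed log group name
            .startTime(Instant.now().minus(1, ChronoUnit.HOURS).getEpochSecond())
            .endTime(Instant.now().getEpochSecond())
            .queryString(query)
            .build()).queryId();

        // Poll until the query finishes, then print the duplicate clusters
        GetQueryResultsResponse results;
        do {
            Thread.sleep(1_000);
            results = logs.getQueryResults(GetQueryResultsRequest.builder().queryId(queryId).build());
        } while (results.status() == QueryStatus.RUNNING || results.status() == QueryStatus.SCHEDULED);

        results.results().forEach(System.out::println);
    }
}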

External APIs Will Cost You

Third-party APIs will happily take your money and then throttle you when you use them too much. Rate limiting isn’t just nice to have – it’s essential for cost control.

Measure Everything

I thought I understood the traffic patterns from the initial analysis, but I was wrong. Actually logging and measuring things revealed problems that weren’t immediately obvious.
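
If I were starting over, I’d put a counter and a timer around every external call from day one. Here’s a minimal sketch with Micrometer, which Spring Boot already ships with – the meter names and the geocoding stub are illustrative, not the real code:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class MeteredGeocoder {

    record LocationData(String formattedAddress) {}   // placeholder type

    private final Counter externalCalls;
    private final Timer latency;

    public MeteredGeocoder(MeterRegistry registry) {
        // Count every outbound call so duplicate and burst patterns show up on a dashboard
        this.externalCalls = Counter.builder("external.api.calls")
            .tag("api", "google-maps")
            .register(registry);
        this.latency = Timer.builder("external.api.latency")
            .tag("api", "google-maps")
            .register(registry);
    }

    public LocationData geocode(String address) {
        externalCalls.increment();
        // Timer.record(Supplier) times the call and passes the result through
        return latency.record(() -> callGoogleMaps(address));
    }

    private LocationData callGoogleMaps(String address) {
        // Placeholder for the real Google Maps call
        return new LocationData(address);
    }
}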

Infrastructure Matters Too

While I was optimizing the application code, I also tweaked the AWS setup:

  • Enabled SQS long polling (reduced empty receives by 90% – see the sketch after this list)
  • Tuned the database connection pool settings
  • Moved the scheduled jobs to run on fewer instances
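
The long-polling change deserves a closer look, because it’s essentially a one-line tweak on the receive call (or on the queue’s ReceiveMessageWaitTimeSeconds attribute). A minimal sketch with the AWS SDK v2 – the queue URL is a placeholder:

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

import java.util.List;

public class QueueConsumer {

    private final SqsClient sqs = SqsClient.create();
    private final String queueUrl =
        "https://sqs.eu-west-1.amazonaws.com/123456789012/location-jobs"; // placeholder URL

    public List<Message> poll() {
        // waitTimeSeconds > 0 enables long polling: SQS holds the request open
        // for up to 20 seconds instead of returning an empty response immediately
        ReceiveMessageRequest request = ReceiveMessageRequest.builder()
            .queueUrl(queueUrl)
            .maxNumberOfMessages(10)
            .waitTimeSeconds(20)
            .build();

        return sqs.receiveMessage(request).messages();
    }
}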

Things That Didn’t Work

Not everything I tried was successful:

  • Aggressive caching: I tried caching everything for hours, but the data got too stale for the users
  • Database denormalization: I considered pre-computing everything, but the complexity wasn’t worth it
  • Switching cache technologies: I evaluated other cache solutions, but Redis wasn’t the problem – the usage patterns were

Was It Worth It?

The 60% CPU reduction enabled me to scale down their ECS instances, reducing costs and allowing the system to handle nearly twice as many clients — turning savings into new growth.

But honestly, the best part was getting my Monday mornings back—no more weekend alerts about pegged CPUs. No more scrambling to understand why the app was slow. The system is now predictable and manageable.

If You’re Facing Similar Issues

Here’s my advice:

  1. Start with measurement – you probably don’t understand your traffic patterns as well as you think
  2. Look for duplicate work – concurrent requests doing the same thing are more common than you’d expect
  3. Build fallback layers – single points of failure will bite you eventually
  4. Rate limit everything external – your wallet will thank you
  5. Deploy incrementally – big bang releases are risky, especially for performance changes

And if you’d like to talk over the reliability of your own cloud projects…

Let’s talk!

Kamil Supera, a stalwart Backend Developer and Tester at Makimo, deftly channels his passions into curating insightful articles on the intricacies of testing and AWS. With Python as his mainstay, Kamil weaves a realm where logic meets magic, advocating for continuous enhancement and getting to the root of every problem. Often found sharing his industry wisdom on Makimo's blog, he acts as a guardian of quality and an enchanter of code. When not immersed in digital complexities, he retreats to nature's sanctuary, embracing the tranquility of trees and streams far from urban clamor. Kamil's enduring fascination with problem-solving, coupled with his love for the great outdoors, defines his unique perspective, both as a professional and an individual.