Concurrency Control from First Principles
3 strategies on a single node, 3 strategies across multiple nodes — with real-life examples, production patterns, and future-ready guidance
Credits / Acknowledgements
This blog is based on detailed discussions and whiteboarding sessions with Sourabh Kumar Banka and Jatin Goyal.
Why you should care (even if things “work fine” today)
Race conditions don’t usually show up in development. They show up when:
traffic spikes,
retries kick in,
background jobs overlap,
autoscaling adds more instances,
latency increases (so overlaps happen more often).
If you’re building modern systems (cloud, microservices, async workflows, distributed caches), you’re going to face concurrency whether you like it or not.
The core idea to remember:
Every correct system serializes updates somewhere.
Your design decision is where that serialization happens and what trade-offs you accept.
First principles: What is a race condition?
A race condition exists when all three are true:
Shared mutable state
Something that can change: a DB row, cache entry, session, balance, inventory count.
Concurrent actors
Two or more threads/processes/nodes can operate “at the same time.”
A non-atomic update
The update happens as:
Read → Compute → Write
If two actors do this concurrently, you can break invariants.
A simple invariant example
“In USER(id=1), the final name should reflect the last accepted update, not a random overwrite.”
The classic failure mode: Lost Update
Two requests read the same initial value, then one overwrites the other.
Running example we’ll use throughout
A USER table with a version column:
USER(
  id BIGINT PRIMARY KEY,
  name TEXT,
  phone TEXT,
  version BIGINT NOT NULL
)
Two requests arrive roughly together:
Request A: set name = "jating" for id = 1
Request B: set name = "jatink" for id = 1
Both start by reading:
SELECT id, name, phone, version FROM user WHERE id = 1;
Now the question is: how do we ensure these updates don’t corrupt state or overwrite each other incorrectly?
Real-life analogy (keep this in mind)
“Two people editing a contract”
If both edit without coordination, one person’s changes get overwritten.
If they use a checkout/lock, only one edits at a time.
If they use a version number, the system can reject stale edits.
That’s exactly what we’re doing—just with databases and services.
Part 1: Single node — 3 practical strategies
A “single node” means one service instance (one JVM / one container). Concurrency is mostly threads and async tasks.
1) Database row lock + transaction isolation
Principle: “Make the update mutually exclusive by locking the row.”
When you lock, you force a single order of writes.
You serialize at the database layer.
Example: Pessimistic row locking
BEGIN;
SELECT id, name, phone, version
FROM user
WHERE id = 1
FOR UPDATE; -- row lock acquired here
-- compute changes in application
UPDATE user
SET name = 'jatink'
WHERE id = 1;
COMMIT;
What happens under contention?
First transaction gets the lock.
Other transactions trying to lock the same row wait.
You get strict ordering: T1 then T2 then T3.
Real-life analogy
A bank locker has one key. Whoever holds it can change contents. Others wait.
Pros
Strong correctness
No “retry storms” (others block, they don’t thrash)
Simple to reason about for single-row updates
Cons (this matters in production)
Blocking increases latency under contention
Deadlocks possible if multiple rows are locked in different orders
Throughput can drop if locks are held too long
Production-grade best practices
Keep transactions short. Do not call external APIs inside a transaction holding locks.
Lock rows in a consistent order (e.g., ascending by id) when locking multiple rows.
Use lock timeouts to avoid infinite waiting.
For work-queue patterns, consider:
FOR UPDATE SKIP LOCKED (process jobs without competing for the same row)
FOR UPDATE NOWAIT (fail fast if the lock isn’t available)
Isolation levels (quick intuition, not theory-heavy)
READ COMMITTED: common default; can still see changes between reads.
REPEATABLE READ: stable view within a transaction (varies by DB details).
SERIALIZABLE: strongest; closest to “as-if one at a time,” usually slower.
Rule of thumb:
Use stronger isolation only when you can’t protect invariants another way.
Serializable is powerful but can become a bottleneck quickly.
2) MVCC / Optimistic locking (version check)
Principle: “Let everyone run; detect conflicts when writing.”
Optimistic locking assumes:
conflicts are rare
reads are common
blocking is expensive
The pattern
Read the row with its version:
SELECT id, name, phone, version
FROM user
WHERE id = 1; -- suppose version = 7
Attempt the update only if the version still matches:
UPDATE user
SET name = 'jatink',
version = version + 1
WHERE id = 1
AND version = 7;
If UPDATE affects 0 rows → someone else updated it first → conflict.
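The same check can be exercised in application code. Below is a minimal in-memory stand-in (the VersionedUser class and its fields are hypothetical, not a real API) that mirrors the conditional UPDATE above: the write succeeds only if the stored version still matches.

```java
import java.util.concurrent.atomic.AtomicReference;

// In-memory stand-in for the USER row from the running example.
// A real system would run the versioned UPDATE against the database.
public class VersionedUser {
    public record Row(String name, long version) {}

    private final AtomicReference<Row> row =
            new AtomicReference<>(new Row("jatin", 7));

    public Row read() { return row.get(); }

    // Mirrors: UPDATE user SET name = ?, version = version + 1
    //          WHERE id = ? AND version = ?
    // Returns true iff the stored version still equals expectedVersion
    // (the SQL equivalent of "1 row affected").
    public boolean updateIfVersionMatches(String newName, long expectedVersion) {
        Row current = row.get();
        if (current.version() != expectedVersion) return false; // 0 rows affected
        return row.compareAndSet(current, new Row(newName, expectedVersion + 1));
    }
}
```

The caller treats a false return exactly like a zero-row UPDATE: someone else won, so re-read and retry (or surface a conflict).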
What happens under contention?
Many threads read version 7 simultaneously.
Only one wins and increments to 8.
Everyone else fails and retries.
Real-life analogy
You pick a product off a shelf with a price tag (version).
At checkout, if the price tag changed, the cashier rejects it and you must re-check the price.
Pros
No blocking (better latency when conflicts are rare)
Excellent for read-heavy systems
Works well with caching and scale
Cons (the hidden killer)
Under hot-key write contention, optimistic locking degrades into:
wasted CPU (everyone computes)
wasted DB calls (failed updates)
“retry storms” that amplify load
Future-ready retry strategy (do this, always)
Bounded retries with backoff + jitter:
Max attempts: 3–5
Exponential backoff
Add random jitter
After that, return a clear conflict response (or queue the update)
Pseudo:
for attempt in 1..MAX:
    success = updateWhereVersionMatches()
    if success:
        return OK
    sleep(baseMs * 2^attempt + random(0..jitterMs))
return CONFLICT (or enqueue)
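A runnable Java version of the same loop might look like this; the method and parameter names are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

// Bounded retries with exponential backoff plus random jitter.
// The attempt/backoff parameters are illustrative, not prescriptive.
public class OptimisticRetry {
    public static boolean runWithRetries(BooleanSupplier updateWhereVersionMatches,
                                         int maxAttempts, long baseMs, long jitterMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (updateWhereVersionMatches.getAsBoolean()) {
                return true; // OK: the versioned update went through
            }
            if (attempt < maxAttempts) {
                // exponential backoff + jitter de-synchronizes competing retriers
                long sleepMs = baseMs * (1L << attempt)
                        + ThreadLocalRandom.current().nextLong(jitterMs + 1);
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false; // CONFLICT: caller should return an error or enqueue
    }
}
```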
What to monitor (so you can evolve safely later)
optimistic lock failure rate
retry counts
p95/p99 latency during contention
DB CPU and lock metrics
If conflicts rise over time, you’ll need to shift strategies (often to key-based serialization).
3) Application-level pessimistic locking (serialize by key)
Principle: “Enforce one-writer-per-entity before hitting the DB.”
If your real contention is:
“the same user/session/account is updated frequently,”
then serialize on that key in the application.
Pattern A: Striped locks (simple, efficient)
Instead of creating one lock per user (memory leak risk), use a fixed array of locks and map keys to a stripe:
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public final class StripedUserLock {
    private final Lock[] stripes;

    public StripedUserLock(int stripesCount) {
        this.stripes = new Lock[stripesCount];
        for (int i = 0; i < stripesCount; i++) stripes[i] = new ReentrantLock();
    }

    public Lock lockFor(long userId) {
        int idx = (int) (Math.floorMod(userId, stripes.length));
        return stripes[idx];
    }
}
Usage:
Lock lock = stripedLock.lockFor(userId);
lock.lock();
try {
    // read -> compute -> write for this user
} finally {
    lock.unlock();
}
Pattern B: Per-key queue / “actor” model (strong ordering)
Route updates to a single-thread executor per partition:
hash(userId) → partition → single worker thread → DB update
This gives you:
strict ordering per user
no retry storms
stable behavior under load
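One way to sketch this routing in Java (the class name is hypothetical): a fixed pool of single-threaded executors, with each userId hashed to one of them so that all updates for a given user run strictly in submission order.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// hash(userId) -> partition -> single worker thread.
// Because each partition has exactly one thread, tasks for the same
// user never interleave: the read -> compute -> write is serialized.
public class PerKeySerializer implements AutoCloseable {
    private final ExecutorService[] partitions;

    public PerKeySerializer(int partitionCount) {
        partitions = new ExecutorService[partitionCount];
        for (int i = 0; i < partitionCount; i++) {
            partitions[i] = Executors.newSingleThreadExecutor();
        }
    }

    public Future<?> submit(long userId, Runnable update) {
        int idx = Math.floorMod(Long.hashCode(userId), partitions.length);
        return partitions[idx].submit(update);
    }

    @Override
    public void close() {
        for (ExecutorService p : partitions) {
            p.shutdown();
            try {
                p.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Note that even a non-atomic counter update becomes safe under this scheme, because all writes for one key happen on one thread.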
Real-life analogy
A dedicated clerk handles one customer account at a time.
Other requests for that account wait in a line, not in the database.
Pros
Eliminates retry storms for hot keys
Reduces DB lock pressure
Predictable ordering and latency
Often the best option for “hot entity” systems
Cons
On a single node, it’s great—but across nodes you need coordination (next section)
Requires architecture discipline (all writes must go through the serializer)
Why “3-way contention” often shows on a single node
Because within one JVM, you may have 3 threads all doing:
read
compute
try to write
With optimistic locking, all 3 can execute simultaneously, then 2 fail → it feels like “3-way” active contention.
With row locks, only one runs while others wait; you still have 3 contenders, but only one active.
Part 2: Multiple nodes — 3 distributed strategies
Now imagine your service runs as 5 pods behind a load balancer. The same user update can hit different pods.
The challenge becomes:
How do we serialize updates across machines that can fail independently?
This is where distributed systems get hard.
1) Distributed lock
Principle: “Put the lock in a shared coordinator.”
You store the “key” in a system everyone can see:
Redis (with care)
etcd / ZooKeeper / Consul
DB advisory locks / lock tables
Flow:
Acquire lock(userId)
if acquired:
    update DB
    release lock
else:
    wait / retry / fail fast
Real-life analogy
One shared booking system for a meeting room across multiple offices.
The critical future-ready detail: Fencing tokens
Distributed locks can fail in subtle ways:
Node A gets lock.
Node A pauses (GC/network).
Lock expires.
Node B gets lock.
Node A resumes and still tries to write (stale owner).
Fencing tokens solve this:
Every lock acquisition returns an increasing token.
Your write includes the token.
DB rejects writes with older tokens.
Conceptually:
lock(userId) returns token=42
write must include token=42
DB rejects token < latestToken
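At the storage layer, fencing can be sketched as a store that remembers the highest token it has accepted (FencedStore is an illustrative name, not a real API; a production system would persist the token alongside the data):

```java
import java.util.concurrent.atomic.AtomicLong;

// The store tracks the highest fencing token it has seen and rejects
// any write carrying an older one, so a stale lock holder that resumes
// after its lease expired cannot clobber newer writes.
public class FencedStore {
    private final AtomicLong highestToken = new AtomicLong(-1);
    private volatile String value;

    // Returns true if the write was accepted, false if the token is stale.
    public boolean write(long fencingToken, String newValue) {
        while (true) {
            long seen = highestToken.get();
            if (fencingToken < seen) {
                return false; // stale owner: token < latest accepted token
            }
            if (fencingToken == seen
                    || highestToken.compareAndSet(seen, fencingToken)) {
                value = newValue; // current (or newer) owner may write
                return true;
            }
            // CAS lost to a concurrent writer; re-read and re-check
        }
    }

    public String read() { return value; }
}
```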
If you use distributed locks, fencing tokens (or an equivalent “I am still the valid owner” mechanism) are what make it future-proof.
Pros
Strong mutual exclusion across nodes
Simple conceptual model
Cons
Adds latency
Coordinator can become a bottleneck
Needs careful correctness handling (leases, fencing, idempotency)
2) Saga pattern
Principle: “Don’t lock globally; use local transactions + compensation.”
Sagas are for multi-step workflows across services, where distributed locking would be too slow or too fragile.
Example: user offboarding across services
Disable user login (Auth service)
Revoke tokens/sessions (Session service)
Remove PII (Profile service)
Notify downstream systems (Events)
If step 3 fails, you compensate (or mark as partial and retry).
Two styles
Choreography: services react to events
Orchestration: a saga orchestrator drives the steps
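An orchestration-style saga with compensation can be sketched in a few lines; the class names are hypothetical, and stand-in Runnables play the role of calls to the real services:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Run the steps in order. If one fails, run the compensations of the
// already-completed steps in reverse order, then report the saga as
// compensated. Real sagas also persist step state for crash recovery.
public class SagaOrchestrator {
    public record Step(String name, Runnable action, Runnable compensation) {}

    // Returns true if all steps committed, false if the saga was compensated.
    public static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step);
            } catch (RuntimeException failure) {
                // undo in reverse order of completion
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();
                }
                return false;
            }
        }
        return true;
    }
}
```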
Real-life analogy
Planning a trip:
If hotel booking fails, cancel flight.
You didn’t lock every provider; you compensated when something failed.
Production-ready requirements (this is what makes it future-proof)
Idempotency on every step (retries are normal in distributed systems)
Outbox pattern (persist event + local state atomically, then publish)
Inbox/deduplication for consumers
Strong observability: step state, retries, compensations
Pros
Scales extremely well
No global locks
Handles long-running operations cleanly
Cons
Eventual consistency (updates are not instantly visible everywhere)
Compensation logic can be complex
Requires discipline (idempotency, state machine)
3) Two-phase commit (2PC)
Principle: “All participants agree to commit or none do.”
2PC gives strong atomicity across multiple systems:
Prepare phase: “Can you commit?”
Commit phase: “Commit now.”
Real-life analogy
Escrow: the deal closes only if all parties confirm.
Pros
Strong consistency and atomicity
Cons (why it’s not future-friendly at scale)
Can block if coordinator fails
Adds latency
Poor scalability under high throughput
Operationally complex
Use 2PC only when strict atomicity is mandatory and throughput is not the priority.
Why multi-node often looks like “2-way contention”
In many distributed data systems, each key has a “primary owner.” Writes funnel through that owner or through a single serialization point.
So contention looks like:
many clients trying to update
one owner coordinating
→ “clients vs owner” feels like 2-way contention.
Future-ready patterns that age well
If you want this blog to remain useful as systems scale, these patterns are the ones that keep winning.
1) Idempotency is non-negotiable
In distributed systems, duplicates happen:
retries
timeouts
at-least-once delivery
Every write path should support an idempotency key:
“Apply operation X exactly once (or safely multiple times).”
Store the key + outcome.
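A minimal sketch of that contract (hypothetical names; an in-memory map stands in for a durable key-and-outcome store):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// The first call with a given idempotency key runs the operation and
// records its outcome; duplicate calls (retries, redeliveries) get the
// recorded outcome back instead of re-executing the operation.
public class IdempotentExecutor<T> {
    private final ConcurrentMap<String, T> outcomes = new ConcurrentHashMap<>();

    public T execute(String idempotencyKey, Supplier<T> operation) {
        return outcomes.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```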
2) Backpressure beats retries
Unbounded retries can destroy systems under load.
Future-ready design includes:
bounded retries
queueing
rate limits
circuit breakers
load shedding
3) Observability: measure contention, not just errors
Track:
lock wait time
optimistic lock conflict rate
retries per request
queue depth (if using serialization)
saga compensation counts
p95/p99 latency
If you can’t see contention, you can’t improve it safely.
4) Partitioned ownership: “single writer per key across the fleet”
This is often the most scalable alternative to distributed locks.
Instead of any node being able to write any key:
assign keys to partitions
route all writes for a key to one owner (or one partition consumer)
Common implementations:
consistent hashing + sticky routing
Kafka partition per userId (event-sourced updates)
sharded services by key range
This approach tends to be:
scalable
predictable
resilient
…and it’s “future-ready” because adding nodes increases partitions/owners rather than increasing conflict.
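A toy consistent-hash ring shows the routing idea; the hash function and virtual-node count here are illustrative, not production choices:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// "Single writer per key across the fleet": every node computes the same
// owner for a key, and adding a node only remaps a fraction of keys
// (thanks to virtual nodes spread around the ring).
public class OwnerRing {
    private static final int VNODES = 64;
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String nodeId) {
        for (int v = 0; v < VNODES; v++) {
            ring.put(hash(nodeId + "#" + v), nodeId);
        }
    }

    // Walk clockwise from the key's hash to the first node point.
    public String ownerOf(long userId) {
        int h = hash(Long.toString(userId));
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        // simple FNV-1a; a production ring would use a stronger hash
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }
}
```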
5) Consider commutative data models when possible
If operations commute (order doesn’t matter), you reduce coordination needs.
Examples:
counters (add/sub)
sets (add/remove with rules)
CRDT-style models (when domain allows)
Not always applicable—but when it is, it’s a long-term win.
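For example, a counter where each node increments its own slot and the total is the sum of all slots; increments commute, so nodes never need to agree on an order (hypothetical class, in-memory only):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Commutative counter in the spirit of a CRDT counter: every node writes
// only to its own slot, so concurrent increments from different nodes
// never conflict and can be applied in any order.
public class CommutativeCounter {
    private final ConcurrentMap<String, LongAdder> perNode = new ConcurrentHashMap<>();

    public void increment(String nodeId, long delta) {
        perNode.computeIfAbsent(nodeId, k -> new LongAdder()).add(delta);
    }

    public long total() {
        return perNode.values().stream().mapToLong(LongAdder::sum).sum();
    }
}
```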
Decision guide: choose the right tool
Single node
Low contention / read-heavy: optimistic locking
High contention on same rows: row locks or app-level serialization
Hot keys: app-level per-key serialization (striped locks / actor model)
Multiple nodes
Need strict mutual exclusion: distributed lock (+ fencing)
Long-running workflow across services: saga
Strict atomic commit across systems: 2PC (rare)
A production checklist you can keep forever
Before you pick a strategy, define:
What invariants must never break?
Is eventual consistency acceptable?
What’s the contention profile (read-heavy vs write-heavy, hot keys)?
What happens under retries and timeouts?
If you use optimistic locking:
bounded retries + jitter
measure conflict rate
don’t do heavy computation before the versioned update unless needed
If you use DB row locks:
keep transactions short
lock order consistent
timeouts enabled
avoid remote calls inside lock scope
If you use app-level serialization:
ensure all writes go through it
prevent memory leaks (striped locks / bounded queues)
implement backpressure (queue limits)
plan for multi-node (sticky routing / partitioned ownership)
If you use distributed locks:
use leases and renewal carefully
implement fencing tokens
make operations idempotent
If you use sagas:
outbox + idempotent consumers
clear state machine
compensation semantics defined and tested
Closing takeaway
Concurrency control is not about choosing “optimistic vs pessimistic.”
It’s about answering one first-principles question:
Where do we serialize updates, and how do we behave under failure and load?
Pick the serialization point intentionally, and your system will stay correct as it grows.
Ignore it, and production traffic will force serialization in the worst place—usually through timeouts, retries, and outages.