Concurrency Control from First Principles

3 strategies on a single node, 3 strategies across multiple nodes — with real-life examples, production patterns, and future-ready guidance

Credits / Acknowledgements
This blog is based on detailed discussions and whiteboarding sessions with Sourabh Kumar Banka and Jatin Goyal.


Why you should care (even if things “work fine” today)

Race conditions don’t usually show up in development. They show up when:

  • traffic spikes,

  • retries kick in,

  • background jobs overlap,

  • autoscaling adds more instances,

  • latency increases (so overlaps happen more often).

If you’re building modern systems (cloud, microservices, async workflows, distributed caches), you’re going to face concurrency whether you like it or not.

The core idea to remember:

Every correct system serializes updates somewhere.
Your design decision is where that serialization happens and what trade-offs you accept.


First principles: What is a race condition?

A race condition exists when all three are true:

  1. Shared mutable state
    Something that can change: a DB row, cache entry, session, balance, inventory count.

  2. Concurrent actors
    Two or more threads/processes/nodes can operate “at the same time.”

  3. A non-atomic update
    The update happens as:

Read → Compute → Write

If two actors do this concurrently, you can break invariants.

A simple invariant example

“In USER(id=1), the final name should reflect the last accepted update, not a random overwrite.”

The classic failure mode: Lost Update

Two requests read the same initial value, then one overwrites the other.
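The lost update can be replayed deterministically in a few lines. This is a hypothetical in-memory sketch (a plain map standing in for the USER table) that forces the bad interleaving: both actors read before either writes.

```java
import java.util.HashMap;
import java.util.Map;

// Deterministic replay of the lost-update interleaving:
// both actors read the same initial value, then write one after the other.
class LostUpdateDemo {
    static final Map<Long, String> userName = new HashMap<>();

    static String lastWriterWins() {
        userName.put(1L, "jatin");
        String readByA = userName.get(1L); // A reads "jatin"
        String readByB = userName.get(1L); // B also reads "jatin"
        userName.put(1L, readByA + "g");   // A writes "jating"
        userName.put(1L, readByB + "k");   // B writes "jatink" -- A's update is silently lost
        return userName.get(1L);
    }
}
```

B never saw A's write, so A's update vanished without any error being raised — that silence is what makes lost updates dangerous.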


Running example we’ll use throughout

A USER table with a version column:

USER(
  id BIGINT PRIMARY KEY,
  name TEXT,
  phone TEXT,
  version BIGINT NOT NULL
)

Two requests arrive roughly together:

  • Request A: set name = "jating" for id=1

  • Request B: set name = "jatink" for id=1

Both start by reading:

SELECT id, name, phone, version FROM user WHERE id = 1;

Now the question is: how do we ensure these updates don’t corrupt state or overwrite each other incorrectly?


Real-life analogy (keep this in mind)

“Two people editing a contract”

  • If both edit without coordination, one person’s changes get overwritten.

  • If they use a checkout/lock, only one edits at a time.

  • If they use a version number, the system can reject stale edits.

That’s exactly what we’re doing—just with databases and services.


Part 1: Single node — 3 practical strategies

A “single node” means one service instance (one JVM / one container). Concurrency is mostly threads and async tasks.

1) Database row lock + transaction isolation

Principle: “Make the update mutually exclusive by locking the row.”

When you lock, you force a single order of writes.
You serialize at the database layer.

Example: Pessimistic row locking

BEGIN;

SELECT id, name, phone, version
FROM user
WHERE id = 1
FOR UPDATE; -- row lock acquired here

-- compute changes in application

UPDATE user
SET name = 'jatink'
WHERE id = 1;

COMMIT;

What happens under contention?

  • First transaction gets the lock.

  • Other transactions trying to lock the same row wait.

  • You get strict ordering: T1 then T2 then T3.

Real-life analogy

A bank locker has one key. Whoever holds it can change contents. Others wait.

Pros

  • Strong correctness

  • No “retry storms” (others block, they don’t thrash)

  • Simple to reason about for single-row updates

Cons (this matters in production)

  • Blocking increases latency under contention

  • Deadlocks possible if multiple rows are locked in different orders

  • Throughput can drop if locks are held too long

Production-grade best practices

  • Keep transactions short. Do not call external APIs inside a transaction holding locks.

  • Lock rows in a consistent order (e.g., ascending by id) when locking multiple rows.

  • Use lock timeouts to avoid infinite waiting.

  • For work-queue patterns, consider:

    • FOR UPDATE SKIP LOCKED (process jobs without competing for the same row)

    • FOR UPDATE NOWAIT (fail fast if lock isn’t available)

Isolation levels (quick intuition, not theory-heavy)

  • READ COMMITTED: common default; can still see changes between reads.

  • REPEATABLE READ: stable view within a transaction (varies by DB details).

  • SERIALIZABLE: strongest; closest to “as-if one at a time,” usually slower.

Rule of thumb:

  • Use stronger isolation only when you can’t protect invariants another way.

  • Serializable is powerful but can become a bottleneck quickly.


2) MVCC / Optimistic locking (version check)

Principle: “Let everyone run; detect conflicts when writing.”

Optimistic locking assumes:

  • conflicts are rare

  • reads are common

  • blocking is expensive

The pattern

  1. Read the row with its version:

SELECT id, name, phone, version
FROM user
WHERE id = 1;  -- suppose version = 7

  2. Attempt update only if version still matches:

UPDATE user
SET name = 'jatink',
    version = version + 1
WHERE id = 1
  AND version = 7;

If UPDATE affects 0 rows → someone else updated it first → conflict.
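The same compare-then-swap idea can be sketched in memory with `ConcurrentHashMap.replace`, which succeeds only if the stored value still equals the snapshot we read — a hypothetical stand-in for the versioned `UPDATE`, not the actual DB call:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory analogue of the versioned UPDATE: the write succeeds only
// if the (name, version) pair is still exactly the one we read.
class OptimisticStore {
    // value = (name, version)
    private final ConcurrentHashMap<Long, Map.Entry<String, Long>> rows = new ConcurrentHashMap<>();

    void seed(long id, String name, long version) {
        rows.put(id, Map.entry(name, version));
    }

    Map.Entry<String, Long> read(long id) { return rows.get(id); }

    // Returns true iff nobody updated the row since we took the `expected` snapshot.
    boolean updateIfVersionMatches(long id, Map.Entry<String, Long> expected, String newName) {
        return rows.replace(id, expected, Map.entry(newName, expected.getValue() + 1));
    }
}
```

A stale snapshot fails exactly like an `UPDATE ... AND version = 7` that affects 0 rows.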

What happens under contention?

  • Many threads read version 7 simultaneously.

  • Only one wins and increments to 8.

  • Everyone else fails and retries.

Real-life analogy

You pick a product off a shelf with a price tag (version).
At checkout, if the price tag changed, the cashier rejects it and you must re-check the price.

Pros

  • No blocking (better latency when conflicts are rare)

  • Excellent for read-heavy systems

  • Works well with caching and scale

Cons (the hidden killer)

  • Under hot-key write contention, it becomes:

    • wasted CPU (everyone computes)

    • wasted DB calls (failed updates)

    • “retry storms” that amplify load

Future-ready retry strategy (do this, always)

Bounded retries with backoff + jitter:

  • Max attempts: 3–5

  • Exponential backoff

  • Add random jitter

  • After that, return a clear conflict response (or queue the update)

Pseudo:

for attempt in 1..MAX:
  success = updateWhereVersionMatches()
  if success:
    return OK
  sleep(baseMs * 2^attempt + random(0..jitterMs))
return CONFLICT (or enqueue)
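The pseudocode above translates to Java roughly as follows — a sketch where `attemptUpdate` stands in for any versioned-update call that returns true on success:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

// Bounded retries with exponential backoff and jitter, mirroring the pseudocode.
class BoundedRetry {
    static boolean withRetries(BooleanSupplier attemptUpdate, int maxAttempts,
                               long baseMs, long jitterMs) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptUpdate.getAsBoolean()) return true;   // committed
            long backoff = baseMs * (1L << attempt)          // exponential
                    + ThreadLocalRandom.current().nextLong(jitterMs + 1); // + jitter
            Thread.sleep(backoff); // give the winning writer time to finish
        }
        return false; // surface CONFLICT (or enqueue) to the caller
    }
}
```

The jitter matters: without it, all losers wake up simultaneously and collide again.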

What to monitor (so you can evolve safely later)

  • optimistic lock failure rate

  • retry counts

  • p95/p99 latency during contention

  • DB CPU and lock metrics

If conflicts rise over time, you’ll need to shift strategies (often to key-based serialization).


3) Application-level pessimistic locking (serialize by key)

Principle: “Enforce one-writer-per-entity before hitting the DB.”

If your real contention is:
“the same user/session/account is updated frequently,”
then serialize on that key in the application.

Pattern A: Striped locks (simple, efficient)

Instead of creating one lock per user (memory leak risk), use a fixed array of locks and map keys to a stripe:

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public final class StripedUserLock {
  private final Lock[] stripes;

  public StripedUserLock(int stripesCount) {
    this.stripes = new Lock[stripesCount];
    for (int i = 0; i < stripesCount; i++) stripes[i] = new ReentrantLock();
  }

  public Lock lockFor(long userId) {
    int idx = (int) (Math.floorMod(userId, stripes.length));
    return stripes[idx];
  }
}

Usage:

Lock lock = stripedLock.lockFor(userId);
lock.lock();
try {
  // read -> compute -> write for this user
} finally {
  lock.unlock();
}

Pattern B: Per-key queue / “actor” model (strong ordering)

Route updates to a single-thread executor per partition:

hash(userId) → partition → single worker thread → DB update

This gives you:

  • strict ordering per user

  • no retry storms

  • stable behavior under load
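Pattern B can be sketched with one single-threaded executor per partition — a hypothetical outline, assuming `writeTask` does the read → compute → write for one user:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// All writes for a key route to one single-threaded executor,
// so updates for the same user always apply in submission order.
class KeySerializedExecutor {
    private final ExecutorService[] partitions;

    KeySerializedExecutor(int partitionCount) {
        partitions = new ExecutorService[partitionCount];
        for (int i = 0; i < partitionCount; i++)
            partitions[i] = Executors.newSingleThreadExecutor();
    }

    // Same userId always maps to the same worker thread.
    Future<?> submit(long userId, Runnable writeTask) {
        return partitions[(int) Math.floorMod(userId, (long) partitions.length)].submit(writeTask);
    }

    void shutdown() { for (ExecutorService e : partitions) e.shutdown(); }
}
```

Because a partition owns exactly one thread, two updates for the same user can never interleave — the queue in front of the worker is the serialization point.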

Real-life analogy

A dedicated clerk handles one customer account at a time.
Other requests for that account wait in a line, not in the database.

Pros

  • Eliminates retry storms for hot keys

  • Reduces DB lock pressure

  • Predictable ordering and latency

  • Often the best option for “hot entity” systems

Cons

  • On a single node, it’s great—but across nodes you need coordination (next section)

  • Requires architecture discipline (all writes must go through the serializer)


Why “3-way contention” often shows on a single node

Because within one JVM, you may have 3 threads all doing:

  • read

  • compute

  • try to write

With optimistic locking, all 3 can execute simultaneously, then 2 fail → it feels like “3-way” active contention.

With row locks, only one runs while others wait; you still have 3 contenders, but only one active.


Part 2: Multiple nodes — 3 distributed strategies

Now imagine your service runs as 5 pods behind a load balancer. The same user update can hit different pods.

Now the challenge is:

How do we serialize updates across machines that can fail independently?

This is where distributed systems get hard.


1) Distributed lock

Principle: “Put the lock in a shared coordinator.”

You store the “key” in a system everyone can see:

  • Redis (with care)

  • etcd / ZooKeeper / Consul

  • DB advisory locks / lock tables

Flow:

Acquire lock(userId)
  if acquired:
    update DB
    release lock
  else:
    wait / retry / fail fast

Real-life analogy

One shared booking system for a meeting room across multiple offices.

The critical future-ready detail: Fencing tokens

Distributed locks can fail in subtle ways:

  • Node A gets lock.

  • Node A pauses (GC/network).

  • Lock expires.

  • Node B gets lock.

  • Node A resumes and still tries to write (stale owner).

Fencing tokens solve this:

  • Every lock acquisition returns an increasing token.

  • Your write includes the token.

  • DB rejects writes with older tokens.

Conceptually:

lock(userId) returns token=42
write must include token=42
DB rejects token < latestToken

If you use distributed locks, fencing tokens (or an equivalent “I am still the valid owner” mechanism) are what make it future-proof.
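The fencing check itself is small. Here is a hypothetical in-memory stand-in for the storage side: it remembers the highest token applied and rejects anything older, which is exactly what defeats the stale-owner scenario above.

```java
// Conceptual fencing guard: the store tracks the highest token it has
// applied and rejects writes carrying an older one.
class FencedStore {
    private long latestToken = 0;
    private String value;

    // Returns false for stale lock holders (token older than one already applied).
    synchronized boolean write(long token, String newValue) {
        if (token < latestToken) return false; // Node A resuming after expiry lands here
        latestToken = token;
        value = newValue;
        return true;
    }

    synchronized String read() { return value; }
}
```

The key property: the check lives in the storage layer, not in the lock service, so it holds even when the lock service's view of ownership is stale.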

Pros

  • Strong mutual exclusion across nodes

  • Simple conceptual model

Cons

  • Adds latency

  • Coordinator can become a bottleneck

  • Needs careful correctness handling (leases, fencing, idempotency)


2) Saga pattern

Principle: “Don’t lock globally; use local transactions + compensation.”

Sagas are for multi-step workflows across services, where distributed locking would be too slow or too fragile.

Example: user offboarding across services

  1. Disable user login (Auth service)

  2. Revoke tokens/sessions (Session service)

  3. Remove PII (Profile service)

  4. Notify downstream systems (Events)

If step 3 fails, you compensate (or mark as partial and retry).
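The orchestrated flavor of this can be sketched in a few lines — a toy orchestrator, not a production saga engine (real ones also persist step state and retry):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal orchestrated saga: run steps in order; if one fails, run the
// compensations of the already-completed steps in reverse order.
class SagaOrchestrator {
    record Step(String name, Runnable action, Runnable compensation) {}

    // Returns true if all steps committed, false if the saga was compensated.
    static boolean run(List<Step> steps) {
        Deque<Step> done = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                done.push(step);
            } catch (RuntimeException failed) {
                while (!done.isEmpty()) done.pop().compensation().run();
                return false;
            }
        }
        return true;
    }
}
```

Note that compensations run newest-first: you undo revocation before you undo the login disable, mirroring how nested rollbacks work.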

Two styles

  • Choreography: services react to events

  • Orchestration: a saga orchestrator drives the steps

Real-life analogy

Planning a trip:

  • If hotel booking fails, cancel flight.
    You didn’t lock every provider; you compensated when something failed.

Production-ready requirements (this is what makes it future-proof)

  • Idempotency on every step (retries are normal in distributed systems)

  • Outbox pattern (persist event + local state atomically, then publish)

  • Inbox/deduplication for consumers

  • Strong observability: step state, retries, compensations

Pros

  • Scales extremely well

  • No global locks

  • Handles long-running operations cleanly

Cons

  • Eventual consistency (not instantly consistent)

  • Compensation logic can be complex

  • Requires discipline (idempotency, state machine)


3) Two-phase commit (2PC)

Principle: “All participants agree to commit or none do.”

2PC gives strong atomicity across multiple systems:

  1. Prepare phase: “Can you commit?”

  2. Commit phase: “Commit now.”

Real-life analogy

Escrow: the deal closes only if all parties confirm.

Pros

  • Strong consistency and atomicity

Cons (why it’s not future-friendly at scale)

  • Can block if coordinator fails

  • Adds latency

  • Poor scalability under high throughput

  • Operationally complex

Use 2PC only when strict atomicity is mandatory and throughput is not the priority.
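The coordinator's logic reduces to a few lines. A toy sketch (real 2PC also logs decisions durably so a restarted coordinator can finish in-doubt transactions — the hard part this sketch omits):

```java
import java.util.List;

// Toy 2PC coordinator: ask every participant to prepare; commit only if
// all vote yes, otherwise tell everyone to roll back.
class TwoPhaseCommit {
    interface Participant {
        boolean prepare();   // phase 1: "Can you commit?"
        void commit();       // phase 2: "Commit now."
        void rollback();
    }

    static boolean run(List<Participant> participants) {
        for (Participant p : participants) {
            if (!p.prepare()) {                        // any "no" aborts the whole deal
                participants.forEach(Participant::rollback);
                return false;
            }
        }
        participants.forEach(Participant::commit);     // unanimous yes -> commit all
        return true;
    }
}
```

The blocking problem is visible here: a participant that voted yes must hold its locks until phase 2 arrives — and if the coordinator dies in between, it holds them indefinitely.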


Why multi-node often looks like “2-way contention”

In many distributed data systems, each key has a “primary owner.” Writes funnel through that owner or through a single serialization point.

So contention looks like:

  • many clients trying to update

  • one owner coordinating
    → “clients vs owner” feels like 2-way contention.


Future-ready patterns that age well

If you want this blog to remain useful as systems scale, these patterns are the ones that keep winning.

1) Idempotency is non-negotiable

In distributed systems, duplicates happen:

  • retries

  • timeouts

  • at-least-once delivery

Every write path should support an idempotency key:

  • “Apply operation X exactly once (or safely multiple times).”

  • Store the key + outcome.

2) Backpressure beats retries

Unbounded retries can destroy systems under load.

Future-ready design includes:

  • bounded retries

  • queueing

  • rate limits

  • circuit breakers

  • load shedding
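The simplest of these — load shedding via a concurrency cap — fits in a few lines. A hypothetical sketch: a semaphore bounds in-flight work, and callers are rejected (rather than queued forever) when the cap is reached.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Cap in-flight work with a semaphore; reject excess load immediately
// so callers can back off instead of piling up.
class BoundedConcurrency<R> {
    private final Semaphore permits;

    BoundedConcurrency(int maxInFlight) { permits = new Semaphore(maxInFlight); }

    Optional<R> tryExecute(Supplier<R> work) {
        if (!permits.tryAcquire()) return Optional.empty(); // shed load
        try {
            return Optional.of(work.get());
        } finally {
            permits.release();
        }
    }
}
```

Returning `Optional.empty()` fast is the point: a quick rejection the caller can handle beats a slow timeout that ties up threads on both sides.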

3) Observability: measure contention, not just errors

Track:

  • lock wait time

  • optimistic lock conflict rate

  • retries per request

  • deadlocks

  • queue depth (if using serialization)

  • saga compensation counts

  • p95/p99 latency

If you can’t see contention, you can’t improve it safely.

4) Partitioned ownership: “single writer per key across the fleet”

This is often the most scalable alternative to distributed locks.

Instead of any node being able to write any key:

  • assign keys to partitions

  • route all writes for a key to one owner (or one partition consumer)

Common implementations:

  • consistent hashing + sticky routing

  • Kafka partition per userId (event-sourced updates)

  • sharded services by key range

This approach tends to be:

  • scalable

  • predictable

  • resilient

…and it’s “future-ready” because adding nodes increases partitions/owners rather than increasing conflict.
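The routing core is deterministic hashing: every node computes the same owner for a given key, so all writes for that key funnel through one place. A hypothetical sketch (`owners` standing in for pod addresses, one per partition):

```java
import java.util.List;

// Sticky routing: userId -> owner is a pure function, so any node in the
// fleet agrees on who owns a given key.
class PartitionRouter {
    private final List<String> owners; // e.g. pod addresses, one per partition

    PartitionRouter(List<String> owners) { this.owners = owners; }

    String ownerOf(long userId) {
        // floorMod keeps the index non-negative for negative ids
        return owners.get((int) Math.floorMod(userId, (long) owners.size()));
    }
}
```

Plain modulo reshuffles most keys when the owner count changes; consistent hashing exists precisely to limit that reshuffling, which is why it is the usual production choice.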

5) Consider commutative data models when possible

If operations commute (order doesn’t matter), you reduce coordination needs.

Examples:

  • counters (add/sub)

  • sets (add/remove with rules)

  • CRDT-style models (when domain allows)

Not always applicable—but when it is, it’s a long-term win.
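To make "operations commute" concrete, here is a sketch in the spirit of a CRDT grow-only counter: each node increments its own slot, and merging takes the per-node maximum, so merges can arrive in any order and converge to the same value.

```java
import java.util.HashMap;
import java.util.Map;

// Grow-only counter: per-node slots, merge = per-node max.
// Merges commute, so replicas need no coordination to converge.
class GCounter {
    final Map<String, Long> perNode = new HashMap<>();

    void increment(String nodeId, long by) {
        perNode.merge(nodeId, by, Long::sum);
    }

    void mergeFrom(GCounter other) {
        other.perNode.forEach((node, count) -> perNode.merge(node, count, Long::max));
    }

    long value() {
        return perNode.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

No locks, no versions, no retries — the data model itself absorbs the concurrency. That's the payoff when the domain allows it.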


Decision guide: choose the right tool

Single node

  • Low contention / read-heavy: optimistic locking

  • High contention on same rows: row locks or app-level serialization

  • Hot keys: app-level per-key serialization (striped locks / actor model)

Multiple nodes

  • Need strict mutual exclusion: distributed lock (+ fencing)

  • Long-running workflow across services: saga

  • Strict atomic commit across systems: 2PC (rare)


A production checklist you can keep forever

Before you pick a strategy, define:

  • What invariants must never break?

  • Is eventual consistency acceptable?

  • What’s the contention profile (read-heavy vs write-heavy, hot keys)?

  • What happens under retries and timeouts?

If you use optimistic locking:

  • bounded retries + jitter

  • measure conflict rate

  • don’t do heavy computation before the versioned update unless needed

If you use DB row locks:

  • keep transactions short

  • lock order consistent

  • timeouts enabled

  • avoid remote calls inside lock scope

If you use app-level serialization:

  • ensure all writes go through it

  • prevent memory leaks (striped locks / bounded queues)

  • implement backpressure (queue limits)

  • plan for multi-node (sticky routing / partitioned ownership)

If you use distributed locks:

  • use leases and renewal carefully

  • implement fencing tokens

  • make operations idempotent

If you use sagas:

  • outbox + idempotent consumers

  • clear state machine

  • compensation semantics defined and tested


Closing takeaway

Concurrency control is not about choosing “optimistic vs pessimistic.”

It’s about answering one first-principles question:

Where do we serialize updates, and how do we behave under failure and load?

Pick the serialization point intentionally, and your system will stay correct as it grows.
Ignore it, and production traffic will force serialization in the worst place—usually through timeouts, retries, and outages.
