Concurrency Control from First Principles

3 ways on a single node, 3 ways across multiple nodes — with real-life examples

Race conditions aren’t a “database problem” or a “threading problem”. They’re a physics-of-computing problem: two actors try to change the same thing, and time doesn’t give you a single, obvious order.

This post explains concurrency control from first principles, then maps those principles to the three most common strategies on a single node and the three most common strategies in a distributed (multi-node) system, with practical examples you can reuse.


First principles: what causes a race condition?

A race condition exists when all three are true:

  1. Shared mutable state
    Something can be changed (a DB row, cache entry, file, in-memory map).

  2. Concurrent actors
    Two+ threads/processes/nodes can touch it “at the same time”.

  3. Non-atomic read → compute → write
    The update is not a single indivisible step.

The classic shape is:

Read current state
Compute new state
Write new state

If two actors do this concurrently, you can violate invariants like:

  • “version must monotonically increase”

  • “balance can’t go below 0”

  • “a user’s latest phone number should be the last update we accepted”
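The lost update is easy to reproduce once you spell out the interleaving. A minimal sketch (a hypothetical `balance` field standing in for the shared row) where both actors read before either writes:

```java
// Simulates two actors doing read -> compute -> write on shared state.
// Both read the same starting value, so one update is silently lost.
class LostUpdateDemo {
    static int balance = 100;

    public static void main(String[] args) {
        int readA = balance;      // actor A reads 100
        int readB = balance;      // actor B reads 100, before A writes
        balance = readA + 20;     // A writes 120
        balance = readB + 50;     // B writes 150 -- A's +20 is gone
        // Serialized, the result would be 170; interleaved, it is 150.
        System.out.println(balance);
    }
}
```

Every strategy below is a different way of forcing those two read-write pairs into a single order.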

Real-life analogy: two people editing the same Google Doc offline

  • Both download the document (read)

  • Both edit independently (compute)

  • Both upload (write)
    If the system doesn’t reconcile properly, one person’s changes overwrite the other’s.

So the fundamental question becomes:

Where will we enforce a single order of updates (serialization)?
In the DB? In the app? In a coordinator? In a workflow?

Every solution is just a different answer to that question.


A concrete example we’ll use throughout

A USER row with a version column:

USER(id, name, phone, version)

Two requests hit at the same time:

  • Request A: change name = "jating"

  • Request B: change name = "jatink"

Both target id = 1.


Part 1: Single node — 3 ways

On a single node, “actors” are typically threads (or async tasks) inside one process.

1) Row lock + transaction isolation

“Hold the key while you update”

First principle: Make the update mutually exclusive by locking the row, so only one transaction can modify it at a time.

Example (pessimistic row lock)

BEGIN;

SELECT id, name, phone, version
FROM user
WHERE id = 1
FOR UPDATE;   -- lock the row

-- compute changes

UPDATE user
SET name = 'jatink'
WHERE id = 1;

COMMIT;

What you get

  • Correctness: strong (you serialize writers)

  • Behavior under contention: others block

  • Cost: lock waits, potential deadlocks (if you lock multiple rows in different orders)

Real-life analogy: single key for a storage room

If you need to change what’s inside the room, you take the only key.
Others wait until you return it.

When it’s a good fit

  • High contention on a small set of rows (“hot users”)

  • You need strict ordering and can keep the transaction short

When it hurts

  • Long transactions (locks held too long)

  • Many rows locked → deadlocks and throughput collapse

Practical rule: if you choose row locks, keep the critical section tiny (no remote calls, no long computations while holding locks).


2) MVCC / Optimistic locking

“Don’t block; detect conflicts and retry”

First principle: Assume conflicts are rare. Let everyone proceed concurrently, then reject updates that used stale data.

This is usually implemented with a version column.

Example (optimistic version check)

-- Read
SELECT id, name, phone, version
FROM user
WHERE id = 1;          -- suppose version = 7

-- Write (only succeeds if version is still 7)
UPDATE user
SET name = 'jatink',
    version = version + 1
WHERE id = 1
  AND version = 7;

If UPDATE affects 0 rows → conflict happened → retry.

What you get

  • Correctness: strong (prevents lost updates)

  • Behavior under contention: no blocking, but retries

  • Cost: wasted work when conflicts are common

Real-life analogy: “Take a number” at a deli counter

You don’t block others from browsing.
But when it’s your turn, your ticket must still be valid. If the counter advanced, you might need a new ticket.

The hidden pitfall: retry storms

If the same row is updated constantly:

  • many threads read version 7

  • only one succeeds to version 8

  • everyone else retries → more load → more conflicts

This is why optimistic locking is amazing for read-heavy workloads and painful for hot write-heavy keys.

A sane retry policy (important)

Pseudo:

for attempt in 1..MAX:
  try updateWithVersionCheck()
  if success: return OK
  sleep(base * 2^attempt + random_jitter)
return CONFLICT
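The pseudo-policy translates almost directly to Java. A sketch assuming a caller-supplied `updateWithVersionCheck` that returns true when the conditional UPDATE affected one row:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

class OptimisticRetry {
    static final int MAX_ATTEMPTS = 5;
    static final long BASE_MS = 10;

    // Retries a version-checked update with exponential backoff plus jitter.
    // Jitter matters: it spreads retries out so contenders don't re-collide.
    static boolean withRetry(BooleanSupplier updateWithVersionCheck) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (updateWithVersionCheck.getAsBoolean()) return true;  // UPDATE hit 1 row
            try {
                Thread.sleep(BASE_MS * (1L << attempt)                      // base * 2^attempt
                        + ThreadLocalRandom.current().nextLong(BASE_MS));   // random jitter
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // attempts exhausted: surface CONFLICT to the caller
    }
}
```

Capping attempts and returning a conflict to the caller is the part people skip; an unbounded retry loop is exactly how a hot key turns into a retry storm.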

3) Application-level pessimistic locking

“Serialize by key before you reach the database”

First principle: If the real contention is “same user ID updated repeatedly”, then enforce order in the app by ensuring only one update for that user runs at a time.

Common patterns:

  • Striped locks (fixed number of locks; hash(userId) → lock)

  • Per-key single-thread executor / queue (actor model)

  • In-memory or distributed queue partitioning

Example: striped lock (single JVM)

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// 64 stripes: constant memory, avoids a per-user lock map
Lock[] stripes = new ReentrantLock[64];
for (int i = 0; i < stripes.length; i++) stripes[i] = new ReentrantLock();

Lock lockFor(long userId) {
  // floorMod handles negative IDs safely; Math.abs(Long.MIN_VALUE) is still negative
  return stripes[(int) Math.floorMod(userId, stripes.length)];
}

void updateUser(long userId, Consumer<User> updater) {
  Lock l = lockFor(userId);
  l.lock();
  try {
    // read -> compute -> write safely for this userId stripe
    User u = repo.get(userId);
    updater.accept(u);
    repo.save(u);
  } finally {
    l.unlock();
  }
}

What you get

  • Correctness: strong (for that app instance)

  • Behavior under contention: one updater runs, others wait in-memory

  • Cost: doesn’t automatically protect you if you have multiple nodes (more on that below)

Real-life analogy: bouncer with a guest list

If you want to change your name on the list, the bouncer handles only one request per person at a time.
You don’t even get to the desk until it’s your turn.

When it’s a great fit

  • Hot-key updates (same user/session/account)

  • You control the write path (all updates go through this service)

  • You want to avoid DB lock pressure and retry storms


Why single node sometimes feels like “3-way contention”

On one node, you can easily have:

  • Thread A

  • Thread B

  • Thread C
    all trying to update the same row.

With optimistic locking, all 3 may do work and then 2 fail → it feels like 3-way contention.

With pessimistic row locks, 1 runs and 2 block → it’s still 3 actors, but only one active at a time.


Part 2: Multi-node — 3 ways

Now the actors aren’t just threads. They are independent processes on different machines. The difficulty increases because:

  • clocks aren’t perfectly aligned

  • networks partition

  • “who holds the lock” is itself a distributed state problem

The first-principles question becomes:

How do we enforce a single order across nodes?

1) Distributed lock

“One lock everyone agrees on”

First principle: Put the “key” in a shared, strongly coordinated place (Redis, ZooKeeper, etcd, DB advisory lock). Only the node holding the lock may update.

Typical shape:

Acquire lock(userId)
  if acquired:
    update
    release lock
  else:
    wait/retry/fail

Real-life analogy: booking a shared meeting room

Multiple offices can try to book the same room.
Everyone checks the same calendar system. Only one booking wins.

The big gotchas (must know)

Distributed locks can lie unless designed carefully:

  • Node gets lock, then pauses (GC / network hiccup)

  • lock expires

  • another node acquires lock

  • first node resumes and still thinks it owns the lock

Mitigation: fencing tokens
You attach a monotonically increasing token to each lock acquisition and make the DB reject writes with old tokens.

If you don’t have fencing (or equivalent), distributed locks are a common source of “it worked in staging” failures.
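Fencing can be sketched in-memory: the lock service hands out a strictly increasing token on each acquisition, and the storage layer rejects any write carrying a token older than the newest it has seen. (Hypothetical classes; in a real system the token check lives in the database's write path, not next to the lock.)

```java
import java.util.concurrent.atomic.AtomicLong;

class FencedStore {
    // Lock-service side: each acquisition gets a strictly larger token.
    static final AtomicLong tokenCounter = new AtomicLong();
    static long acquireLock() { return tokenCounter.incrementAndGet(); }

    // Storage side: remember the highest token seen, reject stale holders.
    private long highestToken = -1;
    private String value;

    synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestToken) return false; // stale lock holder: reject
        highestToken = fencingToken;
        value = newValue;
        return true;
    }

    synchronized String read() { return value; }
}
```

This reproduces the gotcha above: if node 1 acquires token 1, pauses, and node 2 acquires token 2 and writes, node 1's later write with token 1 is rejected instead of silently clobbering node 2's update.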

When to use

  • You truly need single-writer semantics across nodes

  • You can tolerate extra latency and operational complexity


2) Saga pattern

“Don’t lock globally; coordinate via steps + compensation”

First principle: For multi-service workflows, global locking kills scalability. Instead, make each step local and consistent, and handle failures with compensating actions.

Example saga for “create account + provision wallet + send welcome email”:

  1. Create user in User Service

  2. Provision wallet in Wallet Service

  3. Send welcome email

If step 2 fails → compensate step 1 (delete/disable user), or mark as “pending”.
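A minimal orchestration sketch of this flow (hypothetical step/compensation pairs): run the steps in order, and if one fails, run the compensations of the already-completed steps in reverse.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class Saga {
    // Each step pairs a forward action with its compensating action.
    record Step(String name, Runnable action, Runnable compensation) {}

    // Executes steps in order; on failure, compensates completed steps in reverse.
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) completed.pop().compensation().run();
                return false; // saga aborted and rolled back via compensation
            }
        }
        return true;
    }
}
```

Note what this does not do: it never holds a lock across services. If "provision wallet" fails, "create user" is undone by running its compensation, not by rolling back a global transaction.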

Real-life analogy: planning a trip

You book:

  • flight

  • hotel

  • cab

If the hotel booking fails, you cancel the flight (compensation) instead of trying to hold locks on all providers.

What you get

  • Scalability: high

  • Consistency: eventual (not immediate)

  • Cost: complexity in compensation, idempotency, and observability

When to use

  • Multi-service workflows where immediate, global consistency isn’t required

  • Every step can be made idempotent and has a sensible compensating action

3) Two-phase commit (2PC)

“All participants agree to commit or none do”

First principle: Achieve atomic commit across multiple participants by introducing a coordinator:

  1. Prepare phase: “Can you commit?” (participants lock resources and reply yes/no)

  2. Commit phase: If all said yes, coordinator tells everyone to commit
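The two phases can be sketched with an in-memory coordinator (a hypothetical `Participant` interface; real 2PC also has to persist decisions so a restarted coordinator can finish the protocol):

```java
import java.util.ArrayList;
import java.util.List;

class TwoPhaseCommit {
    interface Participant {
        boolean prepare();  // vote yes/no; a "yes" means resources are locked
        void commit();
        void rollback();
    }

    // Phase 1: collect votes. Phase 2: commit only if every vote was yes.
    static boolean run(List<Participant> participants) {
        List<Participant> prepared = new ArrayList<>();
        for (Participant p : participants) {
            if (p.prepare()) {
                prepared.add(p);
            } else {
                prepared.forEach(Participant::rollback); // one "no" aborts everyone
                return false;
            }
        }
        prepared.forEach(Participant::commit);
        return true;
    }
}
```

The sketch also shows the weakness: between `prepare()` and `commit()`, every participant sits locked waiting on the coordinator, which is why a coordinator failure blocks progress.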

Real-life analogy: escrow for a house purchase

The deal closes only when:

  • buyer’s funds are ready

  • seller’s documents are ready

  • legal checks pass
    Everyone commits together, or nobody does.

The trade-offs

  • Strong consistency, but…

  • Coordinator failure can block progress

  • High latency

  • Poor scalability for high-throughput systems

When to use

  • Rare cases where strict atomicity is mandatory and throughput is not the main goal
    (think: some financial/ledger systems)


Why multi-node often feels like “2-way contention”

In many distributed data systems (distributed DBs / caches), each key has a primary owner. Writes effectively serialize at the owner.

So contention looks like:

  • many clients

  • one owner coordinating
    → “clients vs owner” feels like 2-way.


A practical decision guide

Ask these questions in order:

1) Is this a single entity update or a multi-entity workflow?

  • Single row/entity → start with optimistic/pessimistic/app-level

  • Multi-service workflow → saga or (rarely) 2PC

2) What’s the contention profile?

  • Mostly reads, rare conflicts → optimistic (MVCC)

  • Frequent conflicting writes on same key → pessimistic row lock or app-level serialization

  • Hot-key across nodes → distributed lock or partitioned queue/owner model

3) What consistency do users/business require?

  • Must be immediate and strict → locking / 2PC

  • Can be eventual (with compensation) → saga


Cheat sheet

Single node

  • Row lock + transaction isolation: “block others; strong consistency; watch lock time”

  • MVCC / optimistic: “no blocking; retry on conflict; great for read-heavy”

  • App-level lock/queue: “serialize by key; best for hot keys; must consider multi-node”

Multiple nodes

  • Distributed lock: “single key across nodes; add fencing; operationally heavier”

  • Saga: “no global locks; eventual consistency; compensation required”

  • 2PC: “strongest atomicity; slow and can block; use sparingly”


Closing: the one idea to remember

Every race condition solution is just:

Choose where you serialize updates.
Then pay the price (blocking, retries, coordination overhead, or compensation complexity).

If you pick the serialization point intentionally—and align it with your contention profile—you’ll get both correctness and performance. If you don’t, contention will pick it for you… usually at the worst possible place.


