Concurrency Control from First Principles

3 Strategies on a Single Node, 3 Across Multiple Nodes — with Real-Life Analogies and Future-Ready Patterns

Credits / Acknowledgements
This article is based on deep technical discussions and whiteboarding sessions with Sourabh Kumar Banka and Jatin Goyal.


Why This Matters (Now and in the Future)

Most real production failures are not caused by wrong business logic. They are caused by incorrect ordering of updates.

As systems scale — microservices, distributed caches, cloud-native deployments, async retries, autoscaling — concurrency issues increase, not decrease.

If you remember only one thing from this article:

Concurrency control is about deciding where updates become ordered (serialized) — and intentionally paying the right trade-off.

Every correct system enforces order somewhere:

  • Database

  • Application

  • Distributed coordinator

  • Event log

  • Workflow engine

If you don’t choose where, contention will choose for you.


First Principles: Why Race Conditions Exist

A race condition requires three ingredients:

  1. Shared mutable state
    A database row, cache entry, account balance, session object.

  2. Concurrent actors
    Threads, processes, containers, retries, background jobs.

  3. Non-atomic update
    The operation happens as:

Read → Compute → Write

When two actors execute this pattern simultaneously, invariants break.


Concrete Example

USER(id, name, phone, version)

Two requests update id = 1:

  • Request A → name = "jating"

  • Request B → name = "jatink"

If both read "jatin" and both write, one update is lost.
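The lost update can be reproduced deterministically by interleaving the Read → Compute → Write steps by hand. A minimal Python sketch, where the `db` dict stands in for the USER table:

```python
# Simulate two requests racing on USER(id=1): both read, then both write.
db = {1: {"name": "jatin", "version": 7}}

def read_user(user_id):
    # Each request gets its own snapshot of the row.
    return dict(db[user_id])

def write_user(user_id, row):
    # Blind write: last writer wins, no version check.
    db[user_id] = row

# Both requests read BEFORE either writes: the racy interleaving.
snapshot_a = read_user(1)
snapshot_b = read_user(1)

snapshot_a["name"] = "jating"
write_user(1, snapshot_a)      # Request A's update lands...

snapshot_b["name"] = "jatink"
write_user(1, snapshot_b)      # ...and Request B silently overwrites it.

print(db[1]["name"])  # "jatink" -- Request A's write is lost
```

Every strategy in the rest of this article is a different answer to where this interleaving gets forbidden.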


Real-Life Analogy

Two people editing the same document offline:

  • Both download it.

  • Both edit.

  • Both upload.

Without coordination or version checks, one set of changes disappears.


Part 1 — Single Node: 3 Core Strategies

On a single node, concurrency usually means multiple threads inside one service instance.


1️⃣ Row Locks + Transaction Isolation

“Hold the key while you update”

First principle: Make the update mutually exclusive by locking the row.

BEGIN;

SELECT *
FROM user
WHERE id = 1
FOR UPDATE;

UPDATE user
SET name = 'jatink'
WHERE id = 1;

COMMIT;

What Happens

  • First transaction locks row.

  • Others wait.

  • Updates are serialized.

Real-Life Analogy

There is only one key to a secure room.
If you want to rearrange the room, you must hold the key. Others wait.

Pros

  • Strong correctness

  • No retries needed

  • Simple mental model

Cons

  • Blocking under contention

  • Deadlocks if multiple rows locked

  • Throughput degrades with long transactions

Future-Ready Guidance

  • Keep critical sections extremely short.

  • Never hold DB locks while calling external systems.

  • Monitor lock wait time as a first-class metric.


2️⃣ MVCC / Optimistic Locking

“Don’t block; detect conflicts”

First principle: Assume conflicts are rare. Allow concurrent execution and detect conflict at write time.

SELECT id, version
FROM user
WHERE id = 1;

UPDATE user
SET name = 'jatink',
    version = version + 1
WHERE id = 1
  AND version = 7;

If zero rows updated → conflict → retry or fail.

Real-Life Analogy

Google Docs warns:

“This file was modified since you opened it.”

You merge or retry.

Pros

  • No blocking

  • High throughput for read-heavy systems

  • Scales well when contention is low

Cons

  • Retry storms under high contention

  • Wasted CPU work

  • Can overload DB if misused

Production-Ready Retry Strategy

  • Max 3–5 retries

  • Exponential backoff

  • Add jitter

  • Return conflict after threshold
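The bullets above can be sketched as a small retry wrapper. This is an illustrative Python sketch; `try_update` and the backoff constants are assumptions, not from a specific library:

```python
import random
import time

def with_retries(try_update, max_retries=5, base_delay=0.01):
    """Run an optimistic update; retry on conflict with exponential backoff + jitter."""
    for attempt in range(max_retries):
        if try_update():   # True = the versioned UPDATE matched a row (no conflict)
            return True
        # 0.01s, 0.02s, 0.04s, ... plus jitter to de-synchronize competing writers.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return False           # Threshold reached: surface a conflict to the caller.

# Example: an update that conflicts twice, then succeeds on the third attempt.
attempts = {"count": 0}

def flaky_update():
    attempts["count"] += 1
    return attempts["count"] >= 3

result = with_retries(flaky_update)
print(result, attempts["count"])  # True 3
```

The jitter matters: without it, all conflicting writers wake up at the same instant and collide again.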

Future-Ready Guidance

  • Make optimistic locking the default for low-contention entities.

  • Monitor conflict rate; if it climbs, move hot keys to row locks or app-level serialization.

3️⃣ Application-Level Serialization

“Single writer per key”

First principle: Serialize updates before reaching the database.

Approaches:

  • Striped locks (hash(key) → lock)

  • Actor model (per-key queue)

  • Partitioned executor

Conceptually:

hash(userId) → partition → single worker → DB write
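A minimal sketch of the first approach (striped locks), assuming Python threads as the concurrent actors and an in-memory dict as the store:

```python
import threading

NUM_STRIPES = 16
stripes = [threading.Lock() for _ in range(NUM_STRIPES)]

def lock_for(key):
    # hash(key) -> stripe: all updates for one key serialize on the same lock,
    # while different keys mostly proceed in parallel.
    return stripes[hash(key) % NUM_STRIPES]

balances = {"user-1": 0}

def deposit(key, amount):
    with lock_for(key):
        current = balances[key]           # read
        balances[key] = current + amount  # compute + write, now atomic per key

threads = [threading.Thread(target=deposit, args=("user-1", 1)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balances["user-1"])  # 100 -- no lost updates
```

The actor-model variant replaces the lock with a per-key queue and a single consumer, which additionally gives predictable ordering.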

Real-Life Analogy

A dedicated clerk handles all changes for a specific account.
No two clerks edit the same account simultaneously.

Pros

  • Eliminates retry storms

  • Reduces DB lock pressure

  • Predictable ordering

  • Stable latency under contention

Cons

  • Works only per node unless combined with routing

  • Requires architectural discipline

Future-Ready Guidance

To scale across nodes:

  • Use consistent hashing + sticky routing

  • Or partition via an event log (Kafka-style)

  • Or assign key ownership per node

This pattern is highly scalable and often superior to DB locking for hot entities.


Part 2 — Multi-Node Systems: 3 Distributed Strategies

When multiple nodes are involved, concurrency is harder:

  • Nodes crash

  • Networks partition

  • Clocks drift

  • GC pauses happen

Now the question becomes:

How do we enforce ordering across machines?


1️⃣ Distributed Lock

“Shared key everyone agrees on”

Use Redis/etcd/ZooKeeper/DB advisory locks.

Acquire lock(key)
  if success → update
  else → wait/retry/fail

Real-Life Analogy

Shared meeting room booking calendar.
Everyone consults the same system.

Critical Future-Proof Detail: Fencing Tokens

Locks can expire. Nodes can pause.

To avoid stale writers:

  • Each lock acquisition returns a monotonically increasing token.

  • Database rejects writes with older tokens.

Without fencing, distributed locks can corrupt data.
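The token check can be sketched in a few lines. This is a toy in-memory model (the class names are illustrative); in production the token comes from the lock service and the comparison happens in the datastore:

```python
import itertools

class LockService:
    """Issues a monotonically increasing fencing token with each acquisition."""
    def __init__(self):
        self._counter = itertools.count(1)
    def acquire(self, key):
        return next(self._counter)   # token 1, 2, 3, ...

class FencedStore:
    """Rejects writes carrying a token older than the newest one seen per key."""
    def __init__(self):
        self.data = {}
        self.highest_token = {}
    def write(self, key, value, token):
        if token < self.highest_token.get(key, 0):
            return False             # Stale writer: its lock expired long ago.
        self.highest_token[key] = token
        self.data[key] = value
        return True

locks, store = LockService(), FencedStore()
old = locks.acquire("user:1")   # token 1; holder then pauses (GC, network...)
new = locks.acquire("user:1")   # lock expired and was reacquired with token 2
store.write("user:1", "fresh", new)
accepted = store.write("user:1", "stale", old)   # rejected: token 1 < 2
print(store.data["user:1"], accepted)  # fresh False
```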

Use When

  • You need single-writer semantics across nodes.

  • Contention is moderate.

Avoid When

  • Ultra high throughput needed.

  • Lock server becomes bottleneck.


2️⃣ Saga Pattern

“Commit locally, compensate globally”

First principle: Instead of locking everything, break workflow into steps and undo on failure.

Example:

  1. Create user

  2. Provision wallet

  3. Send email

If wallet fails → disable user (compensation).
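The control flow can be sketched as a list of (action, compensation) pairs. A minimal Python sketch; the step and compensation names are illustrative:

```python
def fail_wallet():
    raise RuntimeError("wallet provisioning failed")

def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, undo completed steps in reverse."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):   # compensate in reverse order
                undo()
            return False
        completed.append(compensation)
    return True

log = []
steps = [
    (lambda: log.append("user created"), lambda: log.append("user disabled")),
    (fail_wallet,                        lambda: log.append("wallet removed")),
    (lambda: log.append("email sent"),   lambda: log.append("email recalled")),
]
ok = run_saga(steps)
print(ok, log)  # False ['user created', 'user disabled']
```

Note that only completed steps are compensated: the wallet step failed before committing, so only the user-creation step is undone.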

Real-Life Analogy

Booking travel.
If hotel fails, cancel flight.

Production-Ready Saga Requirements

  • Idempotent steps

  • Outbox pattern for reliable events

  • Deduplication / inbox pattern

  • Clear state transitions

Use When

  • Cross-service workflows

  • Long-running operations

  • Eventual consistency acceptable


3️⃣ Two-Phase Commit (2PC)

“All commit or none commit”

Coordinator asks all participants:

  1. Prepare: can you commit?

  2. Commit: only if every participant voted yes; otherwise everyone aborts.
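The two phases can be sketched in miniature (illustrative class names, no failure handling for the coordinator itself, which is exactly the protocol's weak point):

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "pending"
    def prepare(self):
        return self.can_commit       # vote yes or no
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes; a single "no" dooms the transaction.
    if all(p.prepare() for p in participants):
        for p in participants:       # Phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:           # Phase 2 (abort branch): abort everywhere
        p.abort()
    return False

ps = [Participant(), Participant(can_commit=False)]
ok = two_phase_commit(ps)
print(ok)  # False -- one "no" vote aborts the whole transaction
```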

Real-Life Analogy

Escrow closing: funds and documents must align.

Pros

  • Strong atomicity

Cons

  • Blocking protocol

  • Coordinator failure risk

  • High latency

  • Scales poorly

Use Sparingly

Only where strict atomicity is legally or financially required.


Making This Future-Ready

Modern distributed systems introduce additional challenges:

1️⃣ Idempotency Everywhere

  • Every external call must be safe to retry.

  • Use idempotency keys.
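An idempotency key turns a retried call into a cache lookup. A minimal in-memory sketch (a real service would persist the seen keys alongside the state change, e.g. in the same transaction):

```python
processed = {}   # idempotency_key -> cached result (the "inbox" of seen requests)

def charge(idempotency_key, amount, ledger):
    """Apply a charge at most once per key; retries return the cached result."""
    if idempotency_key in processed:
        return processed[idempotency_key]     # duplicate request: no double charge
    ledger.append(amount)                     # the actual side effect
    processed[idempotency_key] = {"status": "charged", "amount": amount}
    return processed[idempotency_key]

ledger = []
charge("req-42", 100, ledger)
charge("req-42", 100, ledger)   # client retry after a timeout
print(len(ledger))  # 1 -- the retry was deduplicated
```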

2️⃣ Observability

Track:

  • Lock wait time

  • Conflict rate

  • Retry count

  • Deadlocks

  • Saga compensations

Concurrency problems hide without metrics.

3️⃣ Backpressure and Load Shedding

Unbounded retries destroy systems.
Apply limits and fail gracefully.

4️⃣ Partitioned Ownership (Highly Scalable Model)

Instead of global locking:

  • Assign key ownership.

  • Route updates by consistent hashing.

  • Treat partitions as single-writer streams.

This model scales horizontally and avoids distributed locking.
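The routing step can be sketched with a simple consistent-hash ring (virtual nodes smooth out the key distribution; the node names are illustrative):

```python
import bisect
import hashlib

class HashRing:
    """Maps each key to exactly one owning node; only that node writes the key."""
    def __init__(self, nodes, vnodes=100):
        # Place vnodes points per node on the ring, sorted by hash.
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def owner(self, key):
        # First ring point clockwise from the key's hash owns the key.
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
# The same key always routes to the same single-writer node.
print(ring.owner("user:1") == ring.owner("user:1"))  # True
```

Because ownership is deterministic, each partition behaves like the single-node "single writer per key" pattern from Part 1, just spread across machines.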

5️⃣ Consider CRDTs (When Applicable)

For some data types (counters, sets, collaborative docs), conflict-free replicated data types remove the need for coordination entirely.

But they require careful domain modeling.
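The simplest CRDT, a grow-only counter, shows the idea: replicas update independently and converge when they merge state. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge = elementwise max."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        # Each replica only ever bumps its own slot, so merges never conflict.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

# Two replicas increment concurrently with no coordination...
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
# ...and converge to the same value once they exchange state.
print(a.value(), b.value())  # 5 5
```

The trade-off: the data type must be designed so that merges commute, which is why CRDTs demand careful domain modeling.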


Decision Framework

Single Node

Situation                        Strategy
Low contention                   Optimistic locking
High contention                  Row lock
Hot key pattern                  App-level serialization

Multiple Nodes

Situation                        Strategy
Need strict mutual exclusion     Distributed lock (+ fencing)
Business workflow                Saga
Strong atomic commit required    2PC

Final Takeaway

Every concurrency solution enforces ordering somewhere:

  • Database

  • Application

  • Coordinator

  • Partitioned log

  • Workflow engine

Your job is not to eliminate contention.

Your job is to decide:

  • Where ordering lives

  • What trade-off you accept

  • How the system behaves under extreme load

Design intentionally — or production traffic will design it for you.


