Concurrency Control from First Principles
3 Strategies on a Single Node, 3 Across Multiple Nodes — with Real-Life Analogies and Future-Ready Patterns
Credits / Acknowledgements
This article is based on deep technical discussions and whiteboarding sessions with Sourabh Kumar Banka and Jatin Goyal.
Why This Matters (Now and in the Future)
Most real production failures are not caused by wrong business logic. They are caused by incorrect ordering of updates.
As systems scale — microservices, distributed caches, cloud-native deployments, async retries, autoscaling — concurrency issues increase, not decrease.
If you remember only one thing from this article:
Concurrency control is about deciding where updates become ordered (serialized) — and intentionally paying the right trade-off.
Every correct system enforces order somewhere:
Database
Application
Distributed coordinator
Event log
Workflow engine
If you don’t choose where, contention will choose for you.
First Principles: Why Race Conditions Exist
A race condition requires three ingredients:
Shared mutable state
A database row, cache entry, account balance, session object.
Concurrent actors
Threads, processes, containers, retries, background jobs.
Non-atomic update
The operation happens as:
Read → Compute → Write
When two actors execute this pattern simultaneously, invariants break.
Concrete Example
USER(id, name, phone, version)
Two requests update id = 1:
Request A → name = "jating"
Request B → name = "jatink"
If both read "jatin" and both write, one update is lost.
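The lost update above can be reproduced in a few lines. This is a minimal in-memory sketch (a dict standing in for the database row, and a barrier forcing both actors to read before either writes):

```python
import threading

# Shared mutable state: a user row held in memory for illustration.
user = {"id": 1, "name": "jatin", "version": 7}

barrier = threading.Barrier(2)

def update_name(new_name):
    snapshot = dict(user)        # Read
    barrier.wait()               # both actors have now read the old value
    snapshot["name"] = new_name  # Compute
    user.update(snapshot)        # Write — blindly overwrites the other actor

a = threading.Thread(target=update_name, args=("jating",))
b = threading.Thread(target=update_name, args=("jatink",))
a.start(); b.start(); a.join(); b.join()

# One of the two writes survives; the other is silently lost.
print(user["name"])
```

Whichever thread writes last wins; the other request's update vanishes without any error being raised.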
Real-Life Analogy
Two people editing the same document offline:
Both download it.
Both edit.
Both upload.
Without coordination or version checks, one set of changes disappears.
Part 1 — Single Node: 3 Core Strategies
On a single node, concurrency usually means multiple threads inside one service instance.
1️⃣ Row Locks + Transaction Isolation
“Hold the key while you update”
First principle: Make the update mutually exclusive by locking the row.
BEGIN;
SELECT *
FROM user
WHERE id = 1
FOR UPDATE;
UPDATE user
SET name = 'jatink'
WHERE id = 1;
COMMIT;
What Happens
First transaction locks row.
Others wait.
Updates are serialized.
Real-Life Analogy
There is only one key to a secure room.
If you want to rearrange the room, you must hold the key. Others wait.
Pros
Strong correctness
No retries needed
Simple mental model
Cons
Blocking under contention
Deadlocks if multiple rows locked
Throughput degrades with long transactions
Future-Ready Guidance
Keep critical sections extremely short.
Never hold DB locks while calling external systems.
Monitor lock wait time as a first-class metric.
2️⃣ MVCC / Optimistic Locking
“Don’t block; detect conflicts”
First principle: Assume conflicts are rare. Allow concurrent execution and detect conflict at write time.
SELECT id, version
FROM user
WHERE id = 1;
UPDATE user
SET name = 'jatink',
version = version + 1
WHERE id = 1
AND version = 7;
If zero rows updated → conflict → retry or fail.
Real-Life Analogy
Google Docs warns:
“This file was modified since you opened it.”
You merge or retry.
Pros
No blocking
High throughput for read-heavy systems
Scales well when contention is low
Cons
Retry storms under high contention
Wasted CPU work
Can overload DB if misused
Production-Ready Retry Strategy
Max 3–5 retries
Exponential backoff
Add jitter
Return conflict after threshold
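The retry strategy above can be sketched as a small wrapper. This is an illustrative helper, not a specific library API; `ConflictError` stands in for "the versioned UPDATE matched zero rows":

```python
import random
import time

class ConflictError(Exception):
    """Raised when the versioned UPDATE matched zero rows."""

def update_with_retry(attempt_update, max_retries=4, base_delay=0.05):
    """Run an optimistic update, retrying on conflict with backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return attempt_update()
        except ConflictError:
            if attempt == max_retries - 1:
                raise  # retry budget spent: surface the conflict to the caller
            # Exponential backoff with full jitter to spread out competing retries.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Bounding retries and adding jitter is what prevents a burst of conflicting writers from hammering the database in lockstep.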
Future-Ready Guidance
Track conflict rate as a metric.
Implement idempotency keys for external effects.
Avoid optimistic locking for hot keys with high write frequency.
3️⃣ Application-Level Serialization
“Single writer per key”
First principle: Serialize updates before reaching the database.
Approaches:
Striped locks (hash(key) → lock)
Actor model (per-key queue)
Partitioned executor
Conceptually:
hash(userId) → partition → single worker → DB write
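The striped-lock approach can be sketched as follows — an in-process illustration where a dict stands in for the database and updates to the same key always serialize through the same lock:

```python
import threading

class StripedLocks:
    """hash(key) -> one of N locks, so updates to the same key serialize."""
    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]

    def lock_for(self, key):
        return self._locks[hash(key) % len(self._locks)]

locks = StripedLocks()
balances = {"user:1": 0}

def deposit(key, amount):
    with locks.lock_for(key):             # single writer per stripe
        current = balances[key]           # read
        balances[key] = current + amount  # compute + write, now atomic per key
```

Keys sharing a stripe contend with each other, so the stripe count trades memory for reduced false contention.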
Real-Life Analogy
A dedicated clerk handles all changes for a specific account.
No two clerks edit the same account simultaneously.
Pros
Eliminates retry storms
Reduces DB lock pressure
Predictable ordering
Stable latency under contention
Cons
Works only per node unless combined with routing
Requires architectural discipline
Future-Ready Guidance
To scale across nodes:
Use consistent hashing + sticky routing
Or partition via an event log (Kafka-style)
Or assign key ownership per node
This pattern is highly scalable and often superior to DB locking for hot entities.
Part 2 — Multi-Node Systems: 3 Distributed Strategies
When multiple nodes are involved, concurrency is harder:
Nodes crash
Networks partition
Clocks drift
GC pauses happen
Now the question becomes:
How do we enforce ordering across machines?
1️⃣ Distributed Lock
“Shared key everyone agrees on”
Use Redis/etcd/ZooKeeper/DB advisory locks.
Acquire lock(key)
if success → update
else → wait/retry/fail
Real-Life Analogy
Shared meeting room booking calendar.
Everyone consults the same system.
Critical Future-Proof Detail: Fencing Tokens
Locks can expire. Nodes can pause.
To avoid stale writers:
Each lock acquisition returns a monotonically increasing token.
Database rejects writes with older tokens.
Without fencing, distributed locks can corrupt data.
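Fencing can be sketched with two toy components — a lock service handing out monotonically increasing tokens, and a store that rejects any write carrying a token older than the newest it has seen (both are illustrative stand-ins, not a real lock server):

```python
class LockService:
    """Each acquisition hands out a monotonically increasing token."""
    def __init__(self):
        self._counter = 0

    def acquire(self, key):
        self._counter += 1
        return self._counter

class FencedStore:
    """A store that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.data = {}
        self.highest_token = {}  # key -> largest token seen so far

    def write(self, key, value, token):
        if token < self.highest_token.get(key, 0):
            raise PermissionError(f"stale token {token} for {key}")
        self.highest_token[key] = token
        self.data[key] = value
```

A writer that acquired the lock, then stalled in a GC pause while its lock expired, will resurface holding an old token — and the store refuses it.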
Use When
You need single-writer semantics across nodes.
Contention is moderate.
Avoid When
Ultra high throughput needed.
Lock server becomes bottleneck.
2️⃣ Saga Pattern
“Commit locally, compensate globally”
First principle: Instead of locking everything, break workflow into steps and undo on failure.
Example:
Create user
Provision wallet
Send email
If wallet fails → disable user (compensation).
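The workflow above can be sketched as a saga runner: execute steps in order, and on failure run the compensations of completed steps in reverse. This is a minimal in-process sketch; real sagas persist state so they survive crashes mid-workflow:

```python
class SagaFailure(Exception):
    pass

def run_saga(steps):
    """steps: list of (action, compensate) pairs.
    On any failure, undo completed steps in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo()  # compensations must be idempotent
            raise SagaFailure("saga rolled back via compensations")
```

Because a crash can strike between any two steps, both actions and compensations must be safe to re-run.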
Real-Life Analogy
Booking travel.
If hotel fails, cancel flight.
Production-Ready Saga Requirements
Idempotent steps
Outbox pattern for reliable events
Deduplication / inbox pattern
Clear state transitions
Use When
Cross-service workflows
Long-running operations
Eventual consistency acceptable
3️⃣ Two-Phase Commit (2PC)
“All commit or none commit”
Coordinator asks all participants:
Prepare?
Commit.
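The two phases can be sketched as a toy coordinator (failure handling like coordinator crashes and participant timeouts, the hard part of real 2PC, is deliberately not modeled):

```python
def two_phase_commit(participants):
    """Phase 1: every participant votes. Phase 2: commit only if all voted yes."""
    # Phase 1: prepare — each participant promises it *can* commit.
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote aborts everyone.
    for p in participants:
        p.abort()
    return "aborted"
```

The blocking risk is visible here: between a "yes" vote and the coordinator's decision, a participant is stuck holding its locks.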
Real-Life Analogy
Escrow closing: funds and documents must align.
Pros
Strong atomicity
Cons
Blocking protocol
Coordinator failure risk
High latency
Poor scalability
Use Sparingly
Only where strict atomicity is legally or financially required.
Making This Future-Ready
Modern distributed systems introduce additional challenges:
1️⃣ Idempotency Everywhere
Every external call must be safe to retry.
Use idempotency keys.
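The mechanic is simple: store the result under the caller-supplied key, and return the stored result on replay. A minimal sketch (the dict stands in for a durable idempotency table; `charge_once` and `do_charge` are hypothetical names):

```python
processed = {}  # idempotency_key -> stored result (a durable table in practice)

def charge_once(idempotency_key, amount, do_charge):
    """Replay-safe external call: a retried request returns the stored result
    instead of charging again."""
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = do_charge(amount)
    processed[idempotency_key] = result
    return result
```

With this in place, a retry storm becomes harmless: only the first request with a given key performs the side effect.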
2️⃣ Observability
Track:
Lock wait time
Conflict rate
Retry count
Deadlocks
Saga compensations
Concurrency problems hide without metrics.
3️⃣ Backpressure and Load Shedding
Unbounded retries destroy systems.
Apply limits and fail gracefully.
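One simple shape for this is a concurrency limiter that fails fast instead of queueing — a sketch, assuming a fixed admission limit is acceptable:

```python
import threading

class LoadShedder:
    """Admit at most `limit` concurrent operations; reject the rest fast."""
    def __init__(self, limit):
        self._sem = threading.BoundedSemaphore(limit)

    def run(self, fn):
        if not self._sem.acquire(blocking=False):
            # Shedding: better an immediate error than an unbounded queue.
            raise RuntimeError("at capacity, fail fast")
        try:
            return fn()
        finally:
            self._sem.release()
```

Rejected callers get a clear signal to back off, which keeps latency bounded for the requests that are admitted.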
4️⃣ Partitioned Ownership (Highly Scalable Model)
Instead of global locking:
Assign key ownership.
Route updates by consistent hashing.
Treat partitions as single-writer streams.
This model scales horizontally and avoids distributed locking.
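The routing step can be sketched as a deterministic key-to-owner mapping (hash-mod for brevity here; a production system would use consistent hashing so that adding a node reshuffles only a fraction of keys — the node names are hypothetical):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def owner_of(key, nodes=NODES):
    """Deterministically route a key to its single owning node."""
    digest = hashlib.sha256(key.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]
```

Because every router computes the same owner for a key, all writes for that key funnel to one node, which can then apply them as a single-writer stream.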
5️⃣ Consider CRDTs (When Applicable)
For some data types (counters, sets, collaborative docs), conflict-free replicated data types remove the need for coordination entirely.
But they require careful domain modeling.
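A grow-only counter (G-Counter), one of the simplest CRDTs, shows the idea: each replica increments only its own slot, and merging takes the per-replica maximum, so all replicas converge without any coordination:

```python
class GCounter:
    """Grow-only counter CRDT: increments are per-node, merge is per-node max,
    so concurrent replicas converge to the same total."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

    def value(self):
        return sum(self.counts.values())
```

The catch is domain modeling: this works because increments commute; a counter that must also decrement, or enforce a floor, needs a different (or no) CRDT.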
Decision Framework
Single Node
| Situation | Strategy |
|---|---|
| Low contention | Optimistic locking |
| High contention | Row lock |
| Hot key pattern | App-level serialization |
Multiple Nodes
| Situation | Strategy |
|---|---|
| Need strict mutual exclusion | Distributed lock (+ fencing) |
| Business workflow | Saga |
| Strong atomic commit required | 2PC |
Final Takeaway
Every concurrency solution enforces ordering somewhere:
Database
Application
Coordinator
Partitioned log
Workflow engine
Your job is not to eliminate contention.
Your job is to decide:
Where ordering lives
What trade-off you accept
How the system behaves under extreme load
Design intentionally — or production traffic will design it for you.