Low Water Mark and High Water Mark in System Design

September 09, 2025

When designing large-scale distributed systems, we deal with thresholds all the time: memory, queues, replication logs, caches, and network buffers. Two key concepts that help maintain stability and efficiency are the High Water Mark (HWM) and the Low Water Mark (LWM).

They are critical in replication protocols, messaging queues, and quorum systems.

🔹 High Water Mark (HWM)

In system design, the HWM is the upper threshold at which the system takes corrective action (throttling, blocking, or refusing new work) to avoid overload or inconsistency.

It prevents overflow, ensures stability, and protects resources.

🔹 Low Water Mark (LWM)

The LWM is the safe lower threshold that signals the system has recovered enough to resume normal operations.

It prevents rapid toggling between "block/unblock" states (avoids thrashing).

🔹 Why Do We Need Both?

The gap between HWM and LWM provides hysteresis: the system blocks at one level and unblocks at a lower one, which keeps control smooth.

Without it, systems would keep toggling rapidly whenever they touch the threshold.
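
To make the toggling problem concrete, here is a minimal Python sketch that counts state flips for a load hovering around one threshold. The function name count_toggles and the sample levels are made up for illustration; setting high equal to low models a single threshold with no gap.

```python
def count_toggles(levels, high, low):
    """Count how often the 'blocked' state flips over a series of load levels.

    Blocks when a level reaches `high`, unblocks when it falls to `low`.
    Passing high == low models a single threshold with no hysteresis.
    """
    blocked = False
    toggles = 0
    for level in levels:
        if not blocked and level >= high:
            blocked = True
            toggles += 1
        elif blocked and level <= low:
            blocked = False
            toggles += 1
    return toggles


# Hypothetical load samples hovering right around 800.
levels = [790, 800, 795, 805, 798, 802, 799, 801]

single = count_toggles(levels, high=800, low=800)  # one threshold
banded = count_toggles(levels, high=800, low=400)  # HWM/LWM gap
print(single, banded)
```

With a single threshold the state flips on almost every sample; with the gap it flips once and stays put until the load really drops.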

🔹 Applications in System Design

1. Replication (Databases & Logs)

In replication protocols such as Raft (where the HWM is called the commit index) or Kafka's log replication:

High Water Mark (HWM):
The highest log entry known to be safely replicated across a quorum of followers.
- Leaders commit up to this point.
- Clients see only data up to the HWM as "committed"; entries beyond it are not yet visible.
Low Water Mark (LWM):
The oldest log entry still retained across replicas; entries before it have been truncated or garbage collected.
- Ensures lagging replicas can still catch up.

👉 Example:
In Kafka:

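HWM = the offset up to which records have been replicated to all in-sync replicas (ISR); consumers can read only up to this point.
LWM = the earliest offset still retained (the log start offset); older records have already been deleted, for example by the retention policy.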
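
The two definitions can be sketched in a few lines of Python. This is a simplified model, not Kafka's API: the Replica class, its field names, and the sample offsets are assumptions made for illustration. The HWM is taken as the smallest log end offset among in-sync replicas, and the LWM as the smallest log start offset across replicas.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    log_start_offset: int  # earliest offset this replica still retains
    log_end_offset: int    # offset the next appended record would get
    in_sync: bool          # whether the replica is currently in the ISR


def high_watermark(replicas):
    """Smallest log end offset among in-sync replicas.

    Offsets below this value exist on every in-sync replica.
    """
    return min(r.log_end_offset for r in replicas if r.in_sync)


def low_watermark(replicas):
    """Smallest log start offset across all replicas."""
    return min(r.log_start_offset for r in replicas)


# Hypothetical partition state: one leader, one caught-up follower,
# and one follower that has fallen out of the ISR.
replicas = [
    Replica("leader", log_start_offset=100, log_end_offset=250, in_sync=True),
    Replica("follower-1", log_start_offset=100, log_end_offset=240, in_sync=True),
    Replica("follower-2", log_start_offset=90, log_end_offset=180, in_sync=False),
]

hwm = high_watermark(replicas)
lwm = low_watermark(replicas)
print(hwm, lwm)
```

Note that the lagging follower does not hold the HWM back, because it is no longer in the ISR.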

2. Message Queues (Kafka, RabbitMQ, JMS)

Queues need flow control between producers and consumers.

HWM (queue length threshold):
If the queue size reaches this point, producers are throttled or blocked to avoid memory overflow.
LWM (queue recovery point):
Once the queue drains below this level, producers are unblocked and can resume publishing freely.

👉 Example:

Queue capacity = 1000 messages.
HWM = 800 → backpressure kicks in and producers slow down or block.
LWM = 400 → safe for producers to resume.
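
The numbers above can be turned into a small in-memory sketch. The WatermarkQueue class and its method names are hypothetical and not taken from Kafka, RabbitMQ, or JMS; it is single-threaded, and it treats reaching the LWM exactly as "safe to resume", which is a design choice.

```python
from collections import deque


class WatermarkQueue:
    """Bounded FIFO queue that signals backpressure with two watermarks."""

    def __init__(self, capacity=1000, high=800, low=400):
        if not 0 <= low < high <= capacity:
            raise ValueError("expected 0 <= low < high <= capacity")
        self.capacity, self.high, self.low = capacity, high, low
        self._items = deque()
        self.paused = False  # True while producers should hold off

    def offer(self, item):
        """Try to enqueue; return False if the producer must back off."""
        if self.paused or len(self._items) >= self.capacity:
            return False
        self._items.append(item)
        if len(self._items) >= self.high:
            self.paused = True  # reached the HWM: start backpressure
        return True

    def poll(self):
        """Dequeue one item (None if empty); resume producers at the LWM."""
        if not self._items:
            return None
        item = self._items.popleft()
        if self.paused and len(self._items) <= self.low:
            self.paused = False  # drained to the LWM: safe to resume
        return item

    def __len__(self):
        return len(self._items)


q = WatermarkQueue()
accepted = sum(q.offer(i) for i in range(900))  # producer burst
print(accepted, q.paused)

for _ in range(400):  # consumer drains 400 messages
    q.poll()
print(len(q), q.paused)
```

A real broker would block or throttle the producer rather than return False, but the state machine is the same: pause at the HWM, resume only at the LWM.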

3. Quorum Systems (Consensus Protocols)

In quorum-based systems (Raft, Paxos, ZooKeeper, Cassandra), commit and cleanup decisions often depend on watermarks:

High Water Mark (HWM):
The highest operation acknowledged by a quorum of nodes (majority).
- Ensures durability and strong consistency.
- Example: a Raft leader advances its commit index (its HWM) to an entry only once that entry has been replicated to a majority.
Low Water Mark (LWM):
The oldest point that some replica may still need; everything before it has been acknowledged by every replica that must be able to catch up.
- Used for safe cleanup of logs or state machine snapshots.
- Prevents removing entries still needed by lagging nodes.

👉 Example:
In ZooKeeper's Zab protocol:

HWM = last transaction committed after acknowledgment by a quorum.
LWM = last transaction safe to garbage collect.
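
As a rough illustration of how both marks can be derived from per-node acknowledgments, here is a Python sketch. The function names and the acked map are hypothetical. It deliberately leaves out protocol-specific rules (for example, Raft only commits entries from the leader's current term by counting replicas) and snapshot-based cleanup.

```python
def quorum_high_watermark(acked):
    """Highest log index acknowledged by a majority of nodes.

    `acked` maps node name -> highest log index that node has acknowledged
    (the leader counts as a node).
    """
    ordered = sorted(acked.values(), reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest value is held by at least `majority` nodes.
    return ordered[majority - 1]


def safe_cleanup_point(acked):
    """Highest log index every node has acknowledged.

    Entries at or below this index are not needed for catch-up by any
    node listed in `acked`.
    """
    return min(acked.values())


# Hypothetical five-node cluster with two lagging nodes.
acked = {"n1": 12, "n2": 12, "n3": 9, "n4": 7, "n5": 4}

hwm = quorum_high_watermark(acked)
lwm = safe_cleanup_point(acked)
print(hwm, lwm)
```

Here three of five nodes hold index 9, so 9 is the commit point, while cleanup has to stop at 4 until the slowest node catches up or is served from a snapshot.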

🔹 Visualization

Replication Logs Example
--------------------------------------------------
0   10   20   30   40   50   60   70   80   90  100
^                 ^                     ^
|                 |                     |
LWM           Retained zone            HWM
(earliest     committed entries   (highest fully
 kept logs)   kept for catch-up   committed entry)

Entries past the HWM (80 to 100 here) are in flight: appended but not yet committed.

🔹 Key Takeaways

Replication:
- HWM = last safely replicated log entry.
- LWM = earliest retained log for lagging nodes.
Queues:
- HWM = max queue threshold → apply backpressure.
- LWM = min queue threshold → resume producers.
Quorum:
- HWM = highest operation confirmed by a majority → commit point.
- LWM = safe cleanup threshold.

👉 In short:

HWM protects against overload/inconsistency.
LWM ensures recovery & smooth operation.
Together, they balance safety, performance, and stability in distributed systems.
