A Robust Modbus Proxy: Reconnect & Stale-Cache

Table of Contents

The failure mode no beginner guide covers
Step 1 — per-batch timeout and the 50 ms spacing
Step 2 — fast-retry-then-backoff and the stale-cache warning
Step 3 — client idle timeout against dead connections
Configuration — never real LAN IPs
Numbers from long-term operation
Frequently asked questions

A self-built Modbus cache proxy runs for weeks without complaint in summer — until the first night the inverter shuts down, or the first firmware reboot of the SDongle. That's exactly when you find out whether you built a proxy or a time bomb. My first attempt was naive: poll, cache, serve. It worked perfectly during the day. At night, when the SUN2000 went to sleep, the poll loop hung in a read that never returned — and for hours the proxy silently served the last daytime values as if nothing was wrong.

That's the dangerous failure: not the crash (you notice that), but the proxy that keeps running and serves stale data while nobody is the wiser. This post is the reliability playbook that turned my proxy into something I trust. I build the proxy itself in the Modbus caching basics post — here it's purely about the robustness underneath.

The failure mode no beginner guide covers

Tutorials show the happy path: connect to the SDongle, read registers, done. What they leave out: the SDongle is slow, opinionated hardware. It drops the connection at night, it needs a minute after a firmware reboot before it answers again, and it can't handle reads fired back-to-back. A naive asyncio read without a timeout then blocks forever, and your cache freezes on its last value. HA dutifully keeps showing numbers — they're just no longer true.

The fix has three pillars. First, every single read gets a hard timeout and the spacing the SDongle needs. Second, the poll loop retries fast and then backs off, instead of hanging. Third — and this is the part almost nobody builds — the proxy makes its own staleness visible, so Home Assistant can alert on it.

Step 1 — per-batch timeout and the 50 ms spacing

The inverter is polled in register batches. Each batch gets its own timeout (in the read_batch helper, as an asyncio.wait_for), and there's a 50-millisecond pause between two reads — the SDongle is too slow to answer reads in quick succession and punishes haste with timeouts. The crucial bit is distinguishing the two exceptions: a single TimeoutError is survivable (one batch is missing this round), but any other exception means the connection is probably dead — then we leave the loop immediately via break instead of hammering a dead socket.

for start, count in REGISTER_BATCHES:
    try:
        values = await read_batch(reader, writer, start, count)
        if values:
            async with cache_lock:
                for i, val in enumerate(values):
                    register_cache[start + i] = val
        await asyncio.sleep(0.05)  # 50ms between reads — the SDongle is slow
    except asyncio.TimeoutError:
        fail_count += 1
    except Exception:
        fail_count += 1
        break  # connection probably dead, leave the loop

The break is the heart of it: a single timeout must not abort the batch pass — otherwise every passing glitch loses you half the register set. But a real ConnectionError or a torn-down stream has to end the pass, so the outer loop can build a fresh reconnect instead of reading blindly into the void.
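The read_batch helper itself isn't shown above. A minimal sketch of what it could look like, assuming plain Modbus TCP with function code 0x03 (read holding registers), a fixed transaction id, and a 5-second timeout (READ_TIMEOUT and build_read_request are names I made up for this illustration, not part of the original proxy):

```python
import asyncio
import struct

READ_TIMEOUT = 5.0  # assumed per-batch timeout in seconds
DEVICE_ID = 1       # unit id, same value as in the configuration block


def build_read_request(transaction_id, unit_id, start, count):
    """Build a Modbus TCP frame: 7-byte MBAP header + read-registers PDU."""
    pdu = struct.pack(">BHH", 0x03, start, count)
    # The MBAP length field counts the unit id byte plus the PDU.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


async def _exchange(reader, writer, start, count):
    """One request/response round trip, without any timeout of its own."""
    writer.write(build_read_request(1, DEVICE_ID, start, count))
    await writer.drain()
    # readexactly raises IncompleteReadError if the stream closes mid-frame.
    header = await reader.readexactly(7)
    _tid, _proto, length, _unit = struct.unpack(">HHHB", header)
    body = await reader.readexactly(max(length - 1, 0))
    if len(body) < 2 or body[0] & 0x80:
        return None  # malformed or Modbus exception response: no values
    byte_count = body[1]
    return list(struct.unpack(f">{byte_count // 2}H", body[2:2 + byte_count]))


async def read_batch(reader, writer, start, count, timeout=READ_TIMEOUT):
    """Read `count` registers from `start`; raises asyncio.TimeoutError when late."""
    # The whole round trip sits under one wait_for, so a silent SDongle
    # can never block the poll loop longer than `timeout` seconds.
    return await asyncio.wait_for(_exchange(reader, writer, start, count), timeout)
```

Returning None for an exception response matches the `if values:` check in the batch loop: the cache keeps its old entries for that batch, and the pass continues.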

Step 2 — fast-retry-then-backoff and the stale-cache warning

The outer reader loop decides how often to reconnect. The logic is deliberately asymmetric: after a successful poll we wait the normal POLL_INTERVAL (10 s). When a poll fails, we retry fast, after 5 s, and back off from there up to a cap of 10 s, so a brief hiccup is bridged in seconds without flooding the SDongle with reconnect attempts. And then the most important part: when the cache is older than 120 seconds, we write an explicit Cache stale warning to the log.

last_update = 0.0
retry_delay = 5  # delay before the first retry after a failed poll (seconds)
while True:
    success = await read_sdongle()
    if success:
        last_update = time.time()
        retry_delay = 5  # reset the backoff for the next failure
        delay = POLL_INTERVAL
    else:
        delay = retry_delay
        retry_delay = min(retry_delay * 2, 10)  # back off, capped at 10 s
        # -1 before the first successful poll: nothing cached yet, so no warning
        age = time.time() - last_update if last_update > 0 else -1
        if age > 120:
            logger.warning(f"Cache stale for {age:.0f}s")
    await asyncio.sleep(delay)

That one log line is the difference between a proxy that lies and one that's honest. It makes staleness observable. In Home Assistant you can catch it on the consumer side: a sensor whose reads have failed for minutes flips to unavailable, and you can hang a push notification off that, just like any other anomaly (see the pattern in the PV string anomaly post). That only works if the proxy stops answering once the cache is stale; as long as it keeps serving cached registers, HA sees successful reads and the sensor stays available.
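If you want the age check in one place instead of spread through the loop, a small helper does it. This is a sketch, not the code from my proxy: CacheFreshness and STALE_AFTER are hypothetical names, and unlike the loop above it treats a cache that was never filled as stale. It also uses time.monotonic rather than time.time, so a clock adjustment (NTP, DST) can't fake or hide staleness.

```python
import time

STALE_AFTER = 120  # seconds, the threshold used in the text


class CacheFreshness:
    """Tracks when the register cache was last refreshed (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so the logic can be tested
        self._last_update = None     # None until the first successful poll

    def mark_updated(self):
        """Call after every successful poll."""
        self._last_update = self._clock()

    def age(self):
        """Seconds since the last successful poll, or None if there was none."""
        if self._last_update is None:
            return None
        return self._clock() - self._last_update

    def is_stale(self, threshold=STALE_AFTER):
        """True if the cache was never filled or is older than `threshold`."""
        age = self.age()
        return age is None or age > threshold
```

In the reader loop that would be `freshness.mark_updated()` on success and `if freshness.is_stale(): logger.warning(...)` on failure. The same object can back whatever else should react to staleness.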

Step 3 — client idle timeout against dead connections

The other side of the proxy is the clients (HA, evcc, a second dashboard). Without an idle timeout, dead client connections pile up — an HA restart, a crashed container, and the old socket sits open forever. Every Modbus request starts with a 7-byte MBAP header; we read it with a 60-second timeout. If nothing arrives, or fewer than 7 bytes, the client is gone and we close the connection cleanly.

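try:
    # readexactly waits for all 7 bytes; a plain read(7) may return fewer on a healthy connection
    header = await asyncio.wait_for(client_reader.readexactly(7), timeout=60)
except (asyncio.TimeoutError, asyncio.IncompleteReadError):
    break  # client idle for 60 s, or disconnected before a full header arrived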

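60 seconds is generous as long as the client's scan interval stays clearly below it: with HA polling every 30 s, a healthy connection never comes near the timeout. If a client polls only every 60 s or slower, raise the timeout above its longest scan interval, otherwise healthy connections get closed between polls. A dead connection, on the other hand, never sends another header and gets reaped within a minute at most, instead of holding a slot and memory hostage.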
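Put together, reading one client request could look like the sketch below. The function name read_request and its return shape are my own choices for illustration; the MBAP layout (transaction id, protocol id, length, unit id) is standard Modbus TCP.

```python
import asyncio
import struct

CLIENT_IDLE_TIMEOUT = 60  # seconds, as discussed above


async def read_request(client_reader, idle_timeout=CLIENT_IDLE_TIMEOUT):
    """Return (transaction_id, unit_id, pdu), or None when the client is gone."""
    try:
        # Wait for a complete 7-byte MBAP header, but never longer than the idle timeout.
        header = await asyncio.wait_for(client_reader.readexactly(7), idle_timeout)
        transaction_id, protocol_id, length, unit_id = struct.unpack(">HHHB", header)
        if protocol_id != 0 or length < 2:
            return None  # not a valid Modbus TCP frame
        # The length field counts the unit id byte plus the PDU.
        pdu = await asyncio.wait_for(client_reader.readexactly(length - 1), idle_timeout)
    except (asyncio.TimeoutError, asyncio.IncompleteReadError, ConnectionError):
        return None  # idle, disconnected mid-frame, or reset
    return transaction_id, unit_id, pdu
```

The client handler then loops on read_request and, as soon as it gets None, leaves the loop and closes the writer (`writer.close()` followed by `await writer.wait_closed()`), so the socket is actually released.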

Configuration — never real LAN IPs

All the values that change live at the top in a config block. Put in your own SDongle address — a DHCP-reserved LAN address is ideal so it doesn't drift. The port is 502 or 6607 depending on firmware. Never publish your real LAN IP in a gist or forum post; use placeholders, like here.

# Configuration (replace with your own values)
SDONGLE_HOST = "YOUR_SDONGLE_IP"   # e.g. a DHCP-reserved address on your LAN
SDONGLE_PORT = 502                 # or 6607, depending on firmware
DEVICE_ID = 1

SERVER_HOST = "0.0.0.0"
SERVER_PORT = 5502
POLL_INTERVAL = 10  # seconds
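One way to keep the real address out of the file altogether is to read it from environment variables and leave only placeholders in the source. A sketch, assuming the variable names simply mirror the constants above (that naming is my choice for this example, not something the proxy requires):

```python
import os


def load_config(env=None):
    """Read settings from the environment, falling back to the placeholder defaults."""
    env = os.environ if env is None else env
    return {
        "SDONGLE_HOST": env.get("SDONGLE_HOST", "YOUR_SDONGLE_IP"),
        "SDONGLE_PORT": int(env.get("SDONGLE_PORT", "502")),
        "DEVICE_ID": int(env.get("DEVICE_ID", "1")),
        "SERVER_HOST": env.get("SERVER_HOST", "0.0.0.0"),
        "SERVER_PORT": int(env.get("SERVER_PORT", "5502")),
        "POLL_INTERVAL": int(env.get("POLL_INTERVAL", "10")),
    }


# Example with an address from the 192.0.2.0/24 documentation range:
config = load_config({"SDONGLE_HOST": "192.0.2.10", "SDONGLE_PORT": "6607"})
```

That way the script you paste into a gist contains nothing but YOUR_SDONGLE_IP, and the real value lives in your service unit or container environment.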

Numbers from long-term operation

These values were tuned over months of real operation with a Huawei SUN2000 SDongle. 50 ms spacing: below it timeouts piled up, above it the poll got sluggish. 10 s poll interval: fine enough for PV data and gentle on the SDongle. 120 s stale threshold: twelve poll intervals, enough to ride out the minute the SDongle needs after a firmware reboot, plus headroom. 60 s client idle: covers any sane HA scan interval. Tune them to your hardware, but start here.

Frequently asked questions

Why not just use the ready-made ha-modbusproxy add-on?

You can — the add-on is good and takes the work off your hands. This post is for those who built their own proxy (or want to understand what happens underneath) and need to control the reliability layer themselves. The failure modes and thresholds here apply conceptually to any caching proxy, self-built or add-on alike.

My cache still freezes sometimes — why?

Almost always the missing break in the batch loop: when the connection dies, the loop keeps reading against the dead socket and the outer loop never gets the chance to reconnect. Make sure a real connection failure leaves the batch pass. Watch for the case where a dead connection surfaces only as TimeoutError rather than ConnectionError: the break never fires then, so every remaining batch has to time out before the pass ends. Second suspect: a read with no asyncio.wait_for at all. A single read without a timeout is enough to block the whole loop forever.

How do I alert on a stale cache in Home Assistant?

Easiest is via the sensor staleness itself: a Modbus sensor that stops getting fresh values goes unavailable after a few missed scans. Hang an automation off that with a state trigger to unavailable and a for duration of a few minutes to ride out brief dropouts. If you want it more explicit, build a template binary sensor that checks the age of the last update against a threshold.

Which SDongle port is correct — 502 or 6607?

Depends on the SDongle firmware. Older firmware often speaks the standard Modbus port 502; newer ones moved Modbus TCP to 6607 in places, or require you to enable it in the FusionSolar app first. Try 502 first; if the connect is refused outright, use 6607. If nothing happens at all, Modbus TCP on the dongle is probably still disabled.
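The try-502-then-6607 routine is easy to script. A rough sketch using only asyncio; probe_port and find_modbus_port are hypothetical helper names, and the 3-second timeout is an assumption. Keep in mind that an open TCP port only tells you something is listening, not that it answers Modbus requests.

```python
import asyncio


async def probe_port(host, port, timeout=3.0):
    """Return 'open', 'refused' or 'no response' for a TCP connect attempt."""
    try:
        _reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
    except ConnectionRefusedError:
        return "refused"          # port closed: try the other candidate
    except (asyncio.TimeoutError, OSError):
        return "no response"      # nothing answers: Modbus TCP may be disabled
    writer.close()
    await writer.wait_closed()
    return "open"


async def find_modbus_port(host, candidates=(502, 6607)):
    """Return the first candidate port that accepts a connection, else None."""
    for port in candidates:
        if await probe_port(host, port) == "open":
            return port
    return None
```

Run it once with `asyncio.run(find_modbus_port("YOUR_SDONGLE_IP"))` and put the result into SDONGLE_PORT.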
