Development Update — June 7
The day the dmsg session lifecycle finally got fixed — the change that was the real root cause behind “dmsg error 202” and the per-hop coin flip that made multihop route setup unreliable for months. Newest-session-wins replaces a lifecycle that rejected healthy reconnects and clung to half-dead corpses. Alongside it: bounding the mux-bandwidth RTT probe so it can’t hang the run, and a hypervisor UI pass that adds a DMSG-server IP column, fixes text contrast, and declutters the skynet forms.
Skywire: Newest-Session-Wins for dmsg Reconnects
3035 fix(dmsg): replace stale sessions on reconnect (newest-session-wins) — fixes multihop route setup is the marquee fix, and it goes deeper than any of the route-setup patches that came before it. A dmsg session whose TCP link half-dies — drops without a FIN/RST — leaves its serve loop parked in AcceptStream and its entry sitting in the PK→session map. The old lifecycle did exactly the wrong thing with that corpse: a reconnecting session was rejected with “session already exists” and closed (the healthy one lost to the dead one), and delSession deleted by PK unconditionally, so a late-returning corpse could even evict its live replacement.
The consequences were measured live. A visor stayed advertised as delegated on a server that could no longer carry an inbound bridged stream: on one production visor, only 2 of its 8 “connected” servers could actually bridge to it, while the other 6 hung for the full 5s handshake timeout. Dialing a hop on :136 for route-ID reservation therefore burned ~5s per stale-server guess, and with only a ~10s reservation budget just ~2 servers got tried, so route setup returned “dmsg error 202 - cannot connect to delegated server” whenever the live server wasn’t picked first. Multihop setup became a per-hop coin flip, and zombie accept-loop goroutines piled up (21 serve goroutines for 8 reachable servers). The session ping never caught it because it exercises only the outbound path, not the inbound accept.
The fix is newest-session-wins: setSession now always installs the new session and closes the stale predecessor (outside the sessions mutex, to avoid the serve-stream self-deadlock), dialSession does the same instead of rejecting the reconnect, and delSession is identity-checked so an evicted predecessor’s unwinding serve goroutine can’t remove the live successor. It applies to both the dmsg server’s and the visor client’s session maps. A companion change in the same PR stops a transient startup lookup failure from clobbering a live discovery entry with a sequence-0 re-register: the fresh-entry path treated any lookup error as “entry does not exist” and posted a sequence-0 entry that the discovery then rejected 422; it now posts fresh only on a genuine not-found, surfacing transient errors for retry.
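The lifecycle described above can be sketched as a small registry. This is a minimal illustration and not the dmsg code: the session and sessionMap types, the signatures, and the string PK are all assumptions; only the setSession and delSession names come from the PR description.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical stand-in for a dmsg session.
type session struct {
	id     int
	closed bool
}

// Close marks the session closed; a real session would tear down its link.
func (s *session) Close() { s.closed = true }

// sessionMap is an illustrative PK -> session registry.
type sessionMap struct {
	mu       sync.Mutex
	sessions map[string]*session
}

func newSessionMap() *sessionMap {
	return &sessionMap{sessions: make(map[string]*session)}
}

// setSession always installs the new session (newest wins) and closes the
// stale predecessor only after the mutex has been released.
func (m *sessionMap) setSession(pk string, s *session) {
	m.mu.Lock()
	old := m.sessions[pk]
	m.sessions[pk] = s
	m.mu.Unlock()

	if old != nil && old != s {
		old.Close() // outside the lock, so teardown may call delSession safely
	}
}

// delSession removes the entry only if it still points at s, so an evicted
// predecessor cannot delete its live successor. It reports whether it deleted.
func (m *sessionMap) delSession(pk string, s *session) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.sessions[pk] != s {
		return false
	}
	delete(m.sessions, pk)
	return true
}

// get returns the session currently registered for pk, or nil.
func (m *sessionMap) get(pk string) *session {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.sessions[pk]
}

func main() {
	m := newSessionMap()
	stale := &session{id: 1}
	fresh := &session{id: 2}

	m.setSession("pk-a", stale)
	m.setSession("pk-a", fresh) // reconnect replaces the stale session

	fmt.Println("stale closed:", stale.closed)
	fmt.Println("late delete by stale removed entry:", m.delSession("pk-a", stale))
	fmt.Println("fresh still registered:", m.get("pk-a") == fresh)
}
```

Comparing by pointer identity in delSession is one way to express the identity check; a real implementation might compare a session ID or generation counter instead.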
Skywire: Bound the Mux-Bandwidth RTT Probe
3037 fix(visor): bound mux-bw RTT probe so --probe-rtt can’t hang the run fixes a cli visor ping mux-bw --probe-rtt run that could hang for tens of seconds and never return. The probe loop called PingOnce synchronously, and PingOnce armed a hardcoded 30s read deadline; when a probe’s echo stalled, it blocked well past the measurement window, so the pump’s wait group never returned and the whole RPC and CLI hung. The fix bounds the entire round-trip with a configurable timeout (5s for the probe loop, shrunk to the remaining context each iteration) so a stalled probe can never outlive the window.
A second commit gives the probe its own route and conn. Loaded probing still hung because the probe and a pump goroutine shared one route conn, and the pump clears that conn’s deadline on every call — so a cleared deadline mid-probe-read made the read unbounded again. Dialing a dedicated probe route (kept out of the pumped set, with its route-established event suppressed) isolates it from the pump’s deadline churn. And a third commit adds a warm-up phase: a freshly set-up route is “cold” — its per-hop rules are still settling and its transports unprimed — so measuring the idle baseline on a cold route and the load phase on the now-warm route compared cold-vs-warm rather than unloaded-vs-loaded, producing nonsensical negative queueing deltas at 2–3 hops. A 5s warm-up primes the routes and discards the bytes before measurement, so loaded RTT and queueing delay become reliably measurable.
Skywire: Hypervisor UI — Service Health, Contrast, and Decluttered Forms
3036 feat(hvui): services-health IP column + endpoint format, text-contrast fixes, declutter skynet forms is a UI pass driven by operator feedback. The services-health page gains an IP column showing each DMSG server’s public ip:port, sourced from the dmsg-discovery all-servers cache queried over dmsg (non-server services render a hyphen); the endpoint column now shows the full scheme-and-path health endpoint (dmsg://<pk>:80/health) at readable contrast rather than the bare, double-dimmed host. A broad text-contrast fix raises the secondary-text and placeholder alpha across every component stylesheet — the dim hints were “too close to the background,” and a follow-up commit brightened them again after the first pass was still too dark against the white table headers. The skynet tab’s sprawling multi-row “Add port” and reverse-proxy forms are replaced with a single compact inline add-bar, with no fields or bindings removed. The PR also makes the embedded UI send Cache-Control headers — content-hashed assets immutable, index.html and SPA routes no-cache — so a UI redeploy is no longer masked by the browser heuristically caching the previous bundle.
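The caching split described above could look roughly like the following net/http middleware. This is a sketch under assumptions, not the hypervisor's code: the hashed-filename pattern, the max-age value, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"regexp"
)

// hashedAsset matches names such as "main.3f2a9c1b.js". The naming scheme is
// an assumption; a real bundle may embed its content hash differently.
var hashedAsset = regexp.MustCompile(`[.-][0-9a-f]{8,}\.(js|css|woff2?|png|svg)$`)

// cacheControlFor returns a long-lived immutable policy for content-hashed
// assets and no-cache for everything else (index.html and SPA routes).
func cacheControlFor(path string) string {
	if hashedAsset.MatchString(path) {
		return "public, max-age=31536000, immutable"
	}
	return "no-cache"
}

// withCacheControl sets the Cache-Control header before delegating to next.
func withCacheControl(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Cache-Control", cacheControlFor(r.URL.Path))
		next.ServeHTTP(w, r)
	})
}

func main() {
	for _, p := range []string{"/main.3f2a9c1b.js", "/index.html", "/nodes/list"} {
		fmt.Printf("%-20s %s\n", p, cacheControlFor(p))
	}
}
```

Note that no-cache still lets the browser store the response; it only forces revalidation on each use, which is enough for a redeployed index.html to be picked up.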