
Development Update — June 8

A day with two threads running in parallel: making multihop route setup tolerate the network’s churn instead of failing on it, and giving the resolving proxy real source-routing power. The route-setup work adds a per-hop id-reservation retry on transient dmsg-202 and replaces dmsgweb’s reconnect storm with a single client. The resolver work lets you name a dmsg rendezvous server or an exact transport path right in the hostname — pinning how a .dmsg or .skynet address is reached. Around them: latency-weighted route selection becomes the finder’s default, debug logs and pprof get consolidated onto one dmsg port, and a third CXO goroutine leak is closed.

Skywire: Make Multihop Setup Tolerate dmsg Churn

3039 fix(dmsg): cut multihop-setup & dmsgweb churn — id-reserver 202 retry + single-client dmsgweb attacks the churn from two sides. During route setup the id-reserver dials every hop on :136 to reserve route IDs, and it used to fail the whole reservation on the first error. That failure then tripped the far more expensive 6× whole-route-setup retry above it, with each attempt re-querying the route-finder and eating the reverse-handshake timeout. The common failure is a transient “dmsg error 202” on a single hop that is reachable again seconds later, caused by a dmsg server on the path briefly churning a session (measured: the same intermediate probed 12/12 reachable moments after it 202’d during setup). Each hop dial now gets a short bounded retry (3 attempts, linear backoff) on a transient 202, and each re-dial re-runs the multi-server selection, so once the churn clears the reservation succeeds instead of failing the whole route.
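The per-hop retry can be pictured with a minimal Go sketch. It is not the PR's code: errTransient202, dialWithRetry, and the simulated dial function are hypothetical names, and the real reserver also re-runs multi-server selection on each re-dial, which is omitted here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient202 is a hypothetical stand-in for a transient "dmsg error 202".
var errTransient202 = errors.New("dmsg error 202")

// dialWithRetry calls dial up to attempts times. It retries only on the
// transient error and sleeps step, 2*step, ... between attempts (linear backoff).
func dialWithRetry(dial func() error, attempts int, step time.Duration) error {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = dial(); err == nil {
			return nil
		}
		if !errors.Is(err, errTransient202) {
			return err // non-transient errors fail immediately
		}
		if i < attempts {
			time.Sleep(time.Duration(i) * step)
		}
	}
	return fmt.Errorf("hop still unreachable after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated hop: returns 202 twice, then becomes reachable.
	dial := func() error {
		calls++
		if calls < 3 {
			return errTransient202
		}
		return nil
	}
	err := dialWithRetry(dial, 3, 10*time.Millisecond)
	fmt.Println("err:", err, "dial calls:", calls)
}
```

Keeping the retry at the hop level means a brief session churn costs a few short sleeps rather than a full route-setup retry.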

The second half kills a reconnect storm in dmsgweb. The default mode started N “bootstrap” clients (one per dmsg server) sharing one PK just to carry discovery HTTP, plus a main client on top — all under the same key. A dmsg server permits one session per PK, so wherever a bootstrap client and the main client landed on the same server the server kicked the duplicate, the loser redialed and kicked the other, and so on — observed at ~30 session events/sec, which made the proxy’s PK intermittently unreachable. It’s replaced with a single client whose discovery HTTP rides its own sessions, preloaded with every server entry. A local A/B showed EOF session kicks dropping 55→0, “session already exists” 26→0, and log volume 690→53 lines, with one client PK instead of ~9 colliding.

3042 fix(router): retry id-reservation on stale pooled connection (the #1 RSN setup failure) closes the gap left by the pool-liveness fix from two days earlier. That change discarded a pooled connection only once it idled past the read deadline — using idle time as a liveness proxy — but a connection that died for any other reason (the intermediate restarted, its session dropped) has a recent lastUsed, so Get still handed back a corpse and the first reservation returned “connection is shut down.” Diagnosed live from the setup node’s /stats: the RSN was at 91% success, and ~90% of failures were id-reservation 331 “connection is shut down” plus context-deadline 170, with pprof confirming no goroutine leak — transient stale connections, not overload. The reserver now re-dials once on “connection is shut down” and retries, replacing the client in the map so the later rule-install reuses the live conn; a reachable hop succeeds on the retry, a genuinely dead hop still fails.
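A rough Go sketch of the re-dial-once behavior follows. The client type, the pool map, and reserveWithRedial are hypothetical stand-ins for illustration; only the error text "connection is shut down" comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

var errShutDown = errors.New("connection is shut down")

// client is a hypothetical stand-in for a pooled connection to one hop.
type client struct{ alive bool }

// Reserve fails with errShutDown when the underlying connection is dead.
func (c *client) Reserve() error {
	if !c.alive {
		return errShutDown
	}
	return nil
}

// reserveWithRedial tries the pooled client first. On errShutDown it dials
// once more, stores the fresh client in the pool so later steps reuse the
// live connection, and retries the reservation a single time.
func reserveWithRedial(pool map[string]*client, hop string, dial func(string) (*client, error)) error {
	c, ok := pool[hop]
	if !ok {
		fresh, err := dial(hop)
		if err != nil {
			return err
		}
		pool[hop] = fresh
		return fresh.Reserve()
	}
	err := c.Reserve()
	if !errors.Is(err, errShutDown) {
		return err // success, or an error a re-dial would not fix
	}
	fresh, derr := dial(hop)
	if derr != nil {
		return derr // a genuinely dead hop still fails
	}
	pool[hop] = fresh
	return fresh.Reserve()
}

func main() {
	pool := map[string]*client{"hop-a": {alive: false}} // stale pooled conn
	dial := func(string) (*client, error) { return &client{alive: true}, nil }
	fmt.Println("err:", reserveWithRedial(pool, "hop-a", dial))
	fmt.Println("pool now holds a live conn:", pool["hop-a"].alive)
}
```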

Skywire: Source-Route the Resolving Proxy by PK or Transport ID

3040 feat(resolver): pin dmsg server + source-route skynet by PK or transport ID turns the resolver hostname into a routing instruction. A new shared parser, ParseResolverHost, reads hostnames of the form [<vhost>.]<r1>...<rN>.<destPK>.<suffix>: the destination PK stays the label next to the suffix (so existing <vhost>.<pk>.<suffix> names parse identically), PK-shaped labels to its left form a routing chain in source order, and remaining labels are the vhost — each routing element classified by shape as a visor PK or an exact transport-ID uuid. The suffix decides meaning. For .dmsg, the routing label is a rendezvous dmsg server: <client-pk>.<server-pk>.dmsg now dials the client through the named server’s session instead of resolving via discovery, letting a browser reach a direct or hidden dmsg client by naming the server it’s on, with no all-servers guessing.
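The parsing rule can be sketched in Go as below. This is an illustration under stated assumptions, not the real ParseResolverHost: the function and type names are hypothetical, a PK is assumed to look like 66 hex characters, and a transport ID is assumed to be a canonical hyphenated UUID.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Assumed label shapes: 66 hex chars for a PK, canonical form for a UUID.
var (
	pkRe   = regexp.MustCompile(`^[0-9a-f]{66}$`)
	uuidRe = regexp.MustCompile(`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)
)

// Placeholder keys used only by the demo below.
var (
	demoDest   = strings.Repeat("02", 33)
	demoServer = strings.Repeat("03", 33)
)

// resolverHost is a hypothetical result type for this sketch.
type resolverHost struct {
	VHost  string   // leftover labels, joined with "."
	Route  []string // routing elements in source order
	DestPK string
	Suffix string
}

// parseResolverHost splits [<vhost>.]<r1>...<rN>.<destPK>.<suffix>.
func parseResolverHost(host string) (resolverHost, error) {
	labels := strings.Split(strings.ToLower(host), ".")
	n := len(labels)
	if n < 2 {
		return resolverHost{}, fmt.Errorf("need at least <destPK>.<suffix>: %q", host)
	}
	dest := labels[n-2]
	if !pkRe.MatchString(dest) {
		return resolverHost{}, fmt.Errorf("label next to suffix is not PK-shaped: %q", dest)
	}
	// Walk left from the destination while labels look like routing elements.
	start := n - 2
	for start > 0 && (pkRe.MatchString(labels[start-1]) || uuidRe.MatchString(labels[start-1])) {
		start--
	}
	return resolverHost{
		VHost:  strings.Join(labels[:start], "."),
		Route:  labels[start : n-2],
		DestPK: dest,
		Suffix: labels[n-1],
	}, nil
}

func main() {
	h, err := parseResolverHost("site." + demoServer + "." + demoDest + ".dmsg")
	fmt.Println(h.VHost, len(h.Route), h.Suffix, err)
}
```

Because the destination stays next to the suffix, a plain <vhost>.<pk>.<suffix> name yields an empty route and parses as before.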

For .skynet, the routing labels are source-route hops: <hop>...<dest>.skynet dials the exact forward and reverse routing hops via ForwardHops/ReverseHops, bypassing the route-finder entirely. A transport-ID hop resolves its edges to find the next visor; a PK hop names the next visor, whose transport is found or, for this visor’s own hop, created (stcpr then sudph, never dmsg, per the multi-hop dmsg rule). A companion CLI command, cli tp route-addr, computes the deterministic transport IDs along an ordered PK path and assembles the <tpid-1>...<tpid-N>.<dest-pk>.skynet address to paste into the proxy. The PR also adds a vhost Host-header rewrite to dmsgweb (so magnetosphere.net.<pk>.dmsg reaches the right backend site). The whole thing was validated live against magnetosphere: bare, vhost, pinned, vhost+pin, and the skynet source-route all returned 200, with the router log confirming the forward rule used the specified transport. A final fix in the PR ensures the embedded dmsgweb listener (4445) actually binds. It had blocked init on dmsg readiness, which under a bounded boot context could outlive the context, so the app was never registered; the readiness wait now moves into serve and is bounded.

Skywire: Latency-Weighted Route Selection by Default

3041 feat(route-finder): latency-weighted route selection (default) fixes the finder choosing routes by hop count alone while ignoring the very latency it already had. The finder builds its graph from GetTransportsByEdge, which overlays each edge’s measured latency onto the transport — but then picked among equal-hop routes by graph-adjacency order, so a detour like US→AU→DE could beat a clean direct-ish path, and the visor’s downstream latency-ranking can’t recover a good path the finder never returns. GetRouteWeighted now ranks a bounded candidate pool by total measured latency (with a per-hop penalty substituted for unmeasured edges, so all-unmeasured routes degrade to hop count), and the service uses it by default — free, since the graph already carries the latency. An audit found 92% of transports carry measured latency. The CLI’s older --by-latency flag stays opt-in: a live test showed it picking a +Inf-latency route on incomplete data, so aligning the CLI onto the same algorithm is left as a follow-up.
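The ranking idea can be sketched in Go as follows. It is not the finder's GetRouteWeighted: the route type, the use of -1 for an unmeasured edge, the penalty value, and the latency figures are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// route is a candidate path. Each entry is one hop's measured latency in
// milliseconds; -1 marks an unmeasured edge (an assumption of this sketch).
type route struct {
	Name      string
	LatencyMs []int
}

// routeCost sums per-hop latency, substituting penaltyMs for unmeasured hops.
func routeCost(r route, penaltyMs int) int {
	total := 0
	for _, l := range r.LatencyMs {
		if l < 0 {
			total += penaltyMs
		} else {
			total += l
		}
	}
	return total
}

// rankByLatency returns a copy of cands ordered by ascending total cost.
// With every hop unmeasured, cost is hops*penalty, so it falls back to hop count.
func rankByLatency(cands []route, penaltyMs int) []route {
	out := append([]route(nil), cands...)
	sort.SliceStable(out, func(i, j int) bool {
		return routeCost(out[i], penaltyMs) < routeCost(out[j], penaltyMs)
	})
	return out
}

func main() {
	cands := []route{
		{Name: "US->AU->DE", LatencyMs: []int{180, 290}}, // illustrative numbers
		{Name: "US->NL->DE", LatencyMs: []int{85, 12}},
	}
	for _, r := range rankByLatency(cands, 100) {
		fmt.Println(r.Name, routeCost(r, 100))
	}
}
```

Both candidates have two hops, so hop count alone cannot separate them; the summed latency does.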

Skywire: One Debug Port, and a Third CXO Leak

3043 feat(services): consolidate pprof + add /debug/log onto dmsg :80; fix visor /visor.log path unifies how deployment services expose debugging. Services served pprof on a separate dmsg :81 listener and had no log endpoint over dmsg at all — debug logs were only reachable on the host’s stdout. Now WithDebug folds /debug/pprof/* and a new /debug/log onto the service’s main dmsg :80 handler (survey-whitelist-gated), retiring the :81 listener; the log endpoint is served from a thread-safe bounded ring buffer fed by a logrus hook, so recent log output is available with no disk file. It also fixes the visor’s /visor.log, which opened the wrong path (visor.log rather than the real log/skywire.log) and always 404’d. After this, every deployment service exposes both pprof and debug logging over dmsg on one port through the survey-whitelist proxy — making the RSN’s live id-reservation failures watchable behind its /stats counters.
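A thread-safe bounded ring buffer of the kind described can be sketched with the Go standard library alone. The logRing type and its methods are hypothetical; in the real service a logrus hook would feed the buffer and the /debug/log handler would read it, and neither is shown here.

```go
package main

import (
	"fmt"
	"sync"
)

// logRing keeps the most recent lines, overwriting the oldest when full.
type logRing struct {
	mu    sync.Mutex
	lines []string
	next  int  // index of the slot the next Add will write
	full  bool // true once the buffer has wrapped at least once
}

func newLogRing(capacity int) *logRing {
	if capacity < 1 {
		capacity = 1
	}
	return &logRing{lines: make([]string, capacity)}
}

// Add stores one log line; a logging hook would call this per entry.
func (r *logRing) Add(line string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lines[r.next] = line
	r.next = (r.next + 1) % len(r.lines)
	if r.next == 0 {
		r.full = true
	}
}

// Snapshot returns the buffered lines oldest-first, as a copy.
func (r *logRing) Snapshot() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.full {
		return append([]string(nil), r.lines[:r.next]...)
	}
	out := make([]string, 0, len(r.lines))
	out = append(out, r.lines[r.next:]...)
	return append(out, r.lines[:r.next]...)
}

func main() {
	ring := newLogRing(3)
	for _, l := range []string{"a", "b", "c", "d"} {
		ring.Add(l)
	}
	fmt.Println(ring.Snapshot())
}
```

The fixed capacity bounds memory, which is what lets recent output be served without a disk file.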

And 3038 fix(cxo): close the feeds actor in Node.Close() — stops a goroutine leak closes the third unbounded-goroutine leak in this stretch. Node.Close() shut down peers, connections, transports and the DB but never closed the per-node feeds actor — its close() existed but was marked unused and never called — so every node teardown leaked the feeds dispatcher goroutine and every per-feed and per-head goroutine it owned. A subscriber that periodically reconnects piled these up without bound (observed at ~1650 parked goroutines out of ~4700, bloating the heap until GC consumed ~80% of the process CPU); Close() now closes the feeds actor, which cascades cleanly down each feed and head on the close signal.
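The shape of the leak and its fix can be sketched in Go. This is not CXO's code: the feeds and node types, the worker count, and the Running counter are hypothetical, and the real actor owns per-feed and per-head goroutines rather than identical workers.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// feeds is a hypothetical actor that owns several parked goroutines.
type feeds struct {
	quit    chan struct{}
	wg      sync.WaitGroup
	once    sync.Once
	running int64
}

func newFeeds(workers int) *feeds {
	f := &feeds{quit: make(chan struct{})}
	for i := 0; i < workers; i++ {
		f.wg.Add(1)
		atomic.AddInt64(&f.running, 1)
		go func() {
			defer f.wg.Done()
			defer atomic.AddInt64(&f.running, -1)
			<-f.quit // parked until the close signal arrives
		}()
	}
	return f
}

// Running reports how many owned goroutines have not yet exited.
func (f *feeds) Running() int64 { return atomic.LoadInt64(&f.running) }

// close signals every owned goroutine and waits for them to exit.
func (f *feeds) close() {
	f.once.Do(func() { close(f.quit) })
	f.wg.Wait()
}

type node struct{ feeds *feeds }

// Close must close the feeds actor; skipping this call is the leak,
// because every goroutine above would stay parked forever.
func (n *node) Close() { n.feeds.close() }

func main() {
	n := &node{feeds: newFeeds(5)}
	fmt.Println("before Close:", n.feeds.Running())
	n.Close()
	fmt.Println("after Close:", n.feeds.Running())
}
```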