
Development Update — June 5

A busy day, one day after a release, and the dominant thread is multihop route setup again, this time with a hard, pragmatic pivot. Live A/B testing showed that the new source-driven cascade installs routes whose control plane succeeds on every hop but whose data plane carries no traffic. The default therefore flips back to legacy direct-dial route setup while the cascade’s data-plane bug is chased down; alongside it, the dial loop is hardened to walk more candidate intermediates, and the mux now self-heals its degree in the background. Around it: a benevolent bandwidth-spreading routing-policy preset, a TPD fix that keeps both edges of an expired transport (which affected roughly half the network’s transport bandwidth and the rewards tied to it), a launcher fix for skysocks-client and vpn-client, which had been dead on arrival, a synchronous bbolt integrity walk that stops ARM boards crash-looping, the ssh/sshd/sshfs commands unified under cli pty, and a clutch of CXO, skychat, and logserver ergonomics.

Skywire: Reliable Multihop Route Setup, and Legacy by Default

3027 fix(router): reliable multihop route setup — legacy default, handshake fallover, self-healing mux is a multi-commit reckoning with how routes get set up over a network of flaky intermediates. The dial loop is tightened first: the route-group handshake await drops 30s→12s (a handshake is one small packet each way, done in seconds even at six hops), the source-cascade ACK wait 10s→6s, and maxRetries 3→6 so the dial walks far more of the often-100+ candidate set instead of giving up after three. Crucially, a set-up route whose reverse handshake never completes is now treated as a retryable failure: the cascade reports success once rules are installed on every hop, but that’s only the control plane — a misbehaving intermediate that ACKs rule-install yet drops data-plane frames used to slip through and fail the whole dial. Now its intermediates are excluded and a fresh route through different ones is fetched, both in DialRoutes and in the ping/probe path.
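The retry-with-exclusion behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the router's actual code: the Route type and the fetch and setup callbacks are hypothetical stand-ins, and only the maxRetries value of 6 comes from the PR description.

```go
package main

import "errors"

// Route is a hypothetical stand-in for a candidate route: just the public
// keys of its intermediate hops.
type Route struct {
	Intermediates []string
}

// maxRetries mirrors the new value described above (raised from 3 to 6).
const maxRetries = 6

// ErrNoRoute is returned when every attempt failed or no candidates remain.
var ErrNoRoute = errors.New("no working route found")

// dialWithExclusion fetches a route, tries to set it up, and on any failure
// (including a reverse handshake that never completes) excludes that route's
// intermediates before asking for a fresh one.
// fetch and setup are hypothetical callbacks, not skywire APIs.
func dialWithExclusion(
	fetch func(excluded map[string]bool) (Route, bool),
	setup func(Route) error,
) (Route, error) {
	excluded := make(map[string]bool)
	for attempt := 0; attempt < maxRetries; attempt++ {
		r, ok := fetch(excluded)
		if !ok {
			break // nothing left outside the excluded set
		}
		if err := setup(r); err == nil {
			return r, nil
		}
		// Control plane may have succeeded, but the route is unusable:
		// never pick these intermediates again during this dial.
		for _, pk := range r.Intermediates {
			excluded[pk] = true
		}
	}
	return Route{}, ErrNoRoute
}
```

The point of the exclusion set is that a retry does not land on the same misbehaving intermediate, which is what lets a higher retry count actually walk more of the candidate set.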

That diagnosis — control plane succeeds, data plane doesn’t, 0/10 two-hop establishments while the legacy directly-dialed path establishes and pumps data — drove the headline decision: enable_cascade_route_setup defaults to false, so every visor (including existing configs) uses legacy direct-dial route setup where the setup node dials each hop to install rules. The legacy path is dmsg-202-limited, which is exactly what the cascade was built to avoid, but it’s a correct, working escape hatch until the cascade’s data-plane bug is fixed and the default can flip back. New diagnostics made this tractable: a bare “rule not found” log now carries the offending route ID and packet type, and the cascade logs its actual reserved route-ID slices.

The same PR adds self-healing mux degree. A multiplexed route group already redistributed traffic onto surviving legs the instant one died, but it didn’t restore the degree — a dropped leg was only replaced on the next periodic policy tick, and a plain --routes N group never replaced it at all, so the mux silently decayed. Now every leg drop dials a replacement in the background until the live degree is restored, bounded to one concurrent heal so a flapping leg can’t storm the setup node, with surviving legs carrying traffic throughout.
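One way to picture the "heal in the background, but only one heal at a time" rule is the sketch below. It assumes a drastically simplified Mux that tracks only leg counts; the type, its fields, the dial callback, and the retry delay are all hypothetical and chosen for illustration.

```go
package main

import (
	"sync"
	"time"
)

// Mux is a hypothetical, simplified route group that only counts legs.
type Mux struct {
	mu         sync.Mutex
	wg         sync.WaitGroup
	target     int           // desired degree
	live       int           // legs currently up
	healing    bool          // true while the single healer goroutine runs
	dial       func() bool   // hypothetical: dials one replacement leg
	retryDelay time.Duration // assumed pause after a failed dial
}

// NewMux returns a mux that starts at full degree.
func NewMux(degree int, dial func() bool) *Mux {
	return &Mux{target: degree, live: degree, dial: dial, retryDelay: 10 * time.Millisecond}
}

// OnLegDrop records the loss and starts a healer unless one is already running.
func (m *Mux) OnLegDrop() {
	m.mu.Lock()
	if m.live > 0 {
		m.live--
	}
	start := !m.healing
	m.healing = true
	m.mu.Unlock()
	if start {
		m.wg.Add(1)
		go m.heal()
	}
}

// heal dials replacements one at a time until the degree is restored.
func (m *Mux) heal() {
	defer m.wg.Done()
	for {
		m.mu.Lock()
		if m.live >= m.target {
			m.healing = false // cleared under the same lock as the check
			m.mu.Unlock()
			return
		}
		m.mu.Unlock()
		if m.dial() {
			m.mu.Lock()
			m.live++
			m.mu.Unlock()
		} else {
			time.Sleep(m.retryDelay)
		}
	}
}

// Live reports the current number of live legs.
func (m *Mux) Live() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.live
}

// Wait blocks until any running healer has finished.
func (m *Mux) Wait() { m.wg.Wait() }
```

Because the healing flag is checked and cleared under the same lock as the leg count, a drop that arrives while a heal is in flight is picked up by the running healer instead of spawning a second one, which is the property that keeps a flapping leg from storming the setup node.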

3026 feat(router/mux): per-leg loss signal + spread-bw sheds the single worst leg gives the mux a way to tell a lossy leg from a merely slow one. SACK retransmits were folded into the leg’s sent counter, so policies saw only latency; a distinct per-leg retransmit counter is now surfaced end-to-end into the Starlark leg.retransmits field. And the spread-bw preset’s tick logic — which used to drop every over-budget leg at once, collapsing the mux to zero — is rewritten to score each leg as latency_ms + retransmits*50 (loss weighted far above slowness) and shed at most one leg per tick, only the worst and only when clearly worse than the healthiest, never the last live leg.
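The scoring rule is simple enough to sketch directly. The formula (latency_ms + retransmits*50) and the "at most one leg, never the last" constraints come from the description above; the Leg type and the shedMargin threshold used for "clearly worse" are assumptions, since the preset's real threshold is not given here, and the real policy is written in Starlark rather than Go.

```go
package main

// Leg is a hypothetical view of one mux leg as a policy might see it.
type Leg struct {
	ID          string
	LatencyMs   float64
	Retransmits float64
}

// score follows the quoted formula: loss is weighted far above slowness.
func score(l Leg) float64 { return l.LatencyMs + l.Retransmits*50 }

// shedMargin is an ASSUMED threshold for "clearly worse than the healthiest".
const shedMargin = 100.0

// pickLegToShed returns the index of the single leg to drop this tick,
// or -1 if nothing should be shed.
func pickLegToShed(legs []Leg) int {
	if len(legs) <= 1 {
		return -1 // never shed the last live leg
	}
	worst, best := 0, 0
	for i := range legs {
		if score(legs[i]) > score(legs[worst]) {
			worst = i
		}
		if score(legs[i]) < score(legs[best]) {
			best = i
		}
	}
	if score(legs[worst])-score(legs[best]) > shedMargin {
		return worst
	}
	return -1
}
```

With these numbers, a 30 ms leg with four retransmits scores 230 and is shed ahead of a clean 180 ms leg, which is the "lossy versus merely slow" distinction the new counter makes possible.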

Skywire: A Benevolent Bandwidth-Spreading Routing Policy

3024 feat(router/policy): embedded routing-policy presets + spread-bw adds a preset:<name> form to the per-dial policy that resolves to a curated routing policy embedded in the binary — no policy file to ship, no compile step, and a remote visor runs the exact same tested policy by name. The flagship preset, spread-bw, is the “browse benevolently” policy: it routes app traffic across four parallel multihop routes (mux=4, min_hops=2) and continuously rotates each leg onto fresh intermediates once it has carried a byte budget. Because min_hops>=2 forces real intermediate hops and intermediates earn bandwidth-reward credit for relaying, ordinary browsing through this policy distributes a reward-eligible bandwidth floor across many network visors instead of pinning one path — dynamic routing, not a static circuit.
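As a rough illustration of how a preset:<name> reference might resolve to a built-in policy, consider the sketch below. The Preset struct, the presets table, and resolvePolicy are hypothetical; only the preset: prefix, the spread-bw name, and its mux=4 and min_hops=2 parameters come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// Preset holds the two parameters quoted for spread-bw; a real policy
// carries more than this.
type Preset struct {
	Mux     int
	MinHops int
}

// presets is a hypothetical table standing in for policies embedded in the binary.
var presets = map[string]Preset{
	"spread-bw": {Mux: 4, MinHops: 2},
}

// resolvePolicy maps a "preset:<name>" spec to its built-in definition.
func resolvePolicy(spec string) (Preset, error) {
	name, ok := strings.CutPrefix(spec, "preset:")
	if !ok {
		return Preset{}, fmt.Errorf("not a preset reference: %q", spec)
	}
	p, found := presets[name]
	if !found {
		return Preset{}, fmt.Errorf("unknown preset %q", name)
	}
	return p, nil
}
```

Resolving by name against a table compiled into the binary is what lets a remote visor run the same tested policy without a policy file being shipped to it.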

Skywire: Keep Both Edges of an Expired Transport

3023 fix(tpd): persist transport edge pair so expired transports keep both edges (bandwidth rewards) fixes the root cause of the “one-sided records” bandwidth-reward problem. For an expired transport, the discovery rebuilds its edges from the daily-hash field names — but when only one edge ever published bandwidth (the common case, since the two edges publish on independent, intermittent schedules), the counterparty has no fields, its PK is unrecoverable, and it collapses to the zero PK. Measured live, ~half the network’s transport bandwidth (1.44 of 2.73 GB/day across 2552 transports) sat on transports whose second edge was the zero PK, always the responder. Since the reward calc can’t credit a zero-PK counterparty, those fell back to equal-share pooling. The fix persists the real edge pair at registration under a bw:edges:<id> key with the same 35-day TTL, and the recovery path reads it first — so an expired transport’s counterparty stays identifiable and symmetric-creditable. TPD-side only; takes effect on redeploy.
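The "read the persisted pair first, fall back to field names" order can be sketched as below. Only the bw:edges:<id> key shape comes from the description; the in-memory map standing in for the store, the comma-separated value encoding, the shortened zeroPK placeholder, and the function names are all assumptions, and the 35-day TTL is omitted.

```go
package main

import "strings"

// zeroPK is a shortened placeholder for the all-zero public key.
const zeroPK = "000000"

// edgesKey builds the key under which the edge pair is persisted.
func edgesKey(id string) string { return "bw:edges:" + id }

// persistEdges records both edges at registration time.
// The "a,b" encoding is an assumption made for this sketch.
func persistEdges(store map[string]string, id, a, b string) {
	store[edgesKey(id)] = a + "," + b
}

// recoverEdges prefers the persisted pair; only if it is missing does it
// fall back to the PKs recovered from bandwidth field names, padding any
// edge that never published with the zero PK.
func recoverEdges(store map[string]string, id string, fieldPKs []string) [2]string {
	if v, ok := store[edgesKey(id)]; ok {
		if a, b, ok := strings.Cut(v, ","); ok {
			return [2]string{a, b}
		}
	}
	edges := [2]string{zeroPK, zeroPK}
	for i := 0; i < len(fieldPKs) && i < 2; i++ {
		edges[i] = fieldPKs[i]
	}
	return edges
}
```

In the fallback path a transport where only one edge published ends up with zeroPK as its counterparty, which is exactly the one-sided record the persisted key is meant to prevent.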

Skywire: skysocks-client and vpn-client Dead on Arrival

3025 fix(launcher): in-process apps broken by external-form args (skysocks-client/vpn-client won’t start) fixes two arg-construction bugs that left flag-configured in-process apps launching and immediately exiting. First, bool flags were emitted as a single-dash, value-suffixed token (-reconnect=true) which pflag parses as an unknown shorthand cluster and rejects — and since cli proxy start always sets --reconnect, every start carried a poison arg; it now emits a proper bare --reconnect when true and removes it when false. Second, the launcher handed in-process run functions the full external-launch args including the app <name> command prefix, which is wrong for a function that wants only flags; that prefix is now stripped on the in-process path. Surfaced after the skywire app skysocks {serve,client} restructure made this the common path. Validated live: cli proxy start now stays running and tunnels.
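Both arg fixes are small enough to sketch. The helpers below are hypothetical and not the launcher's real functions; in particular, stripAppPrefix assumes the prefix is exactly the two tokens app and <name>, which is how it is described above.

```go
package main

import "strings"

// setBoolFlag removes any existing form of the flag (--name, --name=v,
// -name, -name=v) and, if on is true, appends a bare "--name".
// It never emits the "-name=true" shape that pflag rejects.
func setBoolFlag(args []string, name string, on bool) []string {
	out := make([]string, 0, len(args)+1)
	for _, a := range args {
		trimmed := strings.TrimLeft(a, "-")
		isFlag := trimmed != a
		if isFlag && (trimmed == name || strings.HasPrefix(trimmed, name+"=")) {
			continue // drop the stale or malformed form
		}
		out = append(out, a)
	}
	if on {
		out = append(out, "--"+name)
	}
	return out
}

// stripAppPrefix drops a leading "app <name>" command prefix so that an
// in-process run function receives only flags.
func stripAppPrefix(args []string) []string {
	if len(args) >= 2 && args[0] == "app" {
		return args[2:]
	}
	return args
}
```

For example, setBoolFlag([]string{"--addr", ":1080", "-reconnect=true"}, "reconnect", true) yields --addr :1080 --reconnect, and the same call with false yields just --addr :1080.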

Skywire: Stop ARM Boards Crash-Looping on a Corrupt bbolt File

3013 fix(bbolthealth): synchronous integrity walk — recover() can’t catch tx.Check()’s goroutine panic fixes a production crash-loop on an ARM SBC. The integrity probe wrapped bbolt’s tx.Check() in recover() to turn a corruption panic into a move-the-file-aside verdict — but tx.Check() runs its scan in an internal goroutine (go tx.check(ch)), so the page-assertion panic fires there, where no caller-side recover() can reach it. The guard was dead for exactly the corruption class it existed to catch, and SD-card-backed boards corrupt their DB files on unclean shutdown. The fix replaces tx.Check() with a synchronous full walk of every reachable b-tree page in the caller’s goroutine, where the existing recover() catches the panic and moves the file aside. It fixes every RepairIfCorrupt consumer — CXO stores, visor stats, app log stores, clicache, skychat group store — and is validated against a real file with corrupted page headers.
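The underlying Go rule is that recover() only intercepts a panic unwinding the goroutine in which the deferred function runs. The sketch below contrasts the two shapes using a hypothetical walk callback in place of the real page walk; it does not reproduce the bbolthealth code.

```go
package main

import "fmt"

// checkSync runs walk in the caller's goroutine, so a panic raised inside
// it unwinds through this frame and the deferred recover() turns it into
// an error the caller can act on (for example, moving the file aside).
func checkSync(walk func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("integrity walk panicked: %v", r)
		}
	}()
	walk()
	return nil
}

// checkAsync mirrors the broken shape: walk runs in its own goroutine, so
// the deferred recover() here never sees its panic. If walk panics, the
// whole process crashes. Do not call this with a panicking walk.
func checkAsync(walk func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("unreachable for panics in the goroutine: %v", r)
		}
	}()
	done := make(chan struct{})
	go func() {
		defer close(done)
		walk()
	}()
	<-done
	return nil
}
```

Calling checkSync with a walk that panics returns a non-nil error, whereas the same walk under checkAsync would take the process down, which matches the crash-loop seen on the affected boards.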

Skywire: Unify ssh/sshd/sshfs Under cli pty

3018 refactor(cli): unify ssh/sshd/sshfs under a cli pty namespace drops the misleading OpenSSH-shaped names from commands that are all really the skywire pty subsystem — a noise-XK connection keyed to public keys. cli sshd becomes cli pty host, cli ssh becomes cli pty shell, and cli sshfs becomes cli pty fs. A follow-up commit folds the dmsg-overlay pty commands (exec/start/list/ui/url) in from cli dmsg pty so the whole pty surface lives under one namespace, and gives pty fs mount via-visor and dmsg-standalone transport modes so it can mount a peer’s filesystem through an already-running visor pty host without a standalone TCP listener.

Skywire: CXO, skychat, and Logserver Ergonomics

3015 feat(cxo): pk-as-identity for the standalone cxo daemon/cxo cli carries the <pk>@host:port addressing convention back to the standalone CXO node. It had no visible identity — with no key set it generated a fresh random PK each restart and never printed it — so a --sk flag now pins a stable identity, the node prints its PK and reachability lines on startup, and tcp/udp subscribe accepts the single-arg <pk>@<address> form alongside the historical two-arg form. 3017 docs(cxo): standalone CXO node usage guide documents that node end to end with a validated two-node worked example.
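Parsing the single-argument <pk>@<address> form might look like the sketch below. parsePKAddr is a hypothetical helper; it checks only that a PK part is present and that the address splits into host and port, and it does not validate the key itself or handle the historical two-arg form.

```go
package main

import (
	"errors"
	"net"
	"strings"
)

// parsePKAddr splits "<pk>@host:port" into its public-key and address parts.
func parsePKAddr(s string) (pk string, addr string, err error) {
	p, a, ok := strings.Cut(s, "@")
	if !ok || p == "" {
		return "", "", errors.New("expected <pk>@<host:port>")
	}
	// SplitHostPort rejects addresses without a port.
	if _, _, err := net.SplitHostPort(a); err != nil {
		return "", "", err
	}
	return p, a, nil
}
```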

3014 feat(skychat): cli skychat events — structured NDJSON event stream replaces fragile tail -F | awk | grep log monitors (which broke under logrotate copytruncate) with a structured, resumable SSE event feed. Each event gets a monotonic sequence backed by a 10k-deep, 24h-TTL ring, filtered by a dm|group|pair|system channel taxonomy; the CLI streams NDJSON, dedupes by id, and resumes losslessly across reconnects. The existing /sse stream and cli skychat listen are byte-for-byte unchanged.
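The resume mechanics rest on two things: a sequence number that only ever increases, and a bounded buffer a reconnecting client can replay from. The sketch below shows that core under simplifying assumptions: the Event and Ring types are hypothetical, the 24h TTL is omitted, eviction is a plain O(n) shift, and there is no locking.

```go
package main

// Event is a hypothetical event record.
type Event struct {
	Seq     uint64
	Channel string // "dm", "group", "pair" or "system"
	Data    string
}

// Ring keeps the most recent events, up to its capacity.
type Ring struct {
	buf  []Event
	next uint64 // sequence number the next event will receive
}

// NewRing returns a ring holding at most capacity events (minimum 1).
func NewRing(capacity int) *Ring {
	if capacity < 1 {
		capacity = 1
	}
	return &Ring{buf: make([]Event, 0, capacity), next: 1}
}

// Append assigns the next sequence number and evicts the oldest event
// once the ring is full.
func (r *Ring) Append(channel, data string) Event {
	e := Event{Seq: r.next, Channel: channel, Data: data}
	r.next++
	if len(r.buf) == cap(r.buf) {
		copy(r.buf, r.buf[1:])
		r.buf = r.buf[:len(r.buf)-1]
	}
	r.buf = append(r.buf, e)
	return e
}

// Since returns retained events with Seq greater than lastSeq, optionally
// restricted to one channel ("" means all channels).
func (r *Ring) Since(lastSeq uint64, channel string) []Event {
	var out []Event
	for _, e := range r.buf {
		if e.Seq > lastSeq && (channel == "" || e.Channel == channel) {
			out = append(out, e)
		}
	}
	return out
}
```

A client that remembers the last sequence it processed can reconnect and call Since with that value; as long as the gap still fits in the ring, nothing is lost and nothing is delivered twice.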

3019 feat(logserver): server-side filtering on /visor.log + docs lets the dmsg/skynet-served log endpoint ship only matching lines instead of the whole multi-MB log over a slow hop. Query-param filters (min-level, module regex, grep, since-line, limit, follow) are plumbed through cli log file <pk>, with a strict-level mode that drops non-standard lines so free-form library output can’t sneak past. It also ships docs/visor-logging-access.md, which documents the three-whitelist auth model, and an RFC for a cli visor doctor command. The RFC was motivated by the week’s bugs: each was diagnosable with under a second of probing, yet went undiagnosed because nothing was probing.
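How the filters might compose on the server side is sketched below. It works on already-parsed lines to avoid guessing at the log format; the Line and Filter types, the level ordering, and treating grep as a plain substring match are assumptions, and follow is left out.

```go
package main

import (
	"regexp"
	"strings"
)

// Line is a hypothetical parsed log line. Level is "" for non-standard
// lines such as free-form library output.
type Line struct {
	N      int // 1-based line number
	Level  string
	Module string
	Text   string
}

// Filter mirrors the query parameters listed above (follow omitted).
type Filter struct {
	MinLevel  string // "" means no minimum
	Module    string // regular expression, "" means any module
	Grep      string // ASSUMED to be a substring match
	SinceLine int    // only lines with N greater than this
	Limit     int    // 0 means unlimited
	Strict    bool   // drop lines without a recognised level
}

// levelRank is an assumed ordering of standard levels.
var levelRank = map[string]int{"debug": 1, "info": 2, "warn": 3, "error": 4}

// Apply returns the lines that pass every configured filter.
func (f Filter) Apply(lines []Line) ([]Line, error) {
	var mod *regexp.Regexp
	if f.Module != "" {
		re, err := regexp.Compile(f.Module)
		if err != nil {
			return nil, err
		}
		mod = re
	}
	minRank := levelRank[f.MinLevel]
	var out []Line
	for _, l := range lines {
		if l.N <= f.SinceLine {
			continue
		}
		rank, standard := levelRank[l.Level]
		if !standard && f.Strict {
			continue // strict mode: non-standard lines cannot sneak past
		}
		if standard && rank < minRank {
			continue
		}
		if mod != nil && !mod.MatchString(l.Module) {
			continue
		}
		if f.Grep != "" && !strings.Contains(l.Text, f.Grep) {
			continue
		}
		out = append(out, l)
		if f.Limit > 0 && len(out) >= f.Limit {
			break
		}
	}
	return out, nil
}
```

Without strict mode, a line with no recognisable level has nothing to compare against min-level and so passes that check, which is the loophole the strict-level mode closes.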

3020 fix(tpd): pool gzip.Writer in gzipBytes to cut per-miss alloc pools the gzip writer that the transport-discovery allocates on every cache miss, trimming the per-miss allocation that still showed up in CPU pprof after the earlier metrics-cadence work.
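Pooling a gzip.Writer with sync.Pool is a common Go pattern, and a minimal version looks like the sketch below. The gzipBytes name comes from the PR title, but the body here is an illustration rather than the transport-discovery's code, and gunzipBytes is a hypothetical helper added only to make the round trip checkable.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"io"
	"sync"
)

// gzipWriters recycles writers so that a cache miss does not allocate a
// fresh compressor (and its internal buffers) every time.
var gzipWriters = sync.Pool{
	New: func() any { return gzip.NewWriter(io.Discard) },
}

// gzipBytes compresses data using a pooled writer.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzipWriters.Get().(*gzip.Writer)
	defer gzipWriters.Put(zw)
	zw.Reset(&buf) // rebind the recycled writer to this call's buffer
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // flushes the gzip footer
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzipBytes decompresses data; it exists only to verify the round trip.
func gunzipBytes(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```

Reset is what makes reuse safe: it discards the writer's previous state and points it at the new destination, so a writer returned to the pool after Close can be handed out again.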

Skywire: Tooling and Cleanups

3022 chore(deps,gotop): update-deps + gotop battery v0.11.0 port, -M disk usage, version overlay bumps all dependencies to latest (now possible on Go 1.26.4) and ports gotop’s battery widget to distatus/battery v0.11.0, whose State changed from an enum to a struct. The multiload (-M) display gains root-filesystem size/percent-used/free in the disk title, shortened temperature sensor names, a legend mask so the braille graph no longer bleeds through labels, and a version overlay drawing the full build string in the top-right corner. A follow-up pins bitfield/script back to v0.24.1 to keep the go.mod directive at 1.26.1, which the CI runners and the nix musl build can satisfy.

And a handful of cleanups: 3021 replicates a dependabot hono lockfile bump onto the fork, and 3016 removes two accidentally-committed local artifacts (an empty transports dump and a 19KB run log) with .gitignore guards so a visor run from the repo root can’t re-stage them.