Implementation notes: controls supervisor & lifecycle hardening¶
Spec: 2026-06-12-controls-supervisor-lifecycle.md
Branch: fix/controls-supervisor-lifecycle
What was implemented¶
All seven findings (and the three folded-in low/info items) are addressed.
- D1 β Explicit run outcomes.
Services.classifyRun(ctx, err)returns one ofoutcomeCleanStart,outcomeCancelled,outcomeError.runOnceand the newrunOnceWithRestartconsult it. AnilStart return is a clean start, never an exit; a clean-start service with no health monitoring blocks inmonitorHealthon<-ctx.Done()rather than falling through to a restart. - D2 β No nil on errs; consecutive-failure counter.
runWithRestartPolicyonly forwards non-nil Start errors; health failures store their error viamonitorHealth/updateInfo. The counter resets afterRestartResetInterval(defaultDefaultRestartResetInterval= 30 s).errors.Wrap(nil, β¦)is guarded in the max-restarts path. - D3 β Idempotent Start.
StartdoescompareAndSetState(Unknown, Running)and returns early otherwise. The service count is snapshotted underservices.mubefore the matchingwg.Add. - D4 β No busy-spin; goroutines terminate. The error/context handler nils its
local
donechannel after first receipt and exits onshutdownComplete(closed byhandleStopMessage), draining buffered errors first. The message processor and signal handler also exit onshutdownComplete; the signal handler adds a second-signal force-exit. - D5 β Validated lifecycle funcs.
Services.adddefaults missingStart/StoptonoopStart/noopStop.callStoprecover-wraps the stop call (matching the existingcallProbepattern). - D6 β Signal disposition + force-stop.
signal.Notifyis deferred until after options inNewControllerand only fires whenc.signals != nil.SetSignalsChanneland shutdown callsignal.Stop.Services.stopiterates in reverse registration order, running eachStopFuncin a goroutine awaited againstctx.Done()so a context-ignoring stop is abandoned at the deadline. - D7 β Readiness fails closed; ValidErrorFunc wired.
toServiceStatustakes afailClosedflag (true only forReadiness); a nil cached async result is reported not-ready.WithValidErrorsetsController.validError, propagated toServices.validErrorinStartand consulted byclassifyRun.
Deviations from the spec (and why)¶
Registerdoes NOT return an error. The spec offered "return an error OR default to no-ops" (D5). I chose default-to-no-ops.Registeris on the widely-implementedRunner/Controllableinterface with a generated mock and many call sites (pkg/grpc,pkg/http,pkg/gateway,pkg/telemetry). Defaulting keeps the signature stable, avoids a mock regen + multi-package churn, and still satisfies "every call site nil-checks before invoking" because the funcs are never nil after registration. Net result: no breaking exported-signature change (see O4).- Reverse-order stop is best-effort past the deadline. When a
StopFuncexceeds the shutdown deadline, the remaining (earlier-registered) services still get a stop attempt, but with the deadline already elapsed their ownctx.Done()fires immediately. The abandoned goroutine is left to finish on its own (a small, bounded leak that ends when the misbehaving stop eventually returns). This favours "always return fromWait()" over "guarantee every stop completes", which is the spec's intent. monitorHealthnow returnsbool(true = ended by health-failure). It also blocks on<-ctx.Done()when there is nothing to monitor, so a clean start is never restarted. This is internal-only.
O4 β Enumerated exported-signature changes (for changelog/migration)¶
Migration note: docs/migration/v0.16-controls-supervisor.md.
| Symbol | Before | After | Kind |
|---|---|---|---|
controls.WithValidError(ValidErrorFunc) ControllerOpt |
β | new | Additive |
controls.WithRestartResetInterval(time.Duration) ServiceOption |
β | new | Additive |
controls.RestartPolicy.RestartResetInterval time.Duration |
β | new field | Additive |
controls.DefaultRestartResetInterval (= 30s) |
β | new const | Additive |
controls.Register |
Register(id string, opts ...ServiceOption) |
unchanged | No change |
There is no breaking exported-signature change. The breaking change is
behavioural: a clean Start return is no longer restarted, and MaxRestarts
now counts consecutive (not lifetime) failures. Documented with a BREAKING
CHANGE: footer and the migration note above. Pre-1.0, this is a MINOR bump.
Verification¶
go test ./pkg/controls/β green.go test -race ./pkg/controls/β green across repeated runs (the core goal).golangci-lint run ./pkg/controls/...β 0 issues.- Coverage on
pkg/controls: ~96% of statements (β₯90% gate met). - Dependent packages (
pkg/grpc,pkg/http,pkg/gateway,pkg/telemetry) and the fullgo test ./...suite pass. mockerywas not run locally (binary not installed in the worktree), but no mocked interface signature changed, somocks/pkg/controls/Controllable.gois unaffected and still compiles (go build ./mocks/...passes).- Controls E2E BDD (
INT_TEST_E2E_CONTROLS=1 just test-e2e) β ran green in this worktree (graceful shutdown with real HTTP+gRPC, restart-after-health-failure, idempotent stop, context-cancellation, readiness/liveness scenarios all pass).
Open questions for review¶
Registerreturn value. Confirmed choice: default-to-no-ops over returning an error (keeps the API non-breaking). If reviewers prefer the validatingRegister(...) errorvariant, it is a larger, mock-regenerating change β flag if wanted.- Restart-reset default of 30 s. Taken verbatim from the resolved O1. Confirm this is the right default for the common embedder.
- Best-effort reverse-stop after deadline. Confirm the "abandon and continue, leaving the goroutine to finish" semantics (vs. abandoning all remaining stops at the first deadline breach).
- E2E ran locally; integration tests pending CI. The controls BDD E2E suite
(
INT_TEST_E2E_CONTROLS=1 just test-e2e) ran green in this worktree. TheINT_TEST=1 just test-integrationgroup (which also stands up real servers) was not separately run here; the*_integration_test.gofiles compile against the new API. Please confirm the integration group in CI. - Timing tolerances.
TestForceStop_*uses a 100 ms shutdown timeout andTestReadiness_FailsClosed*blocks the first async run via a channel. These are poll/Eventually-based, not hard sleeps, but confirm the tolerances hold on slower CI runners under-race.