Migrating: pkg/controls supervisor & lifecycle hardening

This release hardens the pkg/controls supervisor and goroutine lifecycle. It fixes seven interlocking defects (see the spec). The change is behaviourally breaking in the cases described under Breaking Changes below, but it changes no exported signatures: all API additions are backward compatible.


Breaking Changes

Restart policy no longer restarts a clean Start return

Package: pkg/controls

Previously, a service whose StartFunc returned nil (the normal case for a server that spawns its listener in a background goroutine) was treated as having exited and, if it had a RestartPolicy, was restarted in a loop, including sending nil on the error channel.

Each run is now classified explicitly. A nil return is a clean start, not an exit: the service is supervised via its StatusFunc/health check (when HealthFailureThreshold > 0) and otherwise simply runs until shutdown. Only a genuine error (or a health-threshold breach) triggers a restart.
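The classification described above can be sketched as a small pure function. This is an illustration only, not the pkg/controls implementation: the Outcome type, the classify and shouldRestart helpers, and the sample errors are all hypothetical names. The health-threshold breach, the other restart trigger, is not modelled here.

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome is a hypothetical label for how one Start call ended.
type Outcome int

const (
	CleanStart  Outcome = iota // Start returned nil: the service is up
	GracefulEnd                // Start returned an expected terminal error
	Failure                    // Start returned a genuine error
)

// Sample errors for the demonstration (hypothetical).
var (
	errExpectedShutdown = errors.New("expected shutdown")
	errBind             = errors.New("bind: address already in use")
)

// isExpected plays the role of a valid-error predicate.
func isExpected(err error) bool { return errors.Is(err, errExpectedShutdown) }

// classify maps a Start return value to an Outcome. valid may be nil, in
// which case every non-nil error counts as a failure.
func classify(err error, valid func(error) bool) Outcome {
	switch {
	case err == nil:
		return CleanStart
	case valid != nil && valid(err):
		return GracefulEnd
	default:
		return Failure
	}
}

// shouldRestart reports whether the outcome alone justifies a restart.
func shouldRestart(o Outcome) bool { return o == Failure }

func main() {
	for _, err := range []error{nil, errExpectedShutdown, errBind} {
		o := classify(err, isExpected)
		fmt.Printf("err=%v outcome=%d restart=%t\n", err, o, shouldRestart(o))
	}
}
```

Only the Failure outcome leads to a restart in this sketch; a clean start hands supervision over to the health check, if one is configured.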

Before (buggy):

// A RestartPolicy on a background-serving service caused a restart storm at
// startup, because Start returning nil was misread as "exited".
controller.Register("api",
    controls.WithStart(func(ctx context.Context) error {
        go srv.Serve(ln) // serve in the background
        return nil       // Start itself returns nil immediately
    }),
    controls.WithRestartPolicy(controls.RestartPolicy{MaxRestarts: 5}),
)

After (correct):

The same registration no longer restarts on the clean nil return. To have such a service restarted on failure, supervise it via a health check:

controller.Register("api",
    controls.WithStart(func(ctx context.Context) error {
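        // state is a caller-defined, concurrency-safe holder for the serve
        // error; it is not part of pkg/controls.
        go func() {
            if err := srv.Serve(ln); err != nil && !errors.Is(err, http.ErrServerClosed) {
                state.setExit(err)
            }
        }()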
        return nil
    }),
    controls.WithStatus(func() error { return state.exitErr() }),
    controls.WithRestartPolicy(controls.RestartPolicy{
        MaxRestarts:            5,
        HealthFailureThreshold: 3,
        HealthCheckInterval:    5 * time.Second,
    }),
)

Migration: Most callers need no change: the built-in pkg/http and pkg/grpc transports register without a restart policy and were never meant to restart on a clean start. If you relied on the old (incorrect) restart-on-clean-return behaviour, switch to health-check-driven supervision as shown above.
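The "After" example calls state.setExit and state.exitErr without defining state. One possible shape for that helper is sketched below, assuming a mutex-guarded holder that keeps the first reported error. The exitState type and its reset method are hypothetical; reset is included on the assumption that a restarted run should begin with a healthy status.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// exitState is a hypothetical helper that carries the serving goroutine's
// exit error to the status function.
type exitState struct {
	mu  sync.Mutex
	err error
}

// setExit records the first error reported; later errors are ignored.
func (s *exitState) setExit(err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.err == nil {
		s.err = err
	}
}

// exitErr returns the recorded error, or nil while nothing has failed.
func (s *exitState) exitErr() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.err
}

// reset clears the recorded error. A Start function could call it before
// launching a new run so that stale errors do not fail the next health check.
func (s *exitState) reset() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.err = nil
}

// Sample errors for the demonstration (hypothetical).
var (
	errDemo  = errors.New("accept: too many open files")
	errOther = errors.New("a later error")
)

func main() {
	state := &exitState{}
	fmt.Println("healthy:", state.exitErr() == nil)
	state.setExit(errDemo)
	state.setExit(errOther) // ignored: the first error is kept
	fmt.Println("status:", state.exitErr())
	state.reset()
	fmt.Println("healthy after reset:", state.exitErr() == nil)
}
```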

RestartPolicy.MaxRestarts now counts consecutive failures

MaxRestarts previously counted lifetime restarts. It now counts consecutive failures: after a service runs healthily for RestartResetInterval (default 30 s), the counter resets to zero. A service that fails, recovers, and later fails again is no longer prematurely declared as having exceeded its restart budget.
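The difference between lifetime and consecutive counting can be illustrated with a small tracker. This is a sketch under stated assumptions rather than the library's code: restartBudget, recordFailure, and seconds are hypothetical names, and the sketch assumes that a budget of N permits N consecutive restarts before the service is given up on.

```go
package main

import (
	"fmt"
	"time"
)

// defaultResetInterval mirrors the documented 30 s default.
const defaultResetInterval = 30 * time.Second

// restartBudget is a hypothetical tracker of consecutive failures.
type restartBudget struct {
	maxRestarts   int
	resetInterval time.Duration // zero selects defaultResetInterval
	consecutive   int
}

// recordFailure notes a failure that ended a run which had been healthy for
// healthyFor, and reports whether another restart is still within budget.
func (b *restartBudget) recordFailure(healthyFor time.Duration) bool {
	interval := b.resetInterval
	if interval <= 0 {
		interval = defaultResetInterval
	}
	if healthyFor >= interval {
		b.consecutive = 0 // the run was healthy long enough: start counting afresh
	}
	b.consecutive++
	return b.consecutive <= b.maxRestarts
}

// seconds is a small convenience for building durations.
func seconds(n int) time.Duration { return time.Duration(n) * time.Second }

func main() {
	b := &restartBudget{maxRestarts: 2}
	runs := []time.Duration{seconds(1), seconds(1), seconds(45), seconds(1), seconds(1)}
	for _, healthyFor := range runs {
		fmt.Printf("ran %v, restart allowed: %t\n", healthyFor, b.recordFailure(healthyFor))
	}
}
```

With lifetime counting, the third failure in this sequence would exhaust a budget of 2; with consecutive counting, the 45-second healthy run resets the counter first.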


New (additive, non-breaking) APIs

WithValidError(fn ValidErrorFunc) ControllerOpt

Registers a predicate identifying expected terminal errors (e.g. http.ErrServerClosed, context.Canceled). A matching error is treated as a graceful end-of-run: it neither counts toward the restart total nor is forwarded on the error channel.

controller := controls.NewController(ctx,
    controls.WithValidError(func(err error) bool {
        return errors.Is(err, http.ErrServerClosed)
    }),
)
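A predicate that accepts several expected errors, including wrapped ones, might look like the sketch below. The function name isExpectedShutdown and the sample errors are hypothetical; only the func(error) bool shape is taken from the example above. errors.Is is used so that errors wrapped with %w still match.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// isExpectedShutdown is a hypothetical predicate that treats a closed HTTP
// server and a cancelled context as graceful ends of a run.
func isExpectedShutdown(err error) bool {
	return errors.Is(err, http.ErrServerClosed) || errors.Is(err, context.Canceled)
}

// Sample errors for the demonstration (hypothetical).
var (
	errWrappedClosed = fmt.Errorf("api listener: %w", http.ErrServerClosed)
	errWrappedCancel = fmt.Errorf("worker: %w", context.Canceled)
	errBind          = errors.New("listen tcp :8080: bind: address already in use")
)

func main() {
	for _, err := range []error{errWrappedClosed, errWrappedCancel, errBind} {
		fmt.Printf("%q expected=%t\n", err, isExpectedShutdown(err))
	}
}
```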

WithRestartResetInterval(d time.Duration) ServiceOption

Sets the healthy-run duration after which a service's consecutive-failure counter resets (default controls.DefaultRestartResetInterval = 30 s). Implies a restart policy if the service has none.

RestartPolicy.RestartResetInterval time.Duration

New field on the existing struct (zero selects the default). Adding a struct field is backward compatible for keyed struct literals such as RestartPolicy{MaxRestarts: 5}; only unkeyed (positional) literals would stop compiling.


Other behavioural fixes (no API change)

  • Idempotent Start(): a second Start() while already running is a no-op (no double-start, no hung Wait()).
  • No busy-spin: the error/context handler no longer spins a CPU core after cancellation; all controller goroutines terminate at shutdown.
  • Nil Start/Stop: services registered without these default to no-ops instead of panicking.
  • WithoutSignals now genuinely leaves the default OS signal disposition in place (signal.Notify is registered only after options are applied, and signal.Stop is called when the channel is swapped or at shutdown).
  • Force-stop: on shutdown, services stop in reverse registration order, one at a time; a StopFunc that ignores its context is abandoned at the shutdown deadline rather than hanging Wait() forever.
  • Readiness fails closed: /readyz reports not-ready for an async readiness check until its first run completes.
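The force-stop behaviour in the list above can be sketched as follows. This is a minimal illustration, not the controller's code: the service type, stopAll, and demoStop are hypothetical names. Each stop function runs in its own goroutine so that one ignoring its context can be abandoned when the deadline passes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// service is a hypothetical stand-in for a registered service.
type service struct {
	name string
	stop func(ctx context.Context)
}

// stopAll stops services in reverse registration order, one at a time.
// If a stop function has not returned when ctx expires, it is abandoned:
// its goroutine is left running and stopAll moves on.
func stopAll(ctx context.Context, services []service) []string {
	var log []string
	for i := len(services) - 1; i >= 0; i-- {
		svc := services[i]
		done := make(chan struct{})
		go func() {
			defer close(done)
			svc.stop(ctx)
		}()
		select {
		case <-done:
			log = append(log, svc.name+": stopped")
		case <-ctx.Done():
			log = append(log, svc.name+": abandoned")
		}
	}
	return log
}

// demoStop registers three services; the first one ignores its context.
func demoStop() []string {
	services := []service{
		{name: "hang", stop: func(context.Context) { time.Sleep(time.Hour) }},
		{name: "db", stop: func(context.Context) {}},
		{name: "api", stop: func(context.Context) {}},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	return stopAll(ctx, services)
}

func main() {
	for _, line := range demoStop() {
		fmt.Println(line)
	}
}
```

In this sketch, once the deadline has passed, any remaining service still receives its stop call but may be reported as abandoned even if it returns promptly, because both select cases are ready at that point.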