Skip to content

Panic Recovery Around Plugin RunChecks() Calls

Problem

PluginRegistry.RunComplianceChecks in internal/audit/plugin.go called p.RunChecks(device) without any panic protection. The compliance.Plugin interface allows arbitrary implementations, including dynamically-loaded .so plugins. Any panic in RunChecks() propagated uncaught through the audit pipeline, killing the process and losing all other plugins' results.

Symptoms: If any compliance plugin panicked during RunChecks(), the entire audit crashed. Other healthy plugins' results were lost. No structured error reporting for the failure.

Root Cause

The compliance plugin interface (compliance.Plugin) is an extension point designed for both built-in and dynamically-loaded plugins. Dynamic plugins loaded via Go's plugin package can contain arbitrary code. A bare p.RunChecks(device) call provides no isolation boundary, so any panic in plugin code propagates up the call stack and terminates the audit.

Solution

Wrap the p.RunChecks(device) call in an immediately-invoked function with defer recover(). Add a *logging.Logger parameter to RunComplianceChecks for structured panic logging.

Code Example

// Before: bare call, panic propagates and kills the process
findings := p.RunChecks(device)

// After: panic recovery with structured logging and stack trace
var findings []compliance.Finding

func() {
    defer func() {
        if r := recover(); r != nil {
            logger.Error("plugin panicked during RunChecks",
                "plugin", pluginName,
                "panic", r,
                "stack", string(debug.Stack()),
            )
        }
    }()
    findings = p.RunChecks(device)
}()

The immediately-invoked function literal creates a deferred recovery scope isolated to the single plugin call. If RunChecks panics, the deferred function catches it, logs the plugin name and panic value, and execution continues with findings at its zero value (nil slice). The surrounding loop proceeds to process remaining plugins unaffected.

Key Design Decisions

  1. Logger parameter uses *logging.Logger -- matches the project's charmbracelet/log-based logging, ensuring panic recovery logs respect the application's configured output format, level, and destination. A nil guard defaults to a fallback logger so callers can never trigger a secondary panic on the recovery path.
  2. Stack trace included via runtime/debug.Stack() -- matches the pattern in internal/processor/processor.go, giving actionable debugging information for misbehaving plugins (especially dynamic .so files).
  3. Panicked plugins retained in results with safe defaults -- a panicked flag gates the post-recovery path. When set, safe defaults (pluginName, Version: "unknown (panicked)", empty compliance map) are populated and the loop continues, skipping method calls (Name(), Description(), GetControls()) on the potentially corrupt plugin. Downstream consumers can see the plugin was requested and evaluated. See GOTCHAS.md SS2.2.
  4. Post-recovery method calls avoided -- after a panic, the plugin's internal state may be corrupt. Calling p.Name(), p.GetControls(), etc. could trigger a secondary unrecovered panic. The recovery path uses only the pluginName string already in scope from the loop variable.

Files Changed

  • internal/audit/plugin.go -- core panic recovery wrapper with *logging.Logger and debug.Stack()
  • internal/audit/plugin_manager.go -- passes pm.logger (configured logger)
  • internal/audit/mode_controller.go -- passes mc.logger (configured logger)
  • internal/audit/plugin_global_test.go -- mockPanickingPlugin and isolation tests added
  • internal/audit/mode_controller_test.go -- call sites updated, LoadDynamicPlugins test enabled

Prevention

Pattern

When calling any method on an interface implementation that may originate from external or dynamic sources (plugin .so files, user-provided implementations), wrap the call in a deferred recover() that captures the panic value and logs it. The recovery boundary should be as narrow as possible -- one per plugin invocation, not one global recovery around the entire audit loop.

Where Else This Might Apply

Call site Method Status
RunComplianceChecks plugin.RunChecks() Protected (this fix)
InitializePlugins plugin.Name(), plugin.Version() Review needed -- dynamic plugin metadata accessors can panic
LoadDynamicPlugins plugin.Lookup("Plugin") result casting Review needed -- type assertion on plugin.Symbol can panic
GetControls() / GetControlByID() Called in deriveSeverityFromControl Review needed -- called during severity derivation

Testing Strategy

Use the mockPanickingPlugin pattern:

  1. Embed mockCompliancePlugin, override RunChecks to panic
  2. Table-driven tests with both panicking and healthy plugins
  3. Assert: healthy findings preserved, panicked plugin present with zero findings, no error returned
  4. Run just test-race to confirm no data races in recovery closures

Review Checklist

  • Plugin interface method calls on dynamic/external plugins are wrapped in defer recover()
  • Recovery boundary is per-plugin, not per-audit-run
  • Panicked plugins retained in results (not skipped) per GOTCHAS.md SS2.2
  • Tests include mockPanickingPlugin scenarios alongside healthy plugins
  • just test-race passes