Skip to content

fix(server): add startup probe for gateway boot#417

Merged
drew merged 1 commit intomainfrom
409-startup-probe-tls-logs/an
Mar 18, 2026
Merged

fix(server): add startup probe for gateway boot#417
drew merged 1 commit intomainfrom
409-startup-probe-tls-logs/an

Conversation

@drew
Copy link
Collaborator

@drew drew commented Mar 17, 2026

Summary

Add a Kubernetes startupProbe for the gateway StatefulSet so slow single-node boots get startup slack before liveness restarts begin.
Tighten TLS probe log handling so immediate EOF-style disconnects from TCP socket probes do not show up as misleading handshake errors.

Related Issue

Closes #409

Changes

  • add configurable startupProbe values to the OpenShell Helm chart and render the probe on the gateway StatefulSet
  • downgrade only UnexpectedEof TLS accept failures to debug-level logging while keeping other handshake failures at error
  • add unit coverage for the TLS handshake classifier and document the startup probe behavior in the gateway architecture doc

Testing

  • mise run pre-commit passes, except for the existing untracked ignored scratch/scrub.sh SPDX check failure in this workspace
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@drew drew requested a review from a team as a code owner March 17, 2026 22:58
@drew drew self-assigned this Mar 17, 2026
@drew drew force-pushed the 409-startup-probe-tls-logs/an branch 2 times, most recently from e6080a0 to 70fc831 Compare March 17, 2026 23:17
@drew drew added the e2e label Mar 17, 2026
@drew drew merged commit 13f13c2 into main Mar 18, 2026
21 checks passed
@drew drew deleted the 409-startup-probe-tls-logs/an branch March 18, 2026 00:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: pod CrashLoopBackOff during cluster startup due to flannel race and aggressive liveness timing

2 participants