Provide visibility into Bootstrap Container behaviors - exit status, time, etc? #3811
Labels:
area/core (Issues core to the OS, variant independent)
area/kubernetes (K8s including EKS, EKS-A, and including VMW)
type/enhancement (New feature or request)
What I'd like:
We use a few bootstrap containers on startup: some of them label hosts, others handle custom max-pods calculations, etc. Because booting new hosts is more important to us than the occasional host that might boot "incorrectly configured", we choose to mark these as `essential=false` to ensure that we are never blocked in booting new capacity. (This decision has saved us many outages.)
The thing is ... once your host is booted, you have no idea whether or not the Bootstrap scripts worked. You can scroll through the journald logs, but that's it. You don't know how long a host waited to execute a script, how long it took to pull down an image, or what the exit codes were.
We want to keep track of the number of Bootstrap Containers that start up and fail so that we can alert on that, but not block the booting process. In an ideal world, we would also have some method for getting metrics on how long it took these containers to run, which would help us optimize our new-host boot time (but that's really for extra credit).
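To make the ask concrete, here is a rough sketch of the two numbers we'd want per host: how many bootstrap containers failed, and how long each one ran. The `BootstrapRun` record and its fields are hypothetical (nothing like it is exposed today, which is the point of this issue), and the sample values in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BootstrapRun:
    """Hypothetical record of one bootstrap container run (not an existing Bottlerocket type)."""
    name: str
    exit_code: int
    started_at: float   # seconds since epoch
    finished_at: float  # seconds since epoch


def summarize(runs):
    """Return (failed_count, durations_by_name) for a list of BootstrapRun records."""
    # A run counts as failed when its exit code is non-zero.
    failed = sum(1 for r in runs if r.exit_code != 0)
    # Wall-clock duration of each run, keyed by container name.
    durations = {r.name: r.finished_at - r.started_at for r in runs}
    return failed, durations
```

With two illustrative runs, `summarize([BootstrapRun("label-host", 0, 100.0, 103.5), BootstrapRun("max-pods", 1, 103.5, 110.0)])` reports one failure and a duration per container, which is enough to alert on failures without ever blocking boot.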
Preferred Behavior
When I think about how to approach this, I feel like the most natural thing is for each Bootstrap Container to become a "condition" on the node, so that a simple `kubectl describe node ...` will get you information on it. From there, metrics can be collected about which nodes have which conditions on them, and teams can develop any alerting or behaviors they need.
Any alternatives you've considered:
We first went down the path of trying to use the Node Problem Detector with this configuration (below), but discovered that it only tails logs from the moment it starts up, so it cannot react to logs that were written before it came up. It therefore has no visibility into the Bootstrap Containers.
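To illustrate the node-condition idea from Preferred Behavior, here is a minimal sketch of what one condition per bootstrap container might look like. The field names (`type`, `status`, `reason`, `message`, `lastHeartbeatTime`, `lastTransitionTime`) are the standard Kubernetes node condition fields; the condition type naming (`Bootstrap<Name>Failed`), the reason strings, and the helper functions are assumptions made up for this example, not anything Bottlerocket provides.

```python
import json
from datetime import datetime, timezone


def bootstrap_condition(container_name, exit_code, now=None):
    """Build a node condition dict for one bootstrap container (hypothetical naming scheme)."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    failed = exit_code != 0
    # Assumed convention: "max-pods" becomes the condition type "BootstrapMaxPodsFailed".
    camel = "".join(part.capitalize() for part in container_name.split("-"))
    return {
        "type": f"Bootstrap{camel}Failed",
        "status": "True" if failed else "False",
        "reason": "NonZeroExit" if failed else "ExitedCleanly",  # assumed reason strings
        "message": f"bootstrap container {container_name!r} exited with code {exit_code}",
        "lastHeartbeatTime": timestamp,
        "lastTransitionTime": timestamp,
    }


def status_patch(conditions):
    """Serialize conditions as a JSON body shaped like a node status patch."""
    return json.dumps({"status": {"conditions": conditions}})
```

Something on the host would still have to send a body like this to the node's status, which is roughly what the Node Problem Detector does for its own conditions. Once set, the condition would show up under Conditions in `kubectl describe node ...`.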