Question
ING
NL
Last activity: 7 Mar 2024 6:33 EST
Quiescing and graceful shutdown in Kubernetes environments
Good day.
Currently, we are attempting to implement zero-downtime rolling restarts of Pega deployments in Kubernetes.
Pega is deployed via up-to-date Helm charts into its own namespace.
What we expect to see during a rollout restart is pega-web pods terminating gracefully, with all clients connected to those pods via the Kubernetes Ingress/Service moved to new pods without disruption.
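For context, the rolling update behaviour we are aiming for on the web tier looks roughly like the sketch below; the names, labels, image and numbers are illustrative placeholders, not values copied from the actual Pega Helm chart.

# Sketch of the rolling update behaviour we are aiming for on the web tier.
# Names, labels, image and numbers are illustrative placeholders, not values
# copied from the actual Pega Helm chart.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pega-web
  namespace: pega
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pega-web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep the full replica count serving during the rollout
      maxSurge: 1         # start a replacement pod before an old one is terminated
  template:
    metadata:
      labels:
        app: pega-web
    spec:
      containers:
        - name: pega-web-tomcat
          image: pega-web:latest   # placeholder image reference
          ports:
            - containerPort: 8080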
Unfortunately, we cannot quite achieve this yet.
What happens is that when a pega-web pod is scheduled for termination, it is immediately removed from the Endpoints resource and thus becomes unavailable to clients and ingresses.
No network traffic from the Ingress or end users is allowed to reach the terminating pod any more.
This is as expected, per design of Kubernetes itself.
This unfortunately means guaranteed downtime and loss of sessions for our end users.
We have tried quiescing with both immediate drain and slow drain, but the results do not differ much. We have also tried adding a preStop lifecycle hook to allow connection draining to happen before quiescing starts.
But alas, users connected to the pod that is being terminated lose their sessions and state.
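For reference, the relevant fragment of the pod template we experimented with looks roughly like this; the container name, sleep duration and grace period are placeholders from our attempts, not official Pega Helm chart values.

# Fragment of the pod template used in the drain-before-quiesce attempt.
# Container name, sleep duration and grace period are placeholders, not
# official Pega Helm chart values.
spec:
  template:
    spec:
      # Allow time for the preStop sleep plus Pega quiesce/shutdown before the
      # kubelet escalates SIGTERM to SIGKILL.
      terminationGracePeriodSeconds: 300
      containers:
        - name: pega-web-tomcat
          lifecycle:
            preStop:
              exec:
                # Hold the pod open so in-flight requests can finish while the
                # endpoint removal propagates, before quiescing starts.
                command: ["sh", "-c", "sleep 60"]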
Given your expertise with Pega Cloud (which also runs on Kubernetes), we would like to ask for your advice on the configuration we can use to achieve zero-downtime restarts.
This can be Kubernetes configuration (such as the use of EndpointSlices vs. Endpoints) or Pega configuration; we are open to any and all suggestions.