
Gateway API and EKS (recent regression): upstream connection timeout #32616

Closed
Smana opened this issue May 19, 2024 · 10 comments
Labels
feature/k8s-gateway-api, kind/community-report, kind/question, sig/agent

Comments

@Smana
Contributor

Smana commented May 19, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Hey,

Recently Gateway API stopped working on EKS with this error:

curl https://grafana-mycluster-0.priv.cloud.ogenki.io 
upstream connect error or disconnect/reset before headers. reset reason: connection timeout

At first I thought this was caused by recent changes in my demo repo, so I tried a branch that I had already used for demos (with Gateway API working perfectly). Unfortunately, even without any code changes, there is a regression. It's probably on the AWS side, but I haven't found the culprit so far.
Note that the traffic reaches the Envoy service and there are no TLS issues, but Envoy returns a 503:

curl https://capacitor-mycluster-0.priv.cloud.ogenki.io/ -vvv
* Host capacitor-mycluster-0.priv.cloud.ogenki.io:443 was resolved.
...
* Server certificate:
*  subject: C=France; O=Ogenki; CN=private-gateway.priv.cloud.ogenki.io
*  start date: May 18 11:20:19 2024 GMT
*  expire date: Aug 16 11:20:49 2024 GMT
*  subjectAltName: host "capacitor-mycluster-0.priv.cloud.ogenki.io" matched cert's "capacitor-mycluster-0.priv.cloud.ogenki.io"
*  issuer: O=Ogenki; CN=Private PKI - Vault Issuer
*  SSL certificate verify ok.
...
< HTTP/1.1 503 Service Unavailable
< content-length: 91
< content-type: text/plain
< date: Sat, 18 May 2024 16:07:04 GMT
< server: envoy
< 
* Connection #0 to host capacitor-mycluster-0.priv.cloud.ogenki.io left intact
upstream connect error or disconnect/reset before headers. reset reason: connection timeout
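For what it's worth, one way to narrow down where the timeout occurs is to check the Cilium agent and Envoy logs on the node that handled the request. A minimal sketch, assuming a default installation in kube-system with the dedicated Envoy DaemonSet enabled (envoy.enabled=true):

# Cilium agent logs, filtered for Envoy/upstream errors (kubectl picks one pod of the DaemonSet)
kubectl -n kube-system logs ds/cilium --timestamps | grep -iE 'envoy|upstream'

# With the dedicated Envoy DaemonSet, its logs can be read directly
kubectl -n kube-system logs ds/cilium-envoy --timestamps | grep -i 'connection timeout'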

Everything seems OK from the Gateway API resources' perspective.

httproute

kubectl get httproute -n flux-system -o yaml capacitor | yq .status
parents:
  - conditions:
      - lastTransitionTime: "2024-05-18T11:07:08Z"
        message: Accepted HTTPRoute
        observedGeneration: 1
        reason: Accepted
        status: "True"
        type: Accepted
      - lastTransitionTime: "2024-05-18T11:05:42Z"
        message: Service reference is valid
        observedGeneration: 1
        reason: ResolvedRefs
        status: "True"
        type: ResolvedRefs
    controllerName: io.cilium/gateway-controller
    parentRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: platform-private
      namespace: infrastructure

gateway

kubectl get gateway -n infrastructure platform-private -o yaml | yq .status
addresses:
  - type: Hostname
    value: ae49855e1f64942b49d15fdf501b7cc3-45206332.eu-west-3.elb.amazonaws.com
conditions:
  - lastTransitionTime: "2024-05-18T11:07:08Z"
    message: Gateway successfully scheduled
    observedGeneration: 1
    reason: Accepted
    status: "True"
    type: Accepted
  - lastTransitionTime: "2024-05-18T11:07:12Z"
    message: Gateway successfully reconciled
    observedGeneration: 1
    reason: Programmed
    status: "True"
    type: Programmed
listeners:
  - attachedRoutes: 3
    conditions:
      - lastTransitionTime: "2024-05-18T11:20:49Z"
        message: Listener Programmed
        observedGeneration: 1
        reason: Programmed
        status: "True"
        type: Programmed
      - lastTransitionTime: "2024-05-18T11:20:49Z"
        message: Listener Accepted
        observedGeneration: 1
        reason: Accepted
        status: "True"
        type: Accepted
      - lastTransitionTime: "2024-05-18T11:20:49Z"
        message: Resolved Refs
        reason: ResolvedRefs
        status: "True"
        type: ResolvedRefs
    name: http
    supportedKinds:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute

Of course, I checked obvious things such as the service being reachable using port-forward.
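For reference, a minimal sketch of that check; the Service name and port below are assumptions based on the HTTPRoute above:

# Bypass the Gateway entirely and hit the backend Service directly
kubectl -n flux-system port-forward svc/capacitor 8080:80 &
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/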

Regards,
Smana

Cilium Version

Tested with
v1.15.5
v1.15.3
v1.15.0

Kernel Version

The one provided in the AMI bottlerocket-aws-k8s-1.29-x86_64-v1.20.0-fcf71a47

Kubernetes Version

v1.29.4

Regression

This seems to be a regression, but not directly related to Cilium changes.
Indeed, a branch that previously worked for Gateway API demos no longer works (same behavior).

Sysdump

cilium-sysdump-20240519-112239.zip

Relevant log output

No response

Anything else?

I've searched for similar issues, but they are pretty old:
#23906
#20942

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
Smana added the kind/bug, kind/community-report, and needs/triage labels on May 19, 2024
@audacioustux

audacioustux commented May 20, 2024

Probably facing the same issue, but with an Ingress resource. Getting approximately 6-10% 503 errors at random (even when there's no load).
(screenshot attached)

helm values:

    values: {
      kubeProxyReplacement: 'strict',
      k8sServiceHost: cluster.endpoint.apply((endpoint) => endpoint.replace('https://', '')),
      ingressController: {
        enabled: true,
        loadbalancerMode: 'shared',
        default: true,
      },
      hubble: {
        relay: {
          enabled: true,
        },
        ui: {
          enabled: true,
        },
      },
      loadBalancer: {
        algorithm: 'maglev',
        l7: {
          backend: 'envoy',
        },
      },
      envoy: {
        enabled: true,
      },
      routingMode: 'native',
      bpf: {
        masquerade: true,
      },
      ipam: {
        mode: 'eni',
      },
      eni: {
        enabled: true,
        awsEnablePrefixDelegation: true,
      },
    }

@bhm-kyndryl

Hi there,

Having the same problem with AWS EKS 1.27. I created an issue for this a few days ago:

"Gateway API backend PODs intermittent timeouts in hostNetwork mode"
#32592

On my side, if the backend webserver pod and the Envoy instance receiving the HTTP request from the ALB are on the same worker node, it is a 100% success rate. Otherwise, 100% failure.
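One way to reproduce that split is to pin a throwaway client pod to a chosen worker node and curl the backend Service from there; a sketch, where the node name, Service name, namespace, and port are all placeholders:

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<worker-node-name>"}}' \
  -- curl -sv http://<backend-service>.<namespace>.svc.cluster.local:<port>/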

Bye

@sayboras
Member

sayboras commented May 21, 2024

@Smana Thanks for your issue. We don't have any test coverage for Bottlerocket right now (similar issue #32610); just curious whether you are facing the same issue with Amazon Linux.

@audacioustux @bhm-kyndryl It's hard to tell if the issue is the same; however, we have a couple of fixes merged recently in main, and we'd appreciate it if you could test with the main branch.

Again, thanks a lot for your issue and comment.
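For anyone who wants to try that, a hedged sketch of pointing an existing Helm installation at a CI build of main; the image repositories and tag are assumptions, so please check the Cilium documentation for the current CI image locations first:

helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set image.override=quay.io/cilium/cilium-ci:<commit-sha> \
  --set operator.image.override=quay.io/cilium/operator-generic-ci:<commit-sha>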

@audacioustux

I've moved from the Bottlerocket image to AL2023, and it somehow got fixed completely. I'll hopefully try out the latest changes with Bottlerocket soon.
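For reference, a sketch of what such a switch can look like with eksctl; the cluster and nodegroup names are placeholders, and the flag assumes a recent eksctl release with AL2023 support:

# Create a replacement managed nodegroup on AL2023, then drain/delete the Bottlerocket one
eksctl create nodegroup --cluster <cluster-name> --name al2023-nodes \
  --node-ami-family AmazonLinux2023 --nodes 3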

@Smana
Contributor Author

Smana commented May 21, 2024

Same here: switching to AL2023 indeed fixed my issues. I'm running a few additional tests before closing. Is there an issue to follow for Bottlerocket support?

@Smana
Contributor Author

Smana commented May 22, 2024

I still have networking issues with AL2023, but that's probably another issue (maybe related to this). I'm going to try again with AL2.

@sayboras
Member

I don't think we have an issue to track the work for supporting Bottlerocket, though.

sayboras added the kind/question label and removed the kind/bug and needs/triage labels on May 22, 2024
@aleksanderaleksic

aleksanderaleksic commented May 23, 2024

We were upgrading Cilium from v1.14.6 (working) to v1.14.11, where we ran into the same issue as described above.
Hope that helps with narrowing down the changes.

We switched from Bottlerocket to AL2023 and it worked for us; not ideal, but it will do for now.

@sayboras
Member

It seems like the underlying issue is due to Bottlerocket, and not related to the Ingress/Gateway API implementation.

I am closing this issue, as we already have a couple of Bottlerocket-related issues (e.g. #32610). Feel free to re-open if you think otherwise. Thanks all.

@Smana
Contributor Author

Smana commented May 23, 2024

Yes, thank you @sayboras. I'm working on figuring out why AL2023 isn't working properly either. But indeed, this is not related, and I'll open an issue if necessary.
