
Gateway API and EKS (recent regression): upstream connection timeout #32616

Closed
Smana opened this issue May 19, 2024 · 10 comments
Labels
feature/k8s-gateway-api, kind/community-report, kind/question, sig/agent

Comments

@Smana
Contributor

Smana commented May 19, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Hey,

Recently Gateway API stopped working on EKS with this error:

curl https://grafana-mycluster-0.priv.cloud.ogenki.io 
upstream connect error or disconnect/reset before headers. reset reason: connection timeout

At first I thought this was caused by recent changes in my demo repo, so I tried a branch that I had already used for demos (with Gateway API working perfectly). Unfortunately, even without any code changes, there is a regression. It's probably on the AWS side, but I haven't found the culprit so far.
Note that the traffic reaches the Envoy service and there are no TLS issues, but Envoy returns a 503:

curl https://capacitor-mycluster-0.priv.cloud.ogenki.io/ -vvv
* Host capacitor-mycluster-0.priv.cloud.ogenki.io:443 was resolved.
...
* Server certificate:
*  subject: C=France; O=Ogenki; CN=private-gateway.priv.cloud.ogenki.io
*  start date: May 18 11:20:19 2024 GMT
*  expire date: Aug 16 11:20:49 2024 GMT
*  subjectAltName: host "capacitor-mycluster-0.priv.cloud.ogenki.io" matched cert's "capacitor-mycluster-0.priv.cloud.ogenki.io"
*  issuer: O=Ogenki; CN=Private PKI - Vault Issuer
*  SSL certificate verify ok.
...
< HTTP/1.1 503 Service Unavailable
< content-length: 91
< content-type: text/plain
< date: Sat, 18 May 2024 16:07:04 GMT
< server: envoy
< 
* Connection #0 to host capacitor-mycluster-0.priv.cloud.ogenki.io left intact
upstream connect error or disconnect/reset before headers. reset reason: connection timeout
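For what it's worth, one way to narrow down where the timeout occurs is to check the Cilium agent and Envoy logs on the node that handled the request. A minimal sketch, assuming a default installation in kube-system with the dedicated Envoy DaemonSet enabled (envoy.enabled=true):

# Cilium agent logs, filtered for Envoy/upstream errors (kubectl picks one pod of the DaemonSet)
kubectl -n kube-system logs ds/cilium --timestamps | grep -iE 'envoy|upstream'

# With the dedicated Envoy DaemonSet, its logs can be read directly
kubectl -n kube-system logs ds/cilium-envoy --timestamps | grep -i 'connection timeout'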

Everything seems OK from the Gateway API resources' perspective.

httproute

kubectl get httproute -n flux-system -o yaml capacitor | yq .status
parents:
  - conditions:
      - lastTransitionTime: "2024-05-18T11:07:08Z"
        message: Accepted HTTPRoute
        observedGeneration: 1
        reason: Accepted
        status: "True"
        type: Accepted
      - lastTransitionTime: "2024-05-18T11:05:42Z"
        message: Service reference is valid
        observedGeneration: 1
        reason: ResolvedRefs
        status: "True"
        type: ResolvedRefs
    controllerName: io.cilium/gateway-controller
    parentRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: platform-private
      namespace: infrastructure

gateway

kubectl get gateway -n infrastructure platform-private -o yaml | yq .status
addresses:
  - type: Hostname
    value: ae49855e1f64942b49d15fdf501b7cc3-45206332.eu-west-3.elb.amazonaws.com
conditions:
  - lastTransitionTime: "2024-05-18T11:07:08Z"
    message: Gateway successfully scheduled
    observedGeneration: 1
    reason: Accepted
    status: "True"
    type: Accepted
  - lastTransitionTime: "2024-05-18T11:07:12Z"
    message: Gateway successfully reconciled
    observedGeneration: 1
    reason: Programmed
    status: "True"
    type: Programmed
listeners:
  - attachedRoutes: 3
    conditions:
      - lastTransitionTime: "2024-05-18T11:20:49Z"
        message: Listener Programmed
        observedGeneration: 1
        reason: Programmed
        status: "True"
        type: Programmed
      - lastTransitionTime: "2024-05-18T11:20:49Z"
        message: Listener Accepted
        observedGeneration: 1
        reason: Accepted
        status: "True"
        type: Accepted
      - lastTransitionTime: "2024-05-18T11:20:49Z"
        message: Resolved Refs
        reason: ResolvedRefs
        status: "True"
        type: ResolvedRefs
    name: http
    supportedKinds:
      - group: gateway.networking.k8s.io
        kind: HTTPRoute

Of course, I checked obvious things such as the service being reachable using port-forward.
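For reference, a minimal sketch of that check; the Service name and port below are assumptions based on the HTTPRoute above:

# Bypass the Gateway entirely and hit the backend Service directly
kubectl -n flux-system port-forward svc/capacitor 8080:80 &
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/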

Regards,
Smana

Cilium Version

Tested with
v1.15.5
v1.15.3
v1.15.0

Kernel Version

The one provided in the AMI bottlerocket-aws-k8s-1.29-x86_64-v1.20.0-fcf71a47

Kubernetes Version

v1.29.4

Regression

This seems to be a regression, but not directly related to Cilium changes.
Indeed, a branch that previously worked for Gateway API demos no longer works (same behavior).

Sysdump

cilium-sysdump-20240519-112239.zip

Relevant log output

No response

Anything else?

I've searched for similar issues, but they are pretty old:
#23906
#20942

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
Smana added the kind/bug, kind/community-report, and needs/triage labels on May 19, 2024
@audacioustux

audacioustux commented May 20, 2024

Probably facing the same issue, but with an Ingress resource. Getting approximately 6-10% 503 errors at random (even when there's no load).
(screenshot attached)

helm values:

    values: {
      kubeProxyReplacement: 'strict',
      k8sServiceHost: cluster.endpoint.apply((endpoint) => endpoint.replace('https://', '')),
      ingressController: {
        enabled: true,
        loadbalancerMode: 'shared',
        default: true,
      },
      hubble: {
        relay: {
          enabled: true,
        },
        ui: {
          enabled: true,
        },
      },
      loadBalancer: {
        algorithm: 'maglev',
        l7: {
          backend: 'envoy',
        },
      },
      envoy: {
        enabled: true,
      },
      routingMode: 'native',
      bpf: {
        masquerade: true,
      },
      ipam: {
        mode: 'eni',
      },
      eni: {
        enabled: true,
        awsEnablePrefixDelegation: true,
      },
    }

@bhm-kyndryl

Hi there,

Having the same problem with AWS EKS 1.27. I created an issue for this a few days ago:

"Gateway API backend PODs intermittent timeouts in hostNetwork mode"
#32592

On my side, if the backend webserver pod and the Envoy instance receiving the HTTP request from the ALB are on the same worker node, it is a 100% success rate. Otherwise, 100% failure.
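One way to reproduce that split is to pin a throwaway client pod to a chosen worker node and curl the backend Service from there; a sketch, where the node name, Service name, namespace, and port are all placeholders:

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<worker-node-name>"}}' \
  -- curl -sv http://<backend-service>.<namespace>.svc.cluster.local:<port>/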

Bye

@sayboras
Member

sayboras commented May 21, 2024

@Smana Thanks for your issue. We don't have any test coverage for Bottlerocket right now (similar issue #32610); just curious whether you are facing the same issue with Amazon Linux.

@audacioustux @bhm-kyndryl It's hard to tell if the issue is the same; however, we have a couple of fixes merged recently in main, and we'd appreciate it if you could test with the main branch.

Again, thanks a lot for your issue and comment.
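For anyone who wants to try that, a hedged sketch of pointing an existing Helm installation at a CI build of main; the image repositories and tag are assumptions, so please check the Cilium documentation for the current CI image locations first:

helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set image.override=quay.io/cilium/cilium-ci:<commit-sha> \
  --set operator.image.override=quay.io/cilium/operator-generic-ci:<commit-sha>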

@audacioustux

I've moved from the Bottlerocket image to AL2023, and it somehow got fixed completely. I'll hopefully try out the latest changes with Bottlerocket soon.
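For reference, a sketch of what such a switch can look like with eksctl; the cluster and nodegroup names are placeholders, and the flag assumes a recent eksctl release with AL2023 support:

# Create a replacement managed nodegroup on AL2023, then drain/delete the Bottlerocket one
eksctl create nodegroup --cluster <cluster-name> --name al2023-nodes \
  --node-ami-family AmazonLinux2023 --nodes 3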

@Smana
Contributor Author

Smana commented May 21, 2024

Same here: switching to AL2023 indeed fixed my issues. I'm running a few additional tests before closing. Is there an issue to follow for Bottlerocket support?

@Smana
Contributor Author

Smana commented May 22, 2024

I still have networking issues with AL2023, but that's probably another issue (maybe related to this). I'm going to try again with AL2.

@sayboras
Member

I don't think we have an issue to track the work for supporting Bottlerocket, though.

sayboras added the kind/question label and removed the kind/bug and needs/triage labels on May 22, 2024
@aleksanderaleksic

aleksanderaleksic commented May 23, 2024

We were upgrading Cilium from v1.14.6 (working) to v1.14.11, where we ran into the same issue as described above.
Hope that helps with narrowing down the changes.

We switched from Bottlerocket to AL2023 and it worked for us; not ideal, but it will do for now.

@sayboras
Member

It seems like the underlying issue is due to Bottlerocket, and not related to the Ingress/Gateway API implementation.

I am closing this issue, as we already have a couple of Bottlerocket-related issues (e.g. #32610). Feel free to re-open if you think otherwise. Thanks all.

@Smana
Contributor Author

Smana commented May 23, 2024

Yes, thank you @sayboras. I'm working on figuring out why AL2023 isn't working properly either. But indeed, this is not related, and I'll open an issue if necessary.
