
Application stuck in deletion because of a postDelete hook #18270

Open
joffreychambrin opened this issue May 17, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@joffreychambrin

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Since we've upgraded to ArgoCD v2.10.0, we have a specific application that gets stuck in deletion. The only solution I have found is to manually remove the finalizers from the App.

To Reproduce

  1. Install the latest ArgoCD version on a Kubernetes cluster,
  2. Then add a Helm app with:
    URL: https://charts.releases.teleport.dev
    Chart: teleport-kube-agent
    Version: 15.0.2
  3. And delete the ArgoCD app.

Expected behavior

The post-delete hook Job should be launched, and once it completes the app should be automatically deleted from ArgoCD.

Version

argocd: v2.11.0+d3f33c0
  BuildDate: 2024-05-07T16:01:41Z
  GitCommit: d3f33c00197e7f1d16f2a73ce1aeced464b07175
  GitTreeState: clean
  GoVersion: go1.21.9
  Compiler: gc
  Platform: linux/arm64

Logs

time="2024-05-17T09:44:28Z" level=error msg="Recovered from panic: runtime error: invalid memory address or nil pointer dereferencegoroutine 230 [running]:
runtime/debug.Stack()
    /usr/local/go/src/runtime/debug/stack.go:24 +0x64
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).processAppOperationQueueItem.func1()
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:890 +0x50
panic({0x2fb9ae0?, 0x6a9bf20?})
    /usr/local/go/src/runtime/panic.go:914 +0x218
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).executePostDeleteHooks(0x4000bee380, 0x401a639000, 0x40230d3200?, 0x4028b6af80?, 0x20?, 0x0?)
    /go/src/github.com/argoproj/argo-cd/controller/hook.go:101 +0x8f4
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).finalizeApplicationDeletion(0x4000bee380, 0x401a638c00, 0x36b3559?)
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:1168 +0x560
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).processAppOperationQueueItem(0x4000bee380)
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:926 +0x2c0
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).Run.func4()
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:831 +0x2c
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xae6fb19b55bdc225?)
    /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:157 +0x40
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x875c9889805965b5?, {0x49d7500, 0x4001575080}, 0x1, 0x40005b8240)
    /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:158 +0x90
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x4f1857488fcceb0a?, 0x3b9aca00, 0x0, 0xcc?, 0x231652ab9f10a9f1?)
    /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:135 +0x80
k8s.io/apimachinery/pkg/util/wait.Until(0x2997fcc5e506dda9?, 0x7c42a7c42a95252d?, 0xd8649a624c79c08a?)
    /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:92 +0x28
created by github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).Run in goroutine 70
    /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:830 +0x728
"
@ChristianCiach (Contributor)

What kind of hook are you using? A Job? Looking at the code, there is something wrong with the health check for the hook. Are you using custom health checks as Lua scripts?

@ChristianCiach (Contributor)

I think https://github.com/argoproj/argo-cd/blob/v2.11.0/controller/hook.go just misses necessary nil-checks after calling GetResourceHealth. Every other caller of GetResourceHealth (both in argo-cd and in gitops-engine) checks the returned health object for nil, but not the post-delete hook controller. The returned health object can indeed be nil.
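
For anyone curious, here is a minimal standalone sketch (my own illustration, not argo-cd code; passing nil as the health override stands in for "no custom Lua health checks configured") showing that GetResourceHealth returns a nil health object for a kind it has no check for:

    package main

    import (
        "fmt"

        "github.com/argoproj/gitops-engine/pkg/health"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    )

    func main() {
        // A ServiceAccount has no built-in health check in gitops-engine.
        sa := &unstructured.Unstructured{Object: map[string]interface{}{
            "apiVersion": "v1",
            "kind":       "ServiceAccount",
            "metadata":   map[string]interface{}{"name": "post-delete-hook"},
        }}

        // With no override configured, GetResourceHealth returns (nil, nil)
        // for kinds it does not know how to assess.
        hookHealth, err := health.GetResourceHealth(sa, nil)
        fmt.Println(hookHealth, err) // <nil> <nil>

        // Reading hookHealth.Status without a guard here is exactly the
        // nil pointer dereference from the stack trace above.
        if hookHealth == nil {
            fmt.Println("no health check registered for this kind")
        }
    }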

@ChristianCiach (Contributor) commented May 17, 2024

@joffreychambrin You can probably work around the bug by adding this to your argocd-cm ConfigMap:

  resource.customizations: |
    batch/Job:
      health.lua: |
        hs = {}
        hs.status = "Progressing"
        hs.message = ""
        if obj.status ~= nil then
          if obj.status.conditions ~= nil then
            for i, condition in ipairs(obj.status.conditions) do
              if condition.type == "Complete" and condition.status == "True" then
                hs.status = "Healthy"
                return hs
              end
              if condition.type == "Failed" and condition.status == "True" then
                hs.status = "Degraded"
                return hs
              end
            end
          end
        end
        return hs

Completely untested, and I would advise against it. The idea here is to guarantee a non-nil hookHealth object at

    if hookHealth.Status == health.HealthStatusProgressing {

in hook.go, to work around the missing nil check there.
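
And for reference, the actual fix in hook.go would presumably just add the same guard the other callers have. A rough sketch (isHookStillProgressing is a hypothetical helper of mine, not the real controller code, which also consults the configured resource overrides instead of passing nil):

    package hooks

    import (
        "github.com/argoproj/gitops-engine/pkg/health"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    )

    // isHookStillProgressing sketches the missing guard; the real hook.go
    // flow differs in structure.
    func isHookStillProgressing(hook *unstructured.Unstructured) bool {
        hookHealth, err := health.GetResourceHealth(hook, nil)
        if err != nil || hookHealth == nil {
            // Kinds without a health check (ServiceAccount, Role, ...)
            // report no health; treat them as finished rather than
            // dereferencing a nil *HealthStatus.
            return false
        }
        return hookHealth.Status == health.HealthStatusProgressing
    }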

@ChristianCiach (Contributor)

@joffreychambrin The unit test introduced in #16595 only tests this feature by using a bare Pod. So it would probably also work when you replace your Job with a bare Pod.

@ChristianCiach (Contributor)

@alexmt Maybe you want to know about this :)

@joffreychambrin (Author) commented May 20, 2024

@ChristianCiach Yes, the hook I am using is a Job. It is the default one from Teleport here: https://github.com/gravitational/teleport/blob/70ba6be2eac4f8e5c275ffac6246197ef798392a/examples/chart/teleport-kube-agent/templates/delete_hook.yaml#L47

I think your analysis is right: as you can see at the link above, several resources carry the post-delete hook: a ServiceAccount, a Role, a RoleBinding, and a Job. Not all of them have a health check, hence the need for a nil check on hookHealth.
