
Question about openvpn systemd unit file #485

Open
krm3 opened this issue Jan 17, 2024 · 9 comments


krm3 commented Jan 17, 2024

We use the openvpn packages for Debian bookworm from https://build.openvpn.net/debian/openvpn/. As we ran into #449 with 2.6.7 (thanks @patcable for reporting), we noticed that in the systemd unit file for openvpn, KillMode is set to 'process' rather than 'control-group'. As a result, after every segfault there were zombies left, and once TasksMax (set to 10) was reached, the openvpn service could not start again. That is why the segfault behaviour led to a complete openvpn service outage for us.

My question is: what is the reason that KillMode is set to 'process' here? The systemd manual page says: "Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources."
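For context, the settings under discussion live in the unit's [Service] section. A minimal sketch of the relevant lines (not the literal Debian file, which contains many more directives):

```ini
[Service]
# Shipped setting under discussion: only the main process is killed on stop,
# child processes are left running in the cgroup.
KillMode=process
# Task limit mentioned above; once reached, no new process can be spawned
# in the service's cgroup, so the service cannot restart.
TasksMax=10
```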

Thanks in advance for your explanation.


cron2 commented Jan 18, 2024 via email


krm3 commented Jan 26, 2024

> This does not sound like what should happen. If OpenVPN crashes and has current child processes (such as for an auth plugin, or anything else), these should be re-parented to systemd, and no zombies should ever appear. Zombie processes occur when the parent process is still alive and not properly calling wait() on its child processes - but if the parent process dies (SIGSEGV), this scenario cannot happen.

I think "zombie" was the wrong term. I think the processes were still parented to systemd. I will try to reproduce this with 2.6.7, investigate further, and then report back.

Klara


krm3 commented Jan 29, 2024

We have four openvpn services on one node (udp/ipv6, udp/ipv4, and the same pair pushing split routes instead of a default route). On 2024-01-26 01:31 I installed 2.6.7 again and started the services. Soon after, segfaults must have happened (although I see nothing in the logs). When I looked at the services about 7 hours later, the timestamps showed that the services had restarted, and one service already had 7 tasks (the normal state is 2). I deactivated the node in the load balancer so no new sessions could be established. Recent excerpt from systemctl status:

● openvpn@tun6u.service - OpenVPN connection to tun6u
     Active: active (running) since Fri 2024-01-26 11:53:39 CET; 3 days ago
      Tasks: 3 (limit: 10)
● openvpn@tun4u.service - OpenVPN connection to tun4u
     Active: active (running) since Fri 2024-01-26 08:16:37 CET; 3 days ago
      Tasks: 7 (limit: 10)
● openvpn@tun6us.service - OpenVPN connection to tun6us
     Active: active (running) since Fri 2024-01-26 01:31:13 CET; 3 days ago
      Tasks: 2 (limit: 10)
● openvpn@tun4us.service - OpenVPN connection to tun4us
     Active: active (running) since Fri 2024-01-26 01:31:13 CET; 3 days ago
      Tasks: 2 (limit: 10)

Whole output for openvpn@tun4u.service:

● openvpn@tun4u.service - OpenVPN connection to tun4u
     Loaded: loaded (/lib/systemd/system/openvpn@.service; enabled; preset: enabled)
     Active: active (running) since Fri 2024-01-26 08:16:37 CET; 3 days ago
       Docs: man:openvpn(8)
             https://community.openvpn.net/openvpn/wiki/Openvpn24ManPage
             https://community.openvpn.net/openvpn/wiki/HOWTO
   Main PID: 11345 (openvpn)
     Status: "Initialization Sequence Completed"
      Tasks: 7 (limit: 10)
     Memory: 11.6M
        CPU: 15min 45.190s
     CGroup: /system.slice/system-openvpn.slice/openvpn@tun4u.service
             ├─ 1338 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
             ├─ 7956 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
             ├─ 8327 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
             ├─10482 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
             ├─11117 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
             ├─11345 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
             └─11347 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid

I think this is not what should happen. Once the task limit is reached, the service cannot start again. This is what happened to us after we upgraded to 2.6.7.


cron2 commented Jan 29, 2024

Are you using a plugin in your configs? If yes, which one?

Having "Tasks: 2" in steady state is unusual in itself, but it is normal when using plugin-auth-pam (for example), because that plugin forks to keep root privileges (and handle deferred auth).

So I guess a plugin bug is involved: the helper does not notice when OpenVPN dies, and thus never exits. So it is not a zombie in the Unix sense ("a process that has already exited, with no parent calling wait() to reap its status"), because a zombie would no longer have a command line visible.

So we should see whether this plugin bug can be fixed (and of course make sure OpenVPN won't SIGSEGV again...) - but that said, it does make sense for systemd to kill all child processes as well in this case.

Depending on the source of the Debian unit file, it may not be on us (upstream) to fix it... I'll ping the Debian maintainer for his opinion.


krm3 commented Jan 30, 2024

Yes, we are using plugin-auth-pam. Thank you very much, that makes sense and sounds good to me.


bernhardschmidt commented Jan 31, 2024

Debian maintainer here. You are using openvpn@.service, which is a unit shipped only by Debian, but the upstream-provided openvpn-server@.service has the same issue. I agree that we should probably just change the KillMode. However, I'm not sure why the processes are stuck here at all. I have only seen that with DCO when the kernel module hung, and in that case changing the KillMode will probably not help you (the processes are unkillable).

Can you kill the processes manually by PID? Does it help to locally override KillMode=control-group (the default)?


cron2 commented Jan 31, 2024

I'm fairly sure this is a bug / misfeature in plugin-auth-pam: it forks, and the two halves talk via a socketpair, but I'm not sure the forked helper ever notices when the parent goes away. So it should be killable just fine, but I'll look into fixing this.
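To illustrate the mechanism (a hypothetical sketch in Python, not the plugin's actual C code): when the two halves of a forked pair talk over a socketpair, the helper can detect the parent's death for free, because a blocking recv() on the socket returns EOF once every copy of the peer's end has been closed - including the implicit close when the parent crashes.

```python
import os
import socket

def helper_loop(sock):
    """Stand-in for the plugin's forked helper: block on the socketpair
    and exit as soon as the peer (the OpenVPN main process) goes away."""
    while True:
        data = sock.recv(64)
        if not data:
            # recv() returned b'': the peer closed its end, i.e. the
            # parent is gone -- exit instead of lingering in the cgroup.
            return "parent gone"
        # ... a real helper would handle an auth request here ...

parent_sock, helper_sock = socket.socketpair()

pid = os.fork()
if pid == 0:
    # Stand-in for the OpenVPN main process: it "crashes" (exits) without
    # sending any shutdown message; its socket ends are closed implicitly.
    os._exit(0)

# Helper side: close the end we don't use, so EOF can be observed once
# the parent's copy is gone too.
parent_sock.close()
result = helper_loop(helper_sock)
os.waitpid(pid, 0)
print(result)
```

The same pattern applies to the C plugin: a read() or recv() on the control socketpair returning 0 is the signal to clean up and exit.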

I do wonder whether changing the KillMode has a possible drawback in the general case, such as a sub-process failing to clean up "something" when killed by systemd instead of being signalled by OpenVPN. I don't know of anything concrete, though.


krm3 commented Jan 31, 2024

> Can you kill the processes manually by PID?

Yes, it works:

root@ovpn-l3-mgmt-110:~# ps waux | grep openvpn | grep tun4u | grep -v tun4us
root        1338  0.0  0.0  13344  4264 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root        7956  0.0  0.0  13344  4160 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root        8327  0.0  0.0  13344  4232 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root       10482  0.0  0.0  13344  4244 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root       11117  0.0  0.0  13344  4212 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
openvpn    11345  0.2  0.0  15560 11448 ?        Ss   Jan26  17:34 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root       11347  0.0  0.0  13344  4312 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root@ovpn-l3-mgmt-110:~# kill 1338
root@ovpn-l3-mgmt-110:~# ps waux | grep openvpn | grep tun4u | grep -v tun4us
root        7956  0.0  0.0  13344  4160 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root        8327  0.0  0.0  13344  4232 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root       10482  0.0  0.0  13344  4244 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root       11117  0.0  0.0  13344  4212 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
openvpn    11345  0.2  0.0  15560 11448 ?        Ss   Jan26  17:35 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root       11347  0.0  0.0  13344  4312 ?        S    Jan26   0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid

> Does it help to locally override KillMode=control-group (the default)?

Just implemented this on another node. We will see if the number of tasks remains 2.
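For anyone wanting to do the same, the local override can be sketched as a systemd drop-in (paths as on Debian; adjust the instance name to your own):

```ini
# /etc/systemd/system/openvpn@.service.d/override.conf
# (created e.g. via "systemctl edit openvpn@.service",
#  then "systemctl daemon-reload" and a service restart)
[Service]
KillMode=control-group
```

This keeps the rest of the shipped unit intact and only replaces the KillMode, so all processes in the service's cgroup are killed when the service stops or is restarted.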


krm3 commented Feb 1, 2024

> Does it help to locally override KillMode=control-group (the default)?

> Just implemented this on another node. We will see if the number of tasks remains 2.

Seems to help. The number of tasks is still 2 for all services, although the services have obviously been restarted, i.e. segfaults have occurred.
