
Gluster disconnections on the client side, resulting in locks being released by client applications after restarting gluster nodes in a round-robin manner #4353

Open
murali437 opened this issue May 20, 2024 · 0 comments

Description of problem:
We have a 3-node gluster cluster in a replicated-arbiter configuration. We restart the gluster servers one at a time, waiting until the brick is back online and healing is complete before moving on to the next node. For the first two nodes, the reboot does not impact the clients. When the third node is rebooted, the client loses the locks it held and applications begin to fail. We have tried different orders of stopping the gluster nodes (replicated nodes followed by the arbiter, or the arbiter followed by the replicated nodes); the end result remains the same (the file system is not available to the clients). In the reproduction steps below, we stop glusterd and kill all glusterfs/glusterfsd processes to simulate a reboot of a gluster node.

We have attached the logs from each of the gluster server nodes (host05.tar.gz, host06.tar.gz, host07.tar.gz) and the gluster client (CliHost.tar.gz), along with a sample application (ClientApp.tar.gz) that performs the locking, monitors the acquired lock, fetches all parameters from the gluster server, etc. ClientApp.tar.gz contains the sample application code and a README explaining what it does and how to use it.

The exact command to reproduce the issue:
Step 1:
On the client side, execute:
   In the first window:
      ./lockfile
   In the second window:
      ./checkLck
See README for more information.

Step 2:
On Gluster Server side:
On node1:
systemctl stop glusterd
killall glusterfs glusterfsd
systemctl start glusterd
gluster volume status
gluster volume heal dr84dir_dirdata_gfs info

On node2:
systemctl stop glusterd
killall glusterfs glusterfsd
systemctl start glusterd
gluster volume status
gluster volume heal dr84dir_dirdata_gfs info

On node3:
systemctl stop glusterd
killall glusterfs glusterfsd
systemctl start glusterd
gluster volume status
gluster volume heal dr84dir_dirdata_gfs info
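Between node restarts we wait until heal info reports zero pending entries on every brick. That wait step can be sketched as follows; `pending_entries` and `wait_for_heal` are our own illustrative names, and the parsing assumes the `Number of entries:` lines shown in the heal-info output later in this report:

```python
import re
import subprocess
import time

def pending_entries(heal_info_output: str) -> int:
    """Sum the 'Number of entries:' counts across all bricks in heal-info output."""
    return sum(int(n) for n in
               re.findall(r"Number of entries:\s*(\d+)", heal_info_output))

def wait_for_heal(volume: str, interval: int = 10) -> None:
    """Poll 'gluster volume heal <vol> info' until no brick has pending heals."""
    while True:
        out = subprocess.run(
            ["gluster", "volume", "heal", volume, "info"],
            capture_output=True, text=True, check=True,
        ).stdout
        if pending_entries(out) == 0:
            return
        time.sleep(interval)
```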

The full output of the command that failed:
None of the commands failed, but the client starts receiving errors:
[2024-05-20 09:41:23.387247 +0000] I [MSGID: 114018] [client.c:2244:client_rpc_notify] 0-dr84dir_dirdata_gfs-client-2: disconnected from client, process will keep trying to connect glusterd until brick's port is available [{conn-name=dr84dir_dirdata_gfs-client-2}]
The message "W [MSGID: 114061] [client-common.c:797:client_pre_lk_v2] 0-dr84dir_dirdata_gfs-client-2: remote_fd is -1. EBADFD [{gfid=2d80904a-ad62-42d8-8856-6f0653a61f54}, {errno=77}, {error=File descriptor in bad state}]" repeated 18 times between [2024-05-20 09:40:51.321731 +0000] and [2024-05-20 09:41:27.354988 +0000]

The sample application reports the lock released when we shut down the third gluster node, as shown below:
./checkLck /opt/data/cloud_dir/host01.lock
My PID is : 3530805
2024-05-20-15:03:56 IST: File /opt/data/cloud_dir/host01.lock is locked by 3530796
2024-05-20-15:10:51 IST: File /opt/data/cloud_dir/host01.lock is unlocked. Lock Type: 2 Whence: 0 Start: 0 len: 0

2024-05-20-15:10:53 IST: File /opt/data/cloud_dir/host01.lock is unlocked. Lock Type: 2 Whence: 0 Start: 0 len: 0
2024-05-20-15:10:55 IST: File /opt/data/cloud_dir/host01.lock is unlocked. Lock Type: 2 Whence: 0 Start: 0 len: 0
....

Expected results:
The client continues to function normally without disruption: the mounted file system remains available to the applications, and locks held by processes remain held until those processes decide to release them.

Mandatory info:
- The output of the gluster volume info command:
Volume Name: dr84dir_dirdata_gfs
Type: Replicate
Volume ID: 0f91af99-291a-4a14-9010-a565b23a75d4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: host05.domain:/Dirdata/dr84dir_dirdata_brk
Brick2: host06.domain:/Dirdata/dr84dir_dirdata_brk
Brick3: host07.domain:/Dirdata/dr84dir_dirdata_brk (arbiter)
Options Reconfigured:
network.frame-timeout: 300
server.tcp-user-timeout: 42
cluster.eager-lock: off
cluster.server-quorum-type: server
cluster.self-heal-daemon: enable
network.tcp-window-size: 1048576
server.event-threads: 24
client.event-threads: 24
features.cache-invalidation: on
performance.parallel-readdir: on
performance.global-cache-invalidation: true
performance.cache-invalidation: true
performance.readdir-ahead: off
performance.stat-prefetch: off
performance.open-behind: off
performance.quick-read: off
performance.read-ahead: off
performance.lazy-open: no
performance.write-behind: off
performance.strict-o-direct: on
performance.flush-behind: off
performance.io-cache: off
cluster.consistent-metadata: yes
cluster.quorum-type: auto
cluster.data-self-heal-algorithm: full
transport.address-family: inet
performance.client-io-threads: off
cluster.server-quorum-ratio: 51

- The output of the gluster volume status command:
Status of volume: dr84dir_dirdata_gfs
Gluster process                                    TCP Port  RDMA Port  Online  Pid
-----------------------------------------------------------------------------------
Brick host05.domain:/Dirdata/dr84dir_dirdata_brk   49180     0          Y       251268
Brick host06.domain:/Dirdata/dr84dir_dirdata_brk   49243     0          Y       482164
Brick host07.domain:/Dirdata/dr84dir_dirdata_brk   49223     0          Y       206760
Self-heal Daemon on localhost                      N/A       N/A        Y       251306
Self-heal Daemon on host06                         N/A       N/A        Y       482235
Self-heal Daemon on host07                         N/A       N/A        Y       206790

Task Status of Volume dr84dir_dirdata_gfs

There are no active volume tasks

- The output of the gluster volume heal command:
Launching heal operation to perform index self heal on volume dr84dir_dirdata_gfs has been successful
Use heal info commands to check status.

gluster volume heal dr84dir_dirdata_gfs info

Brick host05.domain:/Dirdata/dr84dir_dirdata_brk
Status: Connected
Number of entries: 0

Brick host06.domain:/Dirdata/dr84dir_dirdata_brk
Status: Connected
Number of entries: 0

Brick host07.domain:/Dirdata/dr84dir_dirdata_brk
Status: Connected
Number of entries: 0

- Provide logs present on the following locations of client and server nodes:
Logs from client and server nodes are attached.
GlusterData.tar.gz

Server Logs:
GlusterData/host05.tar.gz
GlusterData/host06.tar.gz
GlusterData/host07.tar.gz
Client Logs:
GlusterData/CliHost.tar.gz

Client Application:
GlusterData/ClientApp.tar.gz

- Is there any crash? Provide the backtrace and coredump:
No

Additional info:

- The operating system / glusterfs version:

Operating System: RHEL 8.8

glusterd --version
glusterfs 11.1
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
...

glusterfs --version
glusterfs 11.1
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
...

