
Memory usage spikes on appliance with 100+ providers #22900

Open
jaywcarman opened this issue Feb 15, 2024 · 6 comments

@jaywcarman
Member

Issue Description

Memory usage periodically spikes until the appliance runs out of memory. Eventually something crashes and a bunch of workers are killed; things start back up and the process repeats.

Memory used (%) since last reboot on 2024-01-31 until 2024-02-15:

[memory usage chart]

Environment

  • Appliance running in IBM Cloud VPC (16 vCPU | 160 GiB | 32 Gbps)
  • VERSION=4.2.0 (IM)
  • BUILD=20230913103031_b15dc60
  • Providers:
vmdb_production=# SELECT COUNT(*),type 
vmdb_production-# FROM ext_management_systems 
vmdb_production-# WHERE type LIKE '%::CloudManager' OR type LIKE '%::InfraManager'
vmdb_production-# GROUP BY type;
 count |                               type                               
-------+------------------------------------------------------------------
    57 | ManageIQ::Providers::IbmCloud::PowerVirtualServers::CloudManager
    48 | ManageIQ::Providers::IbmCloud::VPC::CloudManager
     1 | ManageIQ::Providers::IbmPowerHmc::InfraManager
     3 | ManageIQ::Providers::IbmPowerVc::CloudManager
     1 | ManageIQ::Providers::Openstack::CloudManager
     1 | ManageIQ::Providers::Openstack::InfraManager
(6 rows)

Logs

I saved off journalctl entries for the following events. The log files are too large to upload to GitHub (and they may contain sensitive information), but I can provide access to them as needed.
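
(For reference, journalctl can capture a fixed time window like these saves with its time filters, e.g.:

$ journalctl --since "2024-02-08 10:55:00" --until "2024-02-08 13:05:00" > journalctl_20240208105500-20240208130500.txt

adjusting the timestamps for each window.)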

Memory usage spikes ~10% on 2024-02-08 11:10

[memory usage chart]

journalctl save: journalctl_20240208105500-20240208130500.txt

Memory usage drop on 2024-02-14 00:10

[memory usage chart]

journalctl save: journalctl_20240214000000-20240214003000.txt

Memory usage spikes ~10% on 2024-02-15 12:55

[memory usage chart]

journalctl save: journalctl_20240214125000-20240214130500.txt
top_output save: top_output_20240215.log

From the top output it looks like used stays flat while buff/cache increases:

$  grep -n "^MiB Mem" top_output_20240215.log
~~snip~~
488291:MiB Mem : 160910.6 total,  37859.1 free, 104039.2 used,  19012.3 buff/cache
489176:MiB Mem : 160910.6 total,  37666.9 free, 104048.2 used,  19195.5 buff/cache
490071:MiB Mem : 160910.6 total,  34829.9 free, 104047.5 used,  22033.1 buff/cache
490963:MiB Mem : 160910.6 total,  31841.1 free, 104259.1 used,  24810.4 buff/cache
491862:MiB Mem : 160910.6 total,  29232.8 free, 104038.6 used,  27639.2 buff/cache
492757:MiB Mem : 160910.6 total,  26424.3 free, 104029.0 used,  30457.3 buff/cache
493650:MiB Mem : 160910.6 total,  23600.8 free, 104035.0 used,  33274.8 buff/cache
494550:MiB Mem : 160910.6 total,  21883.7 free, 103876.1 used,  35150.8 buff/cache
495446:MiB Mem : 160910.6 total,  21620.7 free, 103817.4 used,  35472.6 buff/cache
~~snip~~
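
If it helps to confirm whether that buff/cache growth is actually reclaimable, one option (assuming root access on the appliance; this only drops clean page cache, dentries, and inodes, but the caches will need to re-warm afterwards) is:

$ sync
$ echo 3 > /proc/sys/vm/drop_caches
$ free -m

and then watch whether free recovers while used stays put.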
@jaywcarman jaywcarman added the bug label Feb 15, 2024
@kbrock
Member

kbrock commented Feb 15, 2024

I wonder if this is due to supports_feature=true in our API queries. Those force loading all providers into memory.

Typically there are not that many providers, so this slides through. But once there are enough providers, forcing the filtering out of the db and into RAM starts to show up.
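
Roughly, the difference looks like this (a sketch of the pattern, not the actual API code; :some_feature is a placeholder):

# Filtering on a Ruby-level capability check has to instantiate every provider first:
ExtManagementSystem.all.select { |ems| ems.supports?(:some_feature) }
# whereas a SQL-level filter keeps the work in the database:
ExtManagementSystem.where("type LIKE ?", "%::CloudManager")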

Question:

Do we know what the server was doing when it ran out of memory?

@jaywcarman
Member Author

Do we know what the server was doing when it ran out of memory?

@kbrock I just shared an internal box folder that contains a journalctl save that includes the "crash" - journalctl_20240214000000-20240214003000.txt at around Feb 14 00:10:00.

The first OOM messages I see are related to the DB connection:

$  grep -n OutOfMemory journalctl_20240214000000-20240214003000.txt
1100:Feb 14 00:02:30 rdr-cloud-usage.com evm[2580]:  WARN -- evm: MIQ(VmdbDatabaseConnection.log_statistics) Unable to log stats, 'PG::OutOfMemory: ERROR:  out of memory
1184:Feb 14 00:04:30 rdr-cloud-usage.com production[1853442]: ERROR -- production: There was an exception - ActiveRecord::StatementInvalid(PG::OutOfMemory: ERROR:  out of memory
1246:Feb 14 00:04:30 rdr-cloud-usage.com api[3789]: ERROR -- api: MIQ(Api::AuthController.api_error) ActiveRecord::StatementInvalid: PG::OutOfMemory: ERROR:  out of memory
1248:Feb 14 00:04:30 rdr-cloud-usage.com production[3789]: FATAL -- production: Error caught: [ActiveRecord::StatementInvalid] PG::OutOfMemory: ERROR:  out of memory
1979:Feb 14 00:07:41 rdr-cloud-usage.com api[807264]: ERROR -- api: MIQ(Api::AuthController.api_error) ActiveRecord::StatementInvalid: PG::OutOfMemory: ERROR:  out of memory
1986:Feb 14 00:07:41 rdr-cloud-usage.com production[807264]: FATAL -- production: Error caught: [ActiveRecord::StatementInvalid] PG::OutOfMemory: ERROR:  out of memory
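
(Since these are PG::OutOfMemory errors, the allocation failure appears to be happening in the PostgreSQL backend itself rather than in a Rails worker. If it's useful, the relevant server-side memory settings can be checked with something like:

vmdb_production=# SHOW shared_buffers;
vmdb_production=# SHOW work_mem;
vmdb_production=# SHOW max_connections;

though whether those are actually the culprit here is just a guess.)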

@agrare
Member

agrare commented Feb 15, 2024

From the top output it looks like used stays flat while buff/cache increases:

This is extremely odd; memory used by the buffer cache shouldn't cause an OOM since it is supposed to be freeable at any time. If there was an OOM while memory was still available, that is really weird.

One thing that comes to mind: what does the filesystem usage look like here? If you're using mmap'd memory pages backed by files and you run out of space on the filesystem, it'll actually trigger a memory fault rather than an ENOSPC.

@jaywcarman
Member Author

jaywcarman commented Feb 15, 2024

From the top output it looks like used stays flat while buff/cache increases:

This is extremely odd; memory used by the buffer cache shouldn't cause an OOM since it is supposed to be freeable at any time. If there was an OOM while memory was still available, that is really weird.

I might have confused things a bit here - the top_output.log doesn't include data from when the OOM occurred, just from a subsequent memory usage jump.

what is the filesystem usage looking like here?

[root@rdr-cloud-usage ~]# df -h
Filesystem                              Size  Used Avail Use% Mounted on
devtmpfs                                 79G     0   79G   0% /dev
tmpfs                                    79G   41M   79G   1% /dev/shm
tmpfs                                    79G  1.9M   79G   1% /run
tmpfs                                    79G     0   79G   0% /sys/fs/cgroup
/dev/mapper/vg_system-lv_os              11G  6.5G  4.1G  62% /
/dev/vda1                              1014M  216M  799M  22% /boot
/dev/mapper/vg_system-lv_home          1014M   47M  968M   5% /home
/dev/mapper/vg_system-lv_var             12G  2.4G  9.7G  20% /var
/dev/mapper/vg_pg-lv_pg                 500G   26G  475G   6% /var/lib/pgsql
/dev/mapper/vg_system-lv_var_log         21G   14G  7.1G  67% /var/log
/dev/mapper/vg_system-lv_var_log_audit  507M   66M  441M  13% /var/log/audit
/dev/mapper/vg_system-lv_tmp           1014M   40M  975M   4% /tmp
tmpfs                                    16G     0   16G   0% /run/user/0

@agrare
Member

agrare commented Feb 15, 2024

One other thing that comes to mind: I've heard of some issues on Linux (specifically with ZFS on Linux, since ZFS makes heavy use of the in-memory ARC cache) where cached memory can't be freed "fast enough" to prevent an OOM.

This assumes there was still available memory when the OOM hit, which we don't know.
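
If the kernel OOM killer was involved, its report would tell us, since it logs the memory state and per-process usage at the moment it fires; something along these lines should find it in the saved window:

$ journalctl -k --since "2024-02-14 00:00" --until "2024-02-14 00:30" | grep -iE "out of memory|killed process"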

@miq-bot
Member

miq-bot commented Jun 3, 2024

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.
