Memory usage is low #626
Comments
What exactly is the problem with "faster"? Is the memory consumption you are reporting on the host or on the device? It would be odd if it were the device. Can you share your training script?
Sorry for the confusion. My code uses a very simple transformer encoder layer plus some linear layers. I originally trained it on a Xeon 6146 with 24 cores, but that takes a very long time. I was hoping that an A770 GPU would be much faster (5x or more), but it turns out not to be faster at all. As for resource usage, in intel_gpu_top the "Blitter" usage for the python3 process sits at around 10%, which does not look right to me.
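For context, here is a minimal sketch of the kind of IPEX/XPU training setup being described. This is not the reporter's actual script: the model shape, batch size, and synthetic data are illustrative placeholders, and it assumes a working intel_extension_for_pytorch install with XPU support.

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

# Placeholder model: one transformer encoder layer plus a linear head,
# standing in for the (unshared) training script.
model = nn.Sequential(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    nn.Linear(512, 10),
).to("xpu")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ipex.optimize applies device-specific optimizations; with an optimizer
# passed in, it returns the (possibly rewritten) model and optimizer.
model, optimizer = ipex.optimize(model, optimizer=optimizer)

criterion = nn.CrossEntropyLoss()
for step in range(100):
    x = torch.randn(64, 128, 512, device="xpu")   # (batch, seq, features)
    y = torch.randint(0, 10, (64,), device="xpu")
    optimizer.zero_grad()
    logits = model(x).mean(dim=1)                  # pool over the sequence
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
```

If a setup like this still shows near-zero GPU utilization, the usual suspects are data staying on the CPU or the per-step workload being too small to keep the device busy.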
May I ask why you have to use Fedora 38?
I don't have to use Fedora 38, but I installed it since it is compatible with RHEL. I have to pick one among RHEL, SUSE, and CentOS.
In case it's helpful, here is the memory summary:
[memory summary attachment not captured in this text]
I tried Aliyun OS and it runs oneAPI and IPEX fine. I think Aliyun OS is CentOS-based.
You could try running your code under Intel VTune to see CPU/GPU compute and memory usage and to find possible bottlenecks.
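If it helps, PyTorch can emit ITT annotations that VTune correlates with its timeline. A minimal sketch, assuming a recent PyTorch where torch.autograd.profiler.emit_itt and torch.profiler.itt are available; train_one_step is a hypothetical stand-in for one forward/backward/update:

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

# Inside emit_itt(), PyTorch emits ITT task markers that VTune can display;
# range_push/range_pop name each profiled region on the timeline.
with torch.autograd.profiler.emit_itt():
    for step in range(10):
        torch.profiler.itt.range_push(f"train_step_{step}")
        train_one_step()  # hypothetical helper: one forward/backward/update
        torch.profiler.itt.range_pop()
```

The script would then be launched under VTune, e.g. with a GPU analysis such as `vtune -collect gpu-hotspots -- python3 train.py`.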
Thanks. I'll try.
Oh, I use my own machine for this task. My problem is not that it doesn't run; it's that the results are suspicious.
We have not verified on Fedora 38.
Describe the issue
After two weeks of struggle, I finally got my A770 working on Fedora 38.
But training seems barely faster than on my 24-core CPU machine.
I tried increasing the batch size, but memory consumption stayed the same at 1.7 GB. Is that expected?
What can I do to improve training performance and increase memory usage?
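One way to sanity-check whether device memory actually scales with batch size is to read the allocator counters directly. A minimal probe, assuming IPEX exposes the torch.xpu memory accounting APIs (memory_allocated, max_memory_allocated, synchronize); the tensor shapes are arbitrary:

```python
import time
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

def probe(batch_size):
    # One synthetic matmul "step" at the given batch size.
    x = torch.randn(batch_size, 128, 512, device="xpu")
    w = torch.randn(512, 512, device="xpu")
    torch.xpu.synchronize()
    t0 = time.time()
    y = (x @ w).relu()
    torch.xpu.synchronize()
    mib = 1024 * 1024
    print(f"batch={batch_size:4d}  step={time.time() - t0:.4f}s  "
          f"allocated={torch.xpu.memory_allocated() / mib:7.1f} MiB  "
          f"peak={torch.xpu.max_memory_allocated() / mib:7.1f} MiB")

for bs in (16, 64, 256):
    probe(bs)
```

If allocated/peak grow with batch size here while the externally observed 1.7 GB does not, the 1.7 GB figure is likely host-side memory rather than device memory.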