GPU SYSMEM shared memory is slow - is it possible to use ReBAR (Resizable BAR) to fix performance?

Since NVIDIA allowed the use of system memory, it has been possible to avoid OOM problems during inference on large models that consume more than the GPU’s on-board memory. However, when training models this way, performance degrades severely. I imagine the way data is exchanged between system memory and GPU memory could be the problem. Could this be a bus problem? From what I have read, ReBAR does not help in this regard. Wouldn’t it be possible to use that same feature to speed up this data exchange during AI training?

This problem is analogous to an operating system swapping memory to mass storage when a program’s working set exceeds the available system memory: the throughput to system memory may be 100 GB/sec, but the throughput of non-rotational mass storage may be 5 GB/sec. Once swapping occurs, application performance drops significantly. I recently encountered a severe case on my Windows system, with total system memory oversubscribed by 1.6x: the GUI froze intermittently.
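To put a rough number on that drop: even a small fraction of accesses going to the slow tier dominates the effective throughput. A back-of-the-envelope estimate, using only the illustrative figures above:

```python
# Effective throughput when a fraction of traffic falls through to a slower
# tier (weighted harmonic mean). The 100 and 5 GB/sec figures are the
# illustrative numbers from the analogy above, not measurements.
fast_gb_s = 100.0   # system memory
slow_gb_s = 5.0     # non-rotational mass storage (swap)

for slow_fraction in (0.05, 0.20, 0.50):
    effective = 1.0 / ((1.0 - slow_fraction) / fast_gb_s + slow_fraction / slow_gb_s)
    print(f"{slow_fraction:4.0%} swapped -> ~{effective:5.1f} GB/sec effective")
```

With just 5% of traffic hitting swap, effective throughput is already cut roughly in half; at 20% it is down to about 21 GB/sec.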

In this case system memory is used as backing storage for GPU memory, with the added twist that the PCIe interconnect connecting the two represents the weakest link: While the GPU memory may have a throughput of 800 GB/sec and the system memory one of 100 GB/sec, a PCIe gen4 x16 link offers only about 25 GB/sec per direction.
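If you want to see where your particular link tops out, a simple host-to-device copy benchmark shows the ceiling quickly. A minimal sketch, assuming PyTorch and a CUDA-capable GPU (illustration only, not a rigorous benchmark):

```python
# Measure host-to-device copy throughput over PCIe with PyTorch.
import time
import torch

def h2d_bandwidth_gb_s(size_mb=512, pinned=True, iters=20):
    n = size_mb * 1024 * 1024
    src = torch.empty(n, dtype=torch.uint8, pin_memory=pinned)  # host buffer
    dst = torch.empty(n, dtype=torch.uint8, device="cuda")      # device buffer
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)   # copy from host to device
    torch.cuda.synchronize()                # wait for all copies to finish
    elapsed = time.perf_counter() - start
    return (n * iters) / elapsed / 1e9

print(f"pinned host -> device:   {h2d_bandwidth_gb_s(pinned=True):.1f} GB/sec")
print(f"pageable host -> device: {h2d_bandwidth_gb_s(pinned=False):.1f} GB/sec")
```

With pinned host memory a healthy gen4 x16 link typically reports somewhere in the low 20s of GB/sec; pageable memory comes in noticeably lower. The bandwidthTest program from the CUDA samples gives the same kind of numbers without involving a framework.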

This is a classical trade-off: You trade the ability to run models that require more memory than the GPU provides on board for a serious performance drop. The most you can do here is check that the GPU is indeed plugged into a PCIe 4.0 x16 connector. Every now and then people post in this forum about a low-throughput PCIe link, and it turns out they inadvertently plugged their GPU into an x4 or x8 connector. Systems typically mark the x16 connectors suitable for GPUs very clearly, both on the motherboard itself and in the documentation.
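As a quick software-side check of the negotiated link, you can read the current and maximum PCIe generation and width from NVML. A small sketch, assuming the nvidia-ml-py (pynvml) bindings are installed; nvidia-smi -q reports the same information:

```python
# Report the negotiated PCIe link for GPU 0 via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # first GPU
cur_gen   = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
cur_width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
max_gen   = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
print(f"current link: gen{cur_gen} x{cur_width}  (max supported: gen{max_gen} x{max_width})")
pynvml.nvmlShutdown()
```

Keep in mind that the reported current generation can drop while the GPU idles in a low power state, so read it while the GPU is under load; the width is the number to compare against x16.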

To solve your performance problem you could (1) use a GPU with significantly larger on-board memory ($$), or (2) switch to a high-end system that uses NVLink instead of PCIe as the interconnect between CPU and GPU ($$$$). You may want to research the availability of such systems for rent in the cloud.

Thank you for the response.

Yes, my card is in an x16 slot. All of those alternatives are too much $$ for me right now; for the moment the only thing I am trying to do is optimize my script for quantization, or otherwise reduce memory use during training so the model fits in GPU memory. But that is not easy either, because I don’t know whether it is possible to quantize diffusers models to FP8 during training. I only found FP8 support for transformer models with NVIDIA Transformer Engine, but if NVIDIA already has something like that for diffusion models, I would like to know how to quantize a fine-tuning run of an SDXL model.
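For reference, this is roughly the direction I am exploring to reduce memory during training. It is only a rough sketch, assuming a typical diffusers SDXL fine-tuning loop with the bitsandbytes package; I don’t know yet whether it is enough to make the model fit:

```python
# Rough sketch of memory-reduction options for SDXL fine-tuning with diffusers.
# The training_step arguments are placeholders for whatever the real loop provides.
import torch
import bitsandbytes as bnb                     # pip install bitsandbytes
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.enable_gradient_checkpointing()           # recompute activations, save memory
unet.to("cuda")

# 8-bit optimizer states instead of full-precision Adam moments
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=1e-5)

def training_step(latents, timesteps, text_embeds, added_cond_kwargs, target):
    # bf16 autocast keeps weights in fp32 but runs forward/backward in lower
    # precision (assumes an Ampere-or-newer GPU with bf16 support)
    with torch.autocast("cuda", dtype=torch.bfloat16):
        pred = unet(latents, timesteps,
                    encoder_hidden_states=text_embeds,
                    added_cond_kwargs=added_cond_kwargs).sample
        loss = torch.nn.functional.mse_loss(pred.float(), target.float())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```

Gradient checkpointing plus the 8-bit optimizer already saves some memory; what I still cannot find is how to bring FP8 into a training run like this.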

Sorry, I cannot provide advice on models since I do not work on AI. If you are using specific AI applications or AI middleware, you might want to use the support infrastructure for those to solicit advice on how to minimize resource requirements.

While GPUs typically offer a better performance/$ ratio compared to competing solutions, silver bullets generally do not exist: More performance still requires more $$.

Maybe this section could help me? Latest AI & Data Science/Deep Learning (Training & Inference) topics - NVIDIA Developer Forums