Shared VRAM on Linux --- super huge problem

Is there like a pre-alpha 580 driver I can test to verify whether system-backed VRAM on Linux does or doesn’t work … is this even being worked on? There is a major post on this with absolutely NO response from NVIDIA …

Intel and AMD can do it …

3 Likes

ANYBODY !?

1 Like

I don’t know why this isn’t a priority for them; the majority of people this affects are professional users, people in ML/AI fields, and for those fields Linux seems more popular.

1 Like

Absolute tragedy of customer care

1 Like

@dartefi

Is it possible to use AMD for, for example, LLMs? From what I know they don’t really need much compute at all, just VRAM.

Things like Stable Diffusion use full CUDA compute, I think, so it might not be easy to replace. A few hundred MB over and you hit an OOM. Big problem. Maybe AMD works for this too, but slower? I’d take a bit slower over not being able to compute at all.

Or if anyone else knows?

The perf of LLMs directly depends on the compute power of the GPUs they run on. Period.
The amount of VRAM limits the size of the models you are able to run, which, VERY roughly speaking, translates to the quality of the answers you will get. Also, the bigger the model (in terms of the number of layers/parameters), the (mostly) linearly slower it gets.
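The VRAM-vs-model-size relationship is easy to sanity-check on paper: weight memory scales linearly with parameter count and bytes per parameter. A rough sketch (my own illustration, not from any specific tool; real usage is higher because KV cache and activations come on top):

```python
def est_weight_vram_gib(n_params_billion: float, bits_per_param: float) -> float:
    """Rough VRAM needed just for the model weights, in GiB.

    Actual usage is higher: KV cache, activations and runtime
    overhead all come on top of this number.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 2**30

# A 7B model in FP16 needs ~13 GiB just for weights,
# while a 4-bit quantization fits the same weights in ~3.3 GiB.
print(round(est_weight_vram_gib(7, 16), 1))
print(round(est_weight_vram_gib(7, 4), 1))
```

This is also why being "a few hundred MB over" causes an OOM: the weight footprint is fixed for a given model and quantization, so either it fits or it doesn’t.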

Whether you can use a specific brand of GPU (be it AMD, NVIDIA, Intel, others) depends on which backends (CUDA, ROCm, Vulkan, etc.) your engine (ollama, llama.cpp, vLLM, etc.) supports, and on the specific model of your card: for example, Vulkan is supported by AMD’s consumer models, but the DC models only support ROCm.
Speaking specifically, both ollama and llama.cpp support both Vulkan and ROCm, so you can run LLMs with these engines on AMD cards without problems in most cases. The only problem I’m aware of is if you have a DC AMD model (say MI60 / MI100 / MI210) that does not support Vulkan, connected as an eGPU over Thunderbolt, because PCIe tunneling over Thunderbolt does not support the atomic operations that ROCm needs…
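The engine/backend compatibility question above boils down to a small lookup. A toy sketch of that matrix (simplified and based only on the claims in this thread; check each project’s docs for the authoritative, current list):

```python
# Illustrative engine -> backend matrix; simplified, not authoritative.
ENGINE_BACKENDS = {
    "llama.cpp": {"CUDA", "ROCm", "Vulkan", "Metal", "CPU"},
    "ollama":    {"CUDA", "ROCm", "Vulkan", "Metal", "CPU"},
    "vLLM":      {"CUDA", "ROCm", "CPU"},
}

def engines_for(backend: str) -> list[str]:
    """Which of the listed engines can use the given backend."""
    return sorted(e for e, b in ENGINE_BACKENDS.items() if backend in b)

print(engines_for("Vulkan"))  # e.g. an AMD consumer card
print(engines_for("ROCm"))    # e.g. an AMD DC card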

Whether AMD or NVIDIA will give you more tokens per second of course depends on the specific card models, and to make such comparisons fair you need to look at models from similar price categories, which is not easy due to different compute-power to VRAM ratios:

  • The RTX 5090 has the same amount of VRAM (32GB) as the Radeon PRO R9700; the 5090 is almost twice as fast and costs roughly twice as much.
  • The RTX 5080 has a similar price to the R9700 and they provide similar perf (the 5080 is slightly faster), but the 5080 only has 16GB of VRAM.

Of course, among the cards available on the consumer market, the RTX PRO 6000 is the undisputed king with its 96GB, but it costs roughly as much as 6-7 Radeon PRO R9700 cards…

2 Likes

When using LLMs in my setup, the only difference in speed seems to come from VRAM. Actual GPU usage itself doesn’t really change much and stays mostly under 50%. When processing the prompt before responding it goes to around 80%, but that lasts such a short time I would say it kind of doesn’t count. When responding it stays at only around 30-40%. I’m not using thinking for the model.

It makes me think maybe I just don’t have enough VRAM to make it use the full GPU compute power. It also kind of tells me raw compute isn’t as needed with LLMs. Using CPU only for some of the models was surprisingly fast.
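The low GPU utilization during generation matches the common explanation that single-stream token generation is memory-bandwidth-bound, not compute-bound: each new token requires reading essentially all the weights once, so tokens/s is roughly capped by bandwidth divided by weight size. A back-of-the-envelope sketch with hypothetical bandwidth numbers (my own illustration):

```python
def est_decode_tps(mem_bandwidth_gbs: float, weight_gib: float) -> float:
    """Rough upper bound on single-stream decode tokens/s, assuming
    each token needs one full pass over the weights and that memory
    bandwidth (not compute) is the bottleneck."""
    return mem_bandwidth_gbs * 1e9 / (weight_gib * 2**30)

# Hypothetical numbers: ~1000 GB/s for a high-end GPU vs ~60 GB/s for
# dual-channel system RAM, both serving 4-bit 7B weights (~3.5 GiB).
print(round(est_decode_tps(1000, 3.5)))  # GPU-class bandwidth
print(round(est_decode_tps(60, 3.5)))    # CPU/system-RAM bandwidth
```

This also explains why CPU-only inference can feel "surprisingly fast" for small models, and why prompt processing (which batches many tokens and is compute-bound) is the only phase that pushes GPU utilization up.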

As for shared VRAM, I think it was an issue with the program on Linux. Maybe on Linux the program needs to explicitly offload some of the model to RAM.
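Some engines do make that split explicit: llama.cpp, for instance, lets you choose how many layers go to the GPU (its `-ngl` option) while the rest stay in system RAM. A hypothetical sizing helper in that spirit (the uniform per-layer cost is an assumption; real layers vary in size):

```python
def max_gpu_layers(vram_gib: float, n_layers: int,
                   layer_gib: float, reserve_gib: float = 1.0) -> int:
    """How many transformer layers fit in VRAM while keeping
    `reserve_gib` free for KV cache and scratch buffers; the
    remaining layers would stay in system RAM (think: picking
    a value for llama.cpp's -ngl by hand)."""
    usable = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable // layer_gib))

# Hypothetical: an 8 GiB card and a 32-layer model at ~0.4 GiB/layer.
print(max_gpu_layers(8.0, 32, 0.4))   # partial offload
print(max_gpu_layers(20.0, 32, 0.4))  # whole model fits
```

Manually splitting this way trades speed for headroom, which is exactly the knob that automatic system-memory fallback would otherwise provide.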

The Stable Diffusion program I used worked fine on Windows but not on Linux under the same workload; it caused an OOM. I have since updated to something else and it works even better than what I used before. The old one was also an unmaintained version, which could be the main cause of the issues.

Thank you for taking time with your reply.