Issues with VRAM allocation while fine-tuning an LLM

Hello. I am running into issues with VRAM allocation while fine-tuning an LLM. My system has two NVIDIA GeForce RTX 5070 GPUs. Both GPUs are recognized and work as expected for graphics rendering, gaming, and LLM inference, but during training VRAM is not being allocated the way I expect.

Some basic info on my environment:

- Framework: PyTorch, using NCCL as the distributed communication backend
- Parallelism: attempting both data parallelism and model parallelism
- Both GPUs show VRAM utilization in nvidia-smi
- Driver version: 570.169
- CUDA version: 12.8
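For reference, here is a minimal, simplified sketch of the data-parallel part of the setup I am attempting (not my exact training script; the model, data, and file name `train_sketch.py` are placeholders), launched with torchrun across both GPUs:

```python
# Minimal DDP sketch (placeholders only), launched with:
#   torchrun --nproc_per_node=2 train_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Placeholder layer standing in for the LLM being fine-tuned.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # Dummy batch; the real script uses a DataLoader with a DistributedSampler.
        x = torch.randn(8, 4096, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Per-GPU memory as seen by PyTorch's caching allocator.
        alloc = torch.cuda.memory_allocated(local_rank) / 2**20
        reserved = torch.cuda.memory_reserved(local_rank) / 2**20
        print(f"rank {dist.get_rank()} step {step}: "
              f"allocated {alloc:.0f} MiB, reserved {reserved:.0f} MiB")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Note that the allocated/reserved numbers reported by PyTorch will be lower than what nvidia-smi shows, since the CUDA context and NCCL buffers live outside the caching allocator.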

Any & all help is appreciated.

you may get more attention posting here: CUDA - NVIDIA Developer Forums
you will need to provide waaay more details though…
