Issues with VRAM allocation while fine-tuning an LLM

Hello. I am having trouble with VRAM allocation while fine-tuning an LLM. My system has two NVIDIA GeForce RTX 5070 GPUs. Both are recognized and work as expected for graphics rendering, gaming, and LLM inference, including VRAM allocation, but during training the VRAM is not allocated as expected. Some basic information about my environment: I am using PyTorch with NCCL for multi-GPU communication, nvidia-smi shows VRAM utilization on both GPUs, and I am attempting both data parallelism and model parallelism. The installed driver version is 570.169 and the installed CUDA version is 12.8.

Any and all help is appreciated.

You may get more attention by posting here: CUDA - NVIDIA Developer Forums
You will need to provide far more detail, though.
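
As a starting point, here is a minimal sketch of a script that gathers the kind of detail usually asked for in such a post: the PyTorch version, the CUDA version PyTorch was built against, the architectures the build supports, NCCL availability, and per-GPU memory figures. The function name collect_gpu_report and the report keys are hypothetical, chosen only for illustration. The script assumes nothing about your training code, uses only standard PyTorch calls, and returns a nearly empty report if PyTorch or CUDA is not available.

```python
import importlib.util
import platform


def collect_gpu_report():
    """Return a dict describing the PyTorch/CUDA environment (hypothetical helper)."""
    report = {
        "python": platform.python_version(),
        "torch_installed": False,
        "cuda_available": False,
        "gpus": [],
    }
    # Skip everything GPU-related if PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch
    import torch.distributed as dist

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version PyTorch was built with (None for CPU-only builds).
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["nccl_available"] = dist.is_available() and dist.is_nccl_available()

    if report["cuda_available"]:
        # GPU architectures this PyTorch build was compiled for.
        report["arch_list"] = torch.cuda.get_arch_list()
        for i in range(torch.cuda.device_count()):
            free_b, total_b = torch.cuda.mem_get_info(i)
            report["gpus"].append({
                "index": i,
                "name": torch.cuda.get_device_name(i),
                "capability": torch.cuda.get_device_capability(i),
                "free_mib": free_b // 2**20,
                "total_mib": total_b // 2**20,
                # Memory held by tensors / by the caching allocator in this process.
                "allocated_mib": torch.cuda.memory_allocated(i) // 2**20,
                "reserved_mib": torch.cuda.memory_reserved(i) // 2**20,
            })
    return report


if __name__ == "__main__":
    for key, value in collect_gpu_report().items():
        print(f"{key}: {value}")
```

Running this once before training starts and once from inside the training process, and pasting both outputs together with the exact launch command and the full error message or observed behavior, should give readers much more to work with.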