Hello.
I am trying to fine-tune a model using AutoTrain. At some point it fails with a CUDA out-of-memory error. Here is the full error message:
ERROR train has failed due to an exception:
ERROR Traceback (most recent call last):
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/autotrain/utils.py", line 280, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/autotrain/trainers/clm/main.py", line 168, in train
    model = AutoModelForCausalLM.from_pretrained(
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3480, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3870, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 751, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 98, in set_module_quantized_tensor_to_device
    new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(device)
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 179, in to
    return self.cuda(device)
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 157, in cuda
    w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics, quant_type=self.quant_type)
  File "/anaconda/envs/customenv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 816, in quantize_4bit
    out = torch.zeros(((n+1)//2, 1), dtype=torch.uint8, device=A.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 87.31 MiB is free. Including non-PyTorch memory, this process has 31.65 GiB memory in use. Of the allocated memory 31.17 GiB is allocated by PyTorch, and 128.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
How can I solve this error?
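The error message itself suggests setting max_split_size_mb to avoid fragmentation. My understanding (I may be wrong) is that this goes through the PYTORCH_CUDA_ALLOC_CONF environment variable and has to be set before the first CUDA allocation, roughly like the sketch below; the value 512 is just a placeholder I picked, and when launching through the autotrain CLI I assume it would instead be exported in the shell before running the command:

import os

# Must be set before the first CUDA allocation to take effect; 512 is only an
# illustrative value I chose, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch

x = torch.zeros(1, device="cuda")  # the caching allocator now honours the setting above

I am not sure this addresses the root cause though, since the error says GPU 0 has only 87.31 MiB free when the allocation fails.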
The compute instance I am using has 8 x NVIDIA Tesla V100 GPUs with 32 GB of VRAM each.
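Only GPU 0 appears in the error, so to double-check what PyTorch actually sees on this instance I would run something like the snippet below (nothing AutoTrain-specific, just plain torch):

import torch

# Sanity check: with 8 x V100 I would expect device_count() == 8 and roughly
# 32 GiB total per device before training starts.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, "
          f"{free / 2**30:.1f} GiB free / {total / 2**30:.1f} GiB total")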
The autotrain command being used is this:
autotrain llm --train --project_name myprojectname --model meta-llama/Llama-2-70b-hf --data_path my/data/path/on/hf --use_peft --use_int4 --learning_rate 2e-4 --train_batch_size 2 --num_train_epochs 3 --trainer sft --model_max_length 4096 --token my_token
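For context, my rough understanding of what --use_int4 does (an assumption on my part, not AutoTrain's actual code) is that it loads the checkpoint in 4-bit via transformers and bitsandbytes, roughly like this; the exact BitsAndBytesConfig values and the device_map="auto" setting are my guesses:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Rough sketch of what I assume --use_int4 translates to; the exact
# quantization settings AutoTrain uses may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",  # should let the shards spread across all visible GPUs
)

From the traceback it looks like the quantized weights end up on GPU 0 only, which is where the allocation fails.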
#nvidiainception