cudnn64_8.dll is causing an OOM error

So I’m currently working with GPT-2 running on TensorFlow for text generation; I’m working with this repo specifically. I recently decided to install CUDA and cuDNN to improve GPU performance and installed them via these instructions. I’m on Windows 10 x64 with an NVIDIA GeForce GTX 1650 GPU, and I’m using the Command Prompt terminal. I followed the instructions as best I could: downloaded the right GPU driver, set the environment variables, copied the cuDNN files where they should go, etc. When I finished installing, I tried to generate an unconditional sample with the model I trained and got an OOM error (I would show it here, but it’s super long).

Assuming that I had installed something incorrectly, I messed around with some of the CUDA files. I found that when I removed cudnn64_8.dll from C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin (where I was told to copy it) and then ran an unconditional sample, GPT-2 worked just fine and was able to generate some text. All the other cuDNN files were still in their CUDA directories.

Another thing I tried was adding TF_GPU_ALLOCATOR=cuda_malloc_async to the environment variables to see if that would fix it. I didn’t get an OOM error like last time, but the program still exited without generating anything:

Microsoft Windows [Version 10.0.19043.1288]
(c) Microsoft Corporation. All rights reserved.

C:\Users\"username">cd C:\Users\"username"\Desktop\gpt-2-finetuning\src

C:\Users\"username"\Desktop\gpt-2-finetuning\src>python generate_unconditional_samples.py --model_name novel
2021-10-17 15:20:12.172740: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-17 15:20:12.681534: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:215] Using CUDA malloc Async allocator for GPU: 0

C:\Users\"username"\Desktop\gpt-2-finetuning\src>
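(For reference, the per-session equivalent of that environment variable in cmd, rather than setting it system-wide, would be something like this before re-running the script:)

set TF_GPU_ALLOCATOR=cuda_malloc_async
python generate_unconditional_samples.py --model_name novel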

What exactly is going on here? Why would cuDNN be eating up my GPU memory like this?

Hi @jamisonworten,
The repo you are following states that it needs slightly less than 12 GB of GPU memory to run. The GTX 1650 you are using has only about 4 GB of VRAM, which is significantly less. Removing cudnn64_8.dll presumably prevents TensorFlow from using the GPU at all, so generation falls back to the CPU and the memory limit never bites; with cuDNN in place, the model simply does not fit on your card.
The repo documentation also states in its first paragraph that OOM is a known issue. We suggest you follow the repo documentation and its GitHub issues page for better assistance, as there may also be options in the repo to limit memory usage.
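If the scripts you are running go through TensorFlow 2.x (the cuda_malloc_async line in your log suggests they do), one generic option, separate from anything the repo itself provides, is to limit how much GPU memory TensorFlow may claim before the model is built. This is only a minimal sketch, assuming you can add it near the top of the repo's sampling script:

import tensorflow as tf

# Sketch only: this must run before the model graph/session is created.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Option A: allocate GPU memory on demand instead of reserving it all up front
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # Option B (use instead of A): hard-cap TensorFlow at ~3 GB on this GPU
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=3072)])

Even with that, a model that wants close to 12 GB may simply not fit on a 4 GB card, so a smaller GPT-2 model size or CPU generation may be the practical route.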

Thanks!