So I was training my first CNN with TensorFlow, but at the 7th epoch I got an error message that said "GPU sync failed". Looking for an answer, I found that it may be because my GPU ran out of memory (I've got an RTX 2060). That's quite possible, because I had left my PC alone training the CNN and came back to find my brother playing a game on it, which may have caused the lack of memory.
Beyond all of that, I have a few questions that arose after this episode:
Did the error message appear because I ran out of memory, or could it be caused by some other reason?
If I get the "GPU sync failed" error message, does that mean I'm using my GPU for the training process? Training was going very slowly, so I was starting to think it was running on the CPU.
If that error message does not mean I'm using my GPU, how can I check that?
Running out of device memory is a likely cause of the error you mention.
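If memory pressure is the culprit, one mitigation worth trying (assuming TensorFlow 2.x) is to enable memory growth, so TensorFlow allocates GPU memory on demand instead of reserving nearly all of it up front:

```python
import tensorflow as tf

# Must run before any op touches the GPU: switch from the default
# "grab almost all GPU memory" behavior to allocating it on demand.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```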
Yes, your GPU appears to be getting used. You can run `nvidia-smi` from the terminal to get an idea of device utilization. Depending on how you preprocess your data, it is possible that I/O and CPU preprocessing are bottlenecking the GPU. If that is the case, you will see low utilization percentages.
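You can also confirm device visibility and placement from inside TensorFlow itself (again assuming TF 2.x); running `nvidia-smi -l 1` in another terminal then lets you watch utilization while training runs:

```python
import tensorflow as tf

# An empty list here means TensorFlow cannot see the GPU and training
# is silently falling back to the CPU.
print(tf.config.list_physical_devices('GPU'))

# Optional: log which device (CPU/GPU) each op is actually placed on.
tf.debugging.set_log_device_placement(True)
```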
Hopefully when no games are being played on the machine, the error doesn't happen and there won't be a need for alternative explanations.
Collecting a profile with nvprof or Nsight Systems would be a good starting point for understanding potential bottlenecks and what you can do to speed up training. As mentioned above, the I/O/preprocessing pipeline is a common bottleneck. For that, make sure you are using the tf.data API to avoid handling input data directly in Python (see the sketch below). You might also consider using DALI to perform data preprocessing on the GPU itself. You can find examples of these approaches at https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/RN50v1.5
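To make the tf.data suggestion concrete, here is a rough sketch (assuming TF 2.x; the file paths, labels, image size, and batch size are placeholders) of a pipeline that decodes and resizes images inside the TF graph and overlaps preprocessing with GPU compute:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE on newer TF versions

# Placeholder lists; in practice these come from scanning your dataset directory.
file_paths = ["images/cat_0.jpg", "images/dog_0.jpg"]
labels = [0, 1]

def decode_and_resize(path, label):
    # Everything here runs inside the TF graph, so no per-image Python code
    # executes during training.
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])
    image = image / 255.0
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
           .shuffle(10_000)
           .map(decode_and_resize, num_parallel_calls=AUTOTUNE)
           .batch(64)
           .prefetch(AUTOTUNE))   # prepare the next batches while the GPU trains

# model.fit(dataset, epochs=10)
```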
Over the course of this week I did some research and was able to answer these questions, but others appeared.
First of all, I confirmed that I'm using my GPU with the nvidia-smi command, but GPU usage was really low (around 6 or 7%). So I set a parameter named "workers" to 16 (the number of my CPU's threads) in the fit method, and GPU usage rose to 20%. I can conclude that there is indeed a bottleneck, but I'm trying to find the other reasons that keep my GPU usage low. I mean, 20% is better than 6%, but it could be better.
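For context, the fit call now looks roughly like this (simplified, with `model` and `train_seq` standing in for my actual Keras model and `keras.utils.Sequence`):

```python
# `workers`/`use_multiprocessing` only apply to generator or
# keras.utils.Sequence inputs, not to tf.data datasets.
model.fit(
    train_seq,
    epochs=10,
    workers=16,               # parallel CPU workers preparing batches
    use_multiprocessing=True,
    max_queue_size=32,        # batches kept ready ahead of the GPU
)
```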
I want to ask if you can explain in detail what that "tf.data" you mentioned is, and where I can find more information about data preprocessing, because I read that loading all the images into memory before training would raise my GPU usage.
tf.data datasets allow you to express your preprocessing within the low-level TensorFlow computational graph and avoid unnecessary data copies and trips through the Python interpreter.
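If your dataset fits in RAM, `Dataset.cache()` gives you the "load everything into memory" behavior you mention without writing a custom loader; a small sketch, reusing the placeholder names from the pipeline above:

```python
# After the first epoch, cache() keeps the decoded images in memory,
# so later epochs skip disk reads and JPEG decoding entirely.
dataset = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
           .map(decode_and_resize, num_parallel_calls=AUTOTUNE)
           .cache()                      # hold decoded images in RAM
           .shuffle(10_000)
           .batch(64)
           .prefetch(AUTOTUNE))
```

The tf.data guide and the data performance guide on tensorflow.org cover these patterns in more depth.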
Thank you very much. It seems that the low GPU utilization is a matter of code rather than configuration, since the GPU is in fact being used.