So I was training my first CNN with TensorFlow, but at the 7th epoch I got an error message that said "GPU sync failed". Looking for an answer, I found that it may be because my GPU ran out of memory (I've got an RTX 2060). That's quite possible, because I had left my PC alone training the CNN and came back to find my brother playing a game on it, which may have caused the lack of memory.
Beyond all of that, I have a few questions that arose after this episode:
Did the error message appear because I ran out of memory, or could it be caused by some other reason?
If I get the "GPU sync failed" error message, does that mean I'm using my GPU for the training process? Training was going very slowly, so I was starting to think it was running on the CPU.
If that error message does not mean I'm using my GPU, how can I check that?
Running out of device memory is a likely cause of the error you mention.
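If memory pressure is the culprit, one mitigation worth trying (assuming TensorFlow 2.x) is to enable memory growth, so TensorFlow allocates GPU memory on demand instead of reserving nearly all of it up front:

```python
import tensorflow as tf

# Must run before any op touches the GPU: switch from the default
# "grab almost all GPU memory" behavior to allocating it on demand.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```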
Yes, your GPU appears to be getting used. You can run `nvidia-smi` from the terminal to get an idea of device utilization. Depending on how you preprocess your data, it is possible that I/O and CPU preprocessing are bottlenecking the GPU. If that is the case, you will see low utilization percentages.
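You can also confirm device visibility and placement from inside TensorFlow itself (again assuming TF 2.x); running `nvidia-smi -l 1` in another terminal then lets you watch utilization while training runs:

```python
import tensorflow as tf

# An empty list here means TensorFlow cannot see the GPU and training
# is silently falling back to the CPU.
print(tf.config.list_physical_devices('GPU'))

# Optional: log which device (CPU/GPU) each op is actually placed on.
tf.debugging.set_log_device_placement(True)
```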
Hopefully when no games are being played on the machine, the error doesn't happen and there won't be a need for alternative explanations.
Collecting a profile with nvprof or Nsight Systems would be a good starting point for understanding potential bottlenecks and what you can do to speed up training. As mentioned above, the I/O/preprocessing pipeline is a common bottleneck. For that, make sure you are using the tf.data API to avoid handling input data directly in Python (see the sketch below). You might also consider using DALI to perform data preprocessing on the GPU itself. You can find examples of these approaches at https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/RN50v1.5
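To make the tf.data suggestion concrete, here is a rough sketch (assuming TF 2.x; the file paths, labels, image size, and batch size are placeholders) of a pipeline that decodes and resizes images inside the TF graph and overlaps preprocessing with GPU compute:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE on newer TF versions

# Placeholder lists; in practice these come from scanning your dataset directory.
file_paths = ["images/cat_0.jpg", "images/dog_0.jpg"]
labels = [0, 1]

def decode_and_resize(path, label):
    # Everything here runs inside the TF graph, so no per-image Python code
    # executes during training.
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])
    image = image / 255.0
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
           .shuffle(10_000)
           .map(decode_and_resize, num_parallel_calls=AUTOTUNE)
           .batch(64)
           .prefetch(AUTOTUNE))   # prepare the next batches while the GPU trains

# model.fit(dataset, epochs=10)
```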
Over the course of this week I did some research and was able to answer these questions, but others appeared.
First of all, I confirmed that I'm using my GPU with the nvidia-smi command, but GPU usage was really low (around 6 or 7%). So I set a parameter named "workers" to 16 (the number of my CPU's threads) in the fit method, and GPU usage rose to 20%. I can conclude that there is indeed a bottleneck, but I'm trying to find the other reasons that keep my GPU usage low. I mean, 20% is better than 6%, but it could be better.
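For context, the fit call now looks roughly like this (simplified, with `model` and `train_seq` standing in for my actual Keras model and `keras.utils.Sequence`):

```python
# `workers`/`use_multiprocessing` only apply to generator or
# keras.utils.Sequence inputs, not to tf.data datasets.
model.fit(
    train_seq,
    epochs=10,
    workers=16,               # parallel CPU workers preparing batches
    use_multiprocessing=True,
    max_queue_size=32,        # batches kept ready ahead of the GPU
)
```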
I want to ask if you can explain in detail what that "tf.data" you mentioned is, and where I can find more information about data preprocessing, because I read that loading all the images into memory before training would raise my GPU usage.
tf.data datasets allow you to express your preprocessing within the low-level TensorFlow computational graph and avoid unnecessary data copies and trips through the Python interpreter.
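If your dataset fits in RAM, `Dataset.cache()` gives you the "load everything into memory" behavior you mention without writing a custom loader; a small sketch, reusing the placeholder names from the pipeline above:

```python
# After the first epoch, cache() keeps the decoded images in memory,
# so later epochs skip disk reads and JPEG decoding entirely.
dataset = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
           .map(decode_and_resize, num_parallel_calls=AUTOTUNE)
           .cache()                      # hold decoded images in RAM
           .shuffle(10_000)
           .batch(64)
           .prefetch(AUTOTUNE))
```

The tf.data guide and the data performance guide on tensorflow.org cover these patterns in more depth.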
Thank you very much. It seems that the low GPU utilization is a matter of code rather than configuration, since the GPU is in fact being used.