I can run CUDA three times, then gpu stops responding

cayennext · May 2, 2013, 10:52pm

I can run a program containing a CUDA kernel only three times; if I run a CUDA program again after that, it hangs.

Which programs hang:

The simplest CUDA program possible, just an empty kernel call

#include "cuda.h"

__global__ void my_kernel()
{
  ;
}

int main()
{
  my_kernel<<<1, 1>>>();
  return 0;
}

$ nvcc -g -G simplest.cu -o simplest
$ ./simplest

Any program from the SDK samples

The limit applies across programs: I can run deviceQuery, then simplest, then deviceQuery, and the next launch still hangs.

I captured nvidia-smi output before running any program and after the first, second, and third runs; the results are the same each time. The output is here: http://pastebin.com/V3De5mZB

For example, when I run deviceQuery repeatedly, I get output like this: http://pastebin.com/PmFbkrmT . This is the history of my deviceQuery launches. At first I get good output, on the third launch I get good output after a 10-second lag, and on the launch after that I get:

cudaGetDeviceCount returned 10
-> invalid device ordinal

Additional info: I am using a different card (the integrated one) to render the screen. I have one ZOTAC GT630, which is based on a Fermi chip. When the system boots, /dev/nvidiactl and /dev/nvidia0 do not exist, so I create them by running sudo nvidia-script.sh in a terminal, where nvidia-script.sh is a script I copied the code into. I am using the 32-bit version of Ubuntu 12.04.

How do I deal with this? I am so pressed for time to finish the job that I am actually coding and resetting the machine as I go ;)

vacaloca · May 3, 2013, 12:04am

I run Ubuntu, albeit x64 versions, and I’ve never had to create those /dev(ices) manually. If you can start over with a fresh install, use the drivers available in the Ubuntu repositories… You shouldn’t have any weird issues or have to create anything manually. If the default nvidia-drivers in 12.04’s repositories are too old for you, you can do either:

sudo apt-add-repository ppa:ubuntu-x-swat/x-updates
or
sudo add-apt-repository ppa:xorg-edgers/ppa
and then
sudo apt-get update
sudo apt-get install <package>

where <package> represents the NVIDIA driver package for the version you want; the packages available in the repositories listed above can be found by browsing here:
http://www.ubuntuupdates.org/ppa/ubuntu-x-swat
http://www.ubuntuupdates.org/ppa/xorg-edgers

It also could be that your empty kernel just happens to leave the card in a weird/unreliable state. Have you tried other SDK examples? Do those work correctly? If so, I’d try to compile/run something that makes sense and isn’t doing just nothing and see if you have no issues then.

cayennext · May 3, 2013, 12:50am

I installed the drivers that come with CUDA from https://developer.nvidia.com/cuda-downloads, the Ubuntu 11.04 version.

The empty kernel is the smallest test case I could reduce the problem to. Everything meaningful that I tried also hangs the device after three runs.

vacaloca · May 4, 2013, 1:04am

The reason I didn’t recommend installing the drivers that come with CUDA is that those standalone drivers don’t tend to behave well when your kernel gets updated – i.e. your updated kernel might boot without an NVIDIA module generated.

I’ll admit I don’t have experience with running off integrated graphics on a desktop since my board doesn’t have that capability. What I do know is that in my case, the Ubuntu repository versions install the device(s) automatically. Perhaps it doesn’t happen in your case because you’re driving your display with the integrated video… not sure if the repository packaged drivers would behave differently. The only other reason you might need to create those manually is if you’re running a headless server, which might or might not parallel driving the display with integrated graphics.

Other than what I’ve mentioned, my other suggestion is to switch motherboard slots (if possible) and try running the code and see if you have issues. I don’t think this card draws any external power from PCI-E 6/8 pin connectors, but you should also make sure that your power supply is stable.

To figure out the problem or discard possible culprits, I’d suggest trying a different O/S on that system (maybe even Windows, to discard any weird Linux issues that might come up – you can always make a backup with CloneZilla if need be) or even moving the card to another system and seeing if it works there… just some ideas for you to attempt to figure out where the issue lies.
