GPU breaks down after error

hi all,

I have been trying to port my code to CUDA, and in the process I have to debug it.

Whenever my program fails to run properly a few times, I find that the GPU freezes up. I can still compile, but when I run the executable I get different but similar error messages each time it freezes:

“could not allocate device memory”
“no CUDA-capable device is detected”

More recently,

“unspecified launch failure”, and after that the program just seems to hang every time I try to run it. Using nvidia-smi, I can see that GPU utilization is at 100% but no memory is occupied. Normally the program takes up at least 30-40% of the memory.

My workaround has been to restart the workstation, but since other users may be using it, that is not a convenient solution.

So, is there a way (a command, maybe) to restart just the GPU? Or better yet, to “re-initialize” the GPU?

I am using a C1060 on a KDE Linux workstation, with the latest versions of the driver and compiler.

Thanks!

I am currently debugging this issue on our machine. I still get a lot of complaints from users that the GPU suddenly cannot be accessed.

Does root also have no problem accessing the GPU when this happens for you? My root account does not have the problem.

The machine is running CentOS 5.5 with a C1060 card.

I will update if I find anything.
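
One way to follow up on the root-versus-user question is to look at the NVIDIA device nodes from an ordinary user account. The sketch below is only a diagnostic aid, not a confirmed fix: it assumes the driver exposes its nodes under /dev/nvidia* (as the Linux driver normally does), and the helper name check_nvidia_device_nodes is made up for this example. If the nodes are missing or not readable and writable for a normal user while root works fine, that would point at device-node creation or permissions rather than at the card itself.

```python
import glob
import os


def check_nvidia_device_nodes(pattern="/dev/nvidia*"):
    """Return (path, readable, writable) for each path matching the pattern.

    The default pattern is an assumption about where the driver puts its
    device nodes; pass a different pattern if your system differs.
    """
    results = []
    for path in sorted(glob.glob(pattern)):
        readable = os.access(path, os.R_OK)  # can the current user read it?
        writable = os.access(path, os.W_OK)  # can the current user write it?
        results.append((path, readable, writable))
    return results


if __name__ == "__main__":
    nodes = check_nvidia_device_nodes()
    if not nodes:
        print("no /dev/nvidia* nodes found (driver module may not be loaded)")
    for path, readable, writable in nodes:
        mode = oct(os.stat(path).st_mode & 0o777)
        print(f"{path}: mode={mode} readable={readable} writable={writable}")
```

Running it once as the affected user and once as root, right after the GPU becomes inaccessible, should show whether the two accounts see the same nodes with the same access.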
