Kernel works on GTX 280/295/480 but not on C2050: unspecified launch failure

Hi!

I get “unspecified launch failure” with a kernel on a Tesla C2050. The same kernel works on my GTX cards, including a GTX480.
The kernel is long and complicated, so I would rather not post it. Are there any common mistakes?

I have read in one thread here that bad shared memory usage can be a problem on Fermi cards, but then it shouldn’t work on the GTX480 either, right? Also, I index the shared memory with constants ([0], [1], …) and not with a variable index, so out-of-bounds access should not be a problem here.

Thanks for any help,
Philipp.

Your C2050 and GTX480 are effectively identical from a CUDA coding and toolchain point of view, so it might well be that you either have a hardware problem with the C2050, or a toolchain/driver/configuration problem (if the C2050 is not installed in the same machine as the GTX480 is).

Hi,

OK, so each machine has two cards: two GTX480s in one and two Teslas in the other. The crazy thing is that other programs work fine on the Tesla. But I will check the driver versions, thank you!

Philipp.

Edit:
The GTX480 says (deviceQuery from the SDK):
CUDA Driver Version: 3.10
CUDA Runtime Version: 3.0

The Tesla says:
CUDA Driver Version: 3.10
CUDA Runtime Version: 3.10
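
The same two numbers can also be read from inside a program, which makes it easy to log them next to a failing run. The following is only a minimal sketch, not part of the original code; it relies on the runtime API encoding versions as 1000 * major + 10 * minor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0;
    int runtimeVersion = 0;

    // Highest CUDA version supported by the installed driver.
    cudaError_t err = cudaDriverGetVersion(&driverVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDriverGetVersion: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Version of the runtime library this binary was built against.
    err = cudaRuntimeGetVersion(&runtimeVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaRuntimeGetVersion: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Both values are encoded as 1000 * major + 10 * minor.
    printf("CUDA Driver Version:  %d.%d\n",
           driverVersion / 1000, (driverVersion % 1000) / 10);
    printf("CUDA Runtime Version: %d.%d\n",
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10);

    // A runtime newer than what the driver supports cannot work.
    if (driverVersion < runtimeVersion) {
        fprintf(stderr, "Warning: runtime is newer than the driver supports\n");
        return 1;
    }
    return 0;
}
```

Running this on both machines shows whether the two binaries were really built against different runtime versions, as the deviceQuery output above suggests.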

What operating system is this, and are you using the C2050 display output? It could be that the C2050 is just slow enough compared to the GTX480 that it gets hit by the driver watchdog timer, while the GTX480 finishes a little bit faster and escapes it.
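
Whether the watchdog applies to a given card can be checked without physical access to the machine. This is a rough sketch that assumes a toolkit recent enough to provide `cudaDeviceGetAttribute`; toolkits from the time of this thread expose the same flag as the `kernelExecTimeoutEnabled` field of `cudaDeviceProp`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            return 1;
        }

        // Non-zero means the driver enforces a run time limit on kernels
        // for this device (typically because it drives a display).
        int timeout = 0;
        err = cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            return 1;
        }

        printf("Device %d: %s, run time limit on kernels: %s\n",
               dev, prop.name, timeout ? "yes" : "no");
    }
    return 0;
}
```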

It’s not my machine; I’m just using it via ssh, so I don’t know about the display output. All our systems run Ubuntu.
Is there a simple call I can make at the beginning of the code so that the whole program runs on GPU #1 instead of GPU #0?

The kernel runtime is less than a millisecond, so I think the watchdog can’t be the problem, correct?

cudaSetDevice() will let you select a given GPU. With a runtime that short, there is no way the watchdog has anything to do with it. Is the C2050 machine running other kernels at the same time, and are the GPUs set to compute exclusive? There is a known bug in the Linux driver for multi-GPU machines running in compute exclusive mode with kernels which use a lot of registers - discussed here. It could be that…
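
A minimal sketch of the device selection, for illustration only. Taking the device index from the first command-line argument (defaulting to 1) is a hypothetical convention, not something from the original program, and the compute mode query assumes a toolkit that provides `cudaDeviceGetAttribute`; older toolkits report it in the `computeMode` field of `cudaDeviceProp`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    // Hypothetical convention: device index as first argument, default 1.
    int dev = (argc > 1) ? atoi(argv[1]) : 1;

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (dev < 0 || dev >= count) {
        fprintf(stderr, "device %d requested, but only %d present\n", dev, count);
        return 1;
    }

    // Select the GPU before any allocation or kernel launch, so that
    // everything that follows runs on it.
    err = cudaSetDevice(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Report whether the device is in the default (shared) compute mode.
    int mode = 0;
    err = cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Using device %d, compute mode: %s\n", dev,
           mode == cudaComputeModeDefault ? "default"
                                          : "non-default (exclusive or prohibited)");

    // ... allocations and kernel launches go here ...
    return 0;
}
```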

OK, using the other device did not change anything.

I commented out lines one by one, so now I know where the error occurs: I read from an array in global memory. The number of reads is basically random, but with the current parameters it is about 10 reads, from array[9999] down to array[9990] (or thereabouts). I used printf output to debug, so I put some lines into the kernel with “manual” access to those elements.

[codebox]
printf("[9999].w = %f\n", Particles[9999].w);
printf("[9998].w = %f\n", Particles[9998].w);
[/codebox]

This works.

If I let the thread ID index the array in the printf,

[codebox]
printf("[%d].w = %f\n", 9999-tid, Particles[9999-tid].w);
[/codebox]

it doesn’t work and I get the launch failure.

The crazy thing is that this is only a problem in block 0. All other blocks do the same thing, just with an offset of 10000 in the array index, e.g. array[19999] down to array[19990]. Only block 0 causes the problem.

Any ideas?

The obvious explanation is that tid takes the array index negative or out of bounds. As for how to fix it, well, it is your code…

;)
I checked the indices with an if statement: nothing out of bounds…
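
For reference, such an index check could look roughly like the sketch below. It is not the original kernel: `checkedRead`, `out`, `errorCount` and the `float4` element type are assumptions made for illustration (the thread only shows that the elements have a `.w` member), and the sizes mirror the 10000 elements per block mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: elements are float4; the thread only shows a .w member.
__global__ void checkedRead(const float4 *particles, int numParticles,
                            int perCell, float *out, int *errorCount)
{
    int tid = threadIdx.x;
    // Last slot of this block's cell, walking downwards with tid.
    int idx = (blockIdx.x + 1) * perCell - 1 - tid;

    if (idx < 0 || idx >= numParticles) {
        atomicAdd(errorCount, 1);  // count the bad access instead of doing it
        return;
    }
    out[blockIdx.x * blockDim.x + tid] = particles[idx].w;
}

int main()
{
    const int perCell = 10000, blocks = 2, threads = 10;
    const int numParticles = perCell * blocks;

    float4 *particles = 0;
    float *out = 0;
    int *errorCount = 0;
    cudaMalloc((void **)&particles, numParticles * sizeof(float4));
    cudaMalloc((void **)&out, blocks * threads * sizeof(float));
    cudaMalloc((void **)&errorCount, sizeof(int));
    cudaMemset(particles, 0, numParticles * sizeof(float4));
    cudaMemset(errorCount, 0, sizeof(int));

    checkedRead<<<blocks, threads>>>(particles, numParticles, perCell,
                                     out, errorCount);
    // cudaDeviceSynchronize() needs CUDA 4.0 or later; older toolkits
    // use cudaThreadSynchronize() for the same purpose.
    cudaError_t err = cudaDeviceSynchronize();

    int hostErrors = 0;
    cudaMemcpy(&hostErrors, errorCount, sizeof(int), cudaMemcpyDeviceToHost);
    printf("launch status: %s, out-of-range indices: %d\n",
           cudaGetErrorString(err), hostErrors);

    cudaFree(particles);
    cudaFree(out);
    cudaFree(errorCount);
    return (err == cudaSuccess && hostErrors == 0) ? 0 : 1;
}
```

Note that a check like this only validates the index against the size the kernel was told about. It says nothing about whether the pointer itself is valid on the device or whether the allocation is really that large.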

Try running the code with cuda-memcheck on the C2050.
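
Alongside cuda-memcheck, it helps to check the status of every runtime call and of the kernel launch itself, so that the failure is reported at the call that caused it. A minimal sketch follows; the `CUDA_CHECK` macro and the `scale` kernel are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file and line on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d = 0;
    CUDA_CHECK(cudaMalloc((void **)&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d, 0, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());       // invalid launch configuration etc.
    // Errors raised while the kernel ran surface here. On toolkits older
    // than CUDA 4.0, use cudaThreadSynchronize() instead.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d));
    printf("kernel completed without error\n");
    return 0;
}
```

The memory checker is run by prefixing the normal command line, e.g. `cuda-memcheck ./myprogram` (the program name is a placeholder). Recent toolkits ship `compute-sanitizer` in its place.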

I will do that, thanks!

Now :wacko: :wacko:

This does not work:

[codebox]
if (tid == 0) printf("[0].w = %f \n", Particles[max_number_of_particles_per_cell-1 -tid].w);
[/codebox]

And this does work:

[codebox]
if (tid == 0) printf("[0].w = %f \n", Particles[max_number_of_particles_per_cell-1].w);
[/codebox]

tid is an integer! I give up!