CUDA kernel slow / times out when writing values to results array


I am new to CUDA and have a question that maybe one of you can help with.

I am basically doing a triangle-triangle intersection test using CUDA.

In the kernel code, each triangle searches a list of other triangles and finds the closest one
based on centre-to-centre distance.

Inside the loop, the closest distance found so far is kept in a float called mgdist.

If I write it to the results array at the end of the loop, say d_close[i] = mgdist, the code runs really slowly.
If I instead set it to, say, the first node of the triangle it points to, e.g. d_close[i] = node0[i],
then it is fine.

Does anything obvious ring a bell with anyone?

Surely the compiler is not so clever that it skips the loop when mgdist is not written to a
device result array?

Cheers Guys

The CUDA compiler aggressively eliminates dead code, that is, any computation that does not ultimately contribute to data written to global memory. From your description, that is most likely what you are seeing: when you store node0[i], nothing depends on mgdist, so the whole search loop can be removed and the kernel only looks fast; when you store mgdist, the loop actually has to run and you see its real cost. It is impossible to know for sure without buildable, runnable code that reproduces the behavior. You can disassemble the generated binary with cuobjdump --dump-sass and compare the code generated for the two variants.
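
To make the dead-code point concrete, here is a minimal sketch of the two variants described in the question. It is not your code: the float3 centre array, the float type of node0, the synthetic test data, the helper names (nearest_dist2, store_distance, store_node, run_variant) and the file name dce.cu are all assumptions made up for illustration. Only the names mgdist, d_close and node0 come from the question.

```cuda
// Build (dce.cu is a placeholder file name):  nvcc -O3 -o dce dce.cu
// Inspect the generated machine code:         cuobjdump --dump-sass dce
#include <cstdio>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// Squared centre-to-centre distance from triangle i to its nearest neighbour.
// (Assumed data layout: one float3 centre per triangle.)
__device__ float nearest_dist2(const float3 *centre, int n, int i)
{
    float mgdist = FLT_MAX;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = centre[j].x - centre[i].x;
        float dy = centre[j].y - centre[i].y;
        float dz = centre[j].z - centre[i].z;
        mgdist = fminf(mgdist, dx * dx + dy * dy + dz * dz);
    }
    return mgdist;
}

// Variant A: mgdist is written to global memory, so the loop has to execute.
__global__ void store_distance(const float3 *centre, const float *node0, int n, float *d_close)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float mgdist = nearest_dist2(centre, n, i);
    d_close[i] = mgdist;
}

// Variant B: mgdist is computed but never stored, so the compiler may remove the loop.
__global__ void store_node(const float3 *centre, const float *node0, int n, float *d_close)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float mgdist = nearest_dist2(centre, n, i);
    (void)mgdist;              // result unused: nothing observable depends on it
    d_close[i] = node0[i];
}

// Runs one variant on n synthetic triangles, reports kernel time, returns d_close.
std::vector<float> run_variant(bool keep_distance, int n, float *elapsed_ms)
{
    std::vector<float3> h_centre(n);
    std::vector<float> h_node0(n), h_close(n);
    for (int k = 0; k < n; ++k) {
        h_centre[k] = make_float3((float)k, 0.0f, 0.0f);  // centres on a line, 1 apart
        h_node0[k] = (float)(3 * k);                      // made-up node values
    }
    float3 *d_centre; float *d_node0, *d_close;
    cudaMalloc(&d_centre, n * sizeof(float3));
    cudaMalloc(&d_node0, n * sizeof(float));
    cudaMalloc(&d_close, n * sizeof(float));
    cudaMemcpy(d_centre, h_centre.data(), n * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemcpy(d_node0, h_node0.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    auto kernel = keep_distance ? store_distance : store_node;
    int threads = 256, blocks = (n + threads - 1) / threads;
    kernel<<<blocks, threads>>>(d_centre, d_node0, n, d_close);  // warm-up launch

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(d_centre, d_node0, n, d_close);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(elapsed_ms, start, stop);

    cudaMemcpy(h_close.data(), d_close, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_centre); cudaFree(d_node0); cudaFree(d_close);
    return h_close;
}

int main()
{
    const int n = 1 << 14;
    float ms_a = 0.0f, ms_b = 0.0f;
    run_variant(true, n, &ms_a);
    run_variant(false, n, &ms_b);
    printf("store mgdist: %.3f ms\nstore node0:  %.3f ms\n", ms_a, ms_b);
    return 0;
}
```

If the compiler does remove the loop from store_node, its SASS listing should be much shorter than that of store_distance and contain no loop over j, and the timings should differ accordingly. That would mean the "fast" variant is not really doing the search at all, so the slow timing is the true cost of the loop and is the thing to optimize.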