I have two float arrays of the same size in global memory, both allocated with cudaMalloc and initialized in the same way.
In my kernel, the following code is called at the end of the kernel body (the arrays are called target and result):
apply_transformation(myPoint, &(target[tid * 3]));
apply_transformation(myPoint, &(result[tid * 3]));
If I comment out the second line, the kernel runs 10 times faster! I have no idea why. Writing to the "result" array appears to be much, much slower than writing to the "target" array, and I am completely baffled as to why that is the case.
I have verified that each thread writes only to its own distinct region of the array, so there is no overlap or write contention between threads.
Any ideas or suggestions would be greatly appreciated.
P.S.: Another odd thing I just noticed: when I run it under the Visual Profiler with both lines in place, I get 536 rows of output, but without the second line I get only 230 rows (function calls). What is going on? I also see some large extra data transfers under the memcpyHtoD and memcpyHtoA entries, which disappear when the second line is commented out.
What is memcpyHtoA?