I have two float arrays of the same size in global memory, both allocated with cudaMalloc and initialized in the same way.
In my kernel, the following code is called at the end of the kernel body (the arrays are called target and result):
apply_transformation(myPoint, &(target[tid * 3]));
apply_transformation(myPoint, &(result[tid * 3]));
If I comment out the second line, the kernel runs 10 times faster! I have no idea why. Writing to the "result" array appears to be much, much slower than writing to the "target" array, and I am completely baffled as to why that is the case.
I have verified that each thread writes only to its own distinct region of the array, so there is no overlap or write contention between threads.
Any ideas or suggestions would be greatly appreciated.
P.S.: Another odd thing I just noticed: when I run it under the Visual Profiler with both lines in place, I get 536 rows of output, but without the second line I get only 230 rows (function calls). What is going on? I also see some large extra data transfers under the memcpyHtoD and memcpyHtoA entries, which disappear when the second line is commented out.
What is memcpyHtoA?