In way 1, writes to global memory are random, while in way 2 they are contiguous. So way 2 should be faster than way 1.
But the timing results are similar, and I wonder why. It doesn't make sense. I also checked the PTX code and found that all the local variables are stored in registers.
But registers are fast to access. Can anybody give me some advice? Thanks very much!
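Since the original kernels aren't posted, here is a minimal sketch of what I understand "way 1" and "way 2" to mean; the kernel names and the `perm` permutation array are my own hypothetical stand-ins:

```cuda
// Hypothetical illustration of the two write patterns (names are mine).
// way 1: each thread writes to a scattered location taken from a
// permutation table, so a half-warp touches non-contiguous addresses
// and the stores cannot be combined into one memory transaction.
__global__ void way1_scattered(float *out, const int *perm,
                               const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[perm[i]] = in[i];   // random destination per thread
}

// way 2: thread k writes element k, so consecutive threads write
// consecutive addresses and the half-warp's stores coalesce.
__global__ void way2_contiguous(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];         // contiguous destination
}
```

If your kernels look roughly like this, the memory-system behavior of the two stores really should differ, which is why checking the profiler counters (as suggested below in the thread) is the right next step.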
In my opinion, I would check the two main points below.
The size of the processed data: it may be too small to show the advantage of coalesced reads and writes.
Get the counters from the CUDA profiler; there are many useful counters for evaluating a CUDA program. In this case you should pay attention to incoherent (uncoalesced) loads and stores, and to warp serialization (bank conflicts).
good luck. :)
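To collect those counters with the command-line profiler that shipped with CUDA toolkits of that era, something like the following should work; the config filename and app name are my own placeholders:

```shell
# Enable the command-line profiler; results land in cuda_profile.log
export CUDA_PROFILE=1
export CUDA_PROFILE_CONFIG=./profile_config

# Counters to collect (only a few can be active per run):
cat > profile_config <<'EOF'
gld_incoherent
gst_incoherent
warp_serialize
divergent_branch
EOF

./my_app   # run your program as usual, then inspect cuda_profile.log
```

One caveat I believe applies here: on compute 1.2/1.3 parts such as the GTX 280, the `*_incoherent` counters always read zero because the coalescing rules changed; the per-transaction-size load/store counters are the ones to watch there instead, so check your profiler version's counter list.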
I use 16 registers per thread and 12,352 bytes of shared memory per block.
Occupancy is 50%, because the shared memory per block is more than 8 KB.
My GPU is a GTX 280.
(Although, every operation has roughly a 24-clock latency due to potential register read-after-write stalls.
This is why we should schedule at least 192 active threads per SM: a warp of 32 threads issues over 4 clocks on the SM's 8 scalar processors,
so 24 / 4 = 6 warps, i.e. 6 * 32 = 192 threads, will cover this latency. )
But in my case, I already use 512 threads per block and 256 active threads per SM. Why do I still see long register latency?
So, for way 1, is the problem caused by the while loop? It makes each thread access global memory too many times.
If each thread wrote to global memory only once, even randomly, I think it would be faster.
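If that guess is right, one thing to try is accumulating in a register inside the while loop and issuing the global store only once per thread at the end. The sketch below is hypothetical, since the original kernel isn't posted, and `dest`/`iters` are stand-in names:

```cuda
// Hypothetical rework of a kernel that writes global memory inside a
// loop: keep the running value in a register, store it once per thread.
__global__ void accumulate_then_store(float *out, const float *in,
                                      const int *dest, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;        // lives in a register
    int k = 0;
    while (k < iters) {      // many reads, but no global writes here
        acc += in[k * n + i];
        ++k;
    }
    out[dest[i]] = acc;      // single (possibly random) global store
}
```

Whether this helps depends on whether your loop really can defer its stores; if each iteration must write a different global location, the loop structure itself isn't the issue and the scattered addresses are.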
Can anyone give me some advice about this? Thanks a lot!