Ran a experiment with two implementations of a 64 bit double scan/reduction which returns the minimum element with the FIRST occurrence of that min element(there may be multiple min values).
Also calculate the GB/s for each implementation as well as the average running time.
Both implementations attempt to read coalesced memory from the global double array, but use two different methods.
The first uses the standard
for(idx=threadIdx.x+blockIdx.x*blockDim.x;idx<num_elements;idx+=blockDim.x*gridDim.x)..
method which limits the number of thread blocks to exactly fit the number of SMs.
The second method uses templates to somewhat dynamically calculate the work per thread and uses that value in the kernel.(Very similar to the method that Jimmy P uses in his reduction code).
What surprised me was that the latter implementation was about 10% faster than the supposed ‘optimal’ former implementation.
for example an array of 16,777,343 this is the output (averaged over 100 reps)
Num Elements= 16777343
Starting CPU..
CPU time= 879.198 , CPU ans= -987654 , index= 9997
CPU throughput= 15.266 GB/s
Starting GPU0..
Num threads= 256 , total num blocks= 104
GPU time0= 82.505 , GPU ans= -987654 , index= 9997
GPU throughput0= 162.68 GB/s
Starting GPU1..
Num threads= 256 , total num blocks= 513
GPU time2= 76.2341 , GPU ans2= -987654 , index2= 9997
GPU throughput1= 176.061 GB/s
timing is in microseconds.
The CPU used is a I-7 3770K 3.9 GHz 4-core, and the GPU is the Tesla K20c in W7 64.
here is the sample code:
the first implementation starts on line #241 while the second starts on line #306.
What is even more interesting is that the ‘faster’ second implementation has two parts, while the first does not( uses instead __threadfence() ). Maybe that is the issue.
Maybe I made a mistake in calculating the correct number of max blocks of 256 for the K20c. As I understand it it has 13 SMs, each of which can handle 2048 threads.