To calculate an index, I need to take the dot product of two large arrays and then sum the resulting array through a reduction (there is a reduction example in the CUDA SDK). However, since I have 5,000 such indexes to calculate, I use a two-dimensional grid of blocks (A * 5,000 blocks). The reason I use A blocks per index is that I found performance actually degrades when the block size grows too large (I currently use a block size of 64). I use shared memory in each block for the dot product, and the results in each block's shared memory can then be reduced to a single sum.

However, I am having trouble adding up the results from the shared memory of different blocks to get the final index. I understand that shared memory in different blocks can't communicate directly, so what is the most efficient way to solve this? Currently I write these intermediate results to global memory and then launch another kernel to sum them into the final indexes, but I believe this is too time-consuming.
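For reference, here is a minimal sketch of the first kernel in the approach described above. All names (`dotPartial`, `partial`, `BLOCKS_PER_INDEX`) are illustrative, not from my actual code, and I am assuming each index's two input arrays are stored as rows of length n in `a` and `b`; `BLOCKS_PER_INDEX` corresponds to the "A" above.

```cuda
#define BLOCK_SIZE 64   // current block size, as mentioned above

// Launched as dim3 grid(BLOCKS_PER_INDEX, 5000), BLOCK_SIZE threads per block.
// Each block computes a partial dot product for one index, reduces it in
// shared memory, and writes its partial sum to global memory.
__global__ void dotPartial(const float *a, const float *b,
                           float *partial, int n)
{
    __shared__ float cache[BLOCK_SIZE];
    int index  = blockIdx.y;                          // which of the 5,000 indexes
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Grid-stride loop: each thread accumulates part of the dot product.
    float sum = 0.0f;
    for (int i = tid; i < n; i += stride)
        sum += a[index * n + i] * b[index * n + i];
    cache[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction within the block
    // (blockDim.x must be a power of two, which 64 is).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    // One partial sum per block goes to global memory.
    if (threadIdx.x == 0)
        partial[index * gridDim.x + blockIdx.x] = cache[0];
}
```

The second kernel then has to reduce each row of `partial` (A values per index) down to the final sum, which is the extra launch I would like to avoid.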
Any suggestion is appreciated.