Sharing information among blocks' shared memories

Hi folks:

To calculate an index, I need to take the dot product of two large arrays and then sum the resulting array with a reduction (there is a reduction example in the CUDA SDK). However, since I have 5,000 such indexes to calculate, I used a two-dimensional grid (A × 5,000 blocks) for the calculation. The reason I use A blocks to calculate one index is that I found system performance actually drops when the block size grows too large (I currently use a block size of 64). I use shared memory in each block for the dot product, and the results in each block's shared memory can then be reduced to one sum. However, I'm having trouble adding the results held in the shared memory of different blocks to get the final index. I understand that shared memory in different blocks can't communicate directly, so what's the most efficient way to solve this? Currently I write these intermediate results to global memory and then launch a second kernel to compute the indexes, but I believe this is too time-consuming.

Any suggestion is appreciated.

  1. Do not distribute the calculation of a single index across different blocks
  2. Use a single block to calculate multiple indices
  3. There is no need to use a 2D grid here. Reducing the number of blocks to the number of MPs you have might be even better, if you can distribute the work evenly among all the blocks
  4. Let’s say you have N MPs; then each block just writes a single sum to a global memory location, and your host code can read the N sums calculated by the N blocks and do the final sum on the CPU.
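A minimal sketch of suggestion 4, assuming hypothetical names (`dotPartial`, `a`, `b`, `partial`, `len`) not taken from the post: each block strides over the whole array, accumulates per-thread products, reduces them in shared memory, and writes exactly one partial sum to global memory. The host then reads back N floats (one per block) and finishes the sum on the CPU.

```cuda
// Sketch only: one partial dot product per block, final sum on the host.
// BLOCK_SIZE matches the 64-thread block size mentioned in the question.
#define BLOCK_SIZE 64

__global__ void dotPartial(const float *a, const float *b, float *partial, int len)
{
    __shared__ float cache[BLOCK_SIZE];

    // Grid-stride loop: each thread accumulates products from its
    // share of the elements, so any grid size covers the whole array.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < len;
         i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];

    cache[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction within the block
    // (BLOCK_SIZE must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    // One value per block goes to global memory; no inter-block
    // shared-memory communication is needed.
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}
```

Launched as `dotPartial<<<N, BLOCK_SIZE>>>(a, b, partial, len)` with N equal to the number of MPs, this leaves only N values for the host to copy back and add, which is cheap compared with a second kernel launch over all the intermediate results.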