Parallel Reductions across the vertical dimension of the grid


Starting from the best-performing kernel in the SDK reduction example, I tried to implement parallel reductions across the vertical dimension of the grid. Basically, each reduction is carried out by 64 blocks in the x dimension, and I stack a multiple of these across the y dimension.

So the grid size is 64 blocks in x, with 128 threads per block. The y dimension varies depending on the number of reductions I want to perform. The reductions are independent.

The number of elements in each reduction instance is 60000.
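For reference, the launch configuration described above might look like the following sketch. The kernel name and pointers are hypothetical (they are not from the actual code); only the grid/block shape is taken from the description:

```cuda
// Hypothetical sketch of the batched-reduction launch described above.
// Each grid row (blockIdx.y) reduces its own independent input of
// 60000 elements, using 64 blocks of 128 threads along x.
const int numReductions    = 20;     // the y dimension being varied
const int elemsPerReduction = 60000;

dim3 grid(64, numReductions);        // 64 blocks in x, one row per reduction
dim3 block(128);                     // 128 threads per block

// d_in holds numReductions * elemsPerReduction elements; d_partial holds
// 64 partial sums per reduction (one per x-block), reduced again afterwards.
reduceBatched<<<grid, block>>>(d_in, d_partial, elemsPerReduction);
```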

If I increase the number of reductions (the y dimension) up to 20, everything is fine and performance is great. Beyond 20, I get an “invalid device pointer” error.

What could be causing this? Am I exceeding the number of blocks I can launch? I checked device memory usage and I am far from running out of memory.

What causes “invalid device pointer”?
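One thing worth ruling out: CUDA kernel launches are asynchronous, so a failure inside a kernel (or an invalid launch configuration) is often only reported by the *next* runtime call, such as a cudaMemcpy, which then appears to be the culprit. A sketch of how to flush pending errors right after the launch (kernel and arguments are placeholders, not the actual code):

```cuda
// Sketch: check for a deferred error before blaming a later cudaMemcpy.
reduceBatched<<<grid, block>>>(d_in, d_partial, elemsPerReduction);

// Catches launch-configuration errors (bad grid/block dimensions, etc.).
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

// Catches errors raised while the kernel executes (e.g. bad addresses).
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));
```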


Since I did not get any suggestions, I’ll try to be more specific.

The “invalid device pointer” error is returned by a cudaMemcpy from device to host. Before this cudaMemcpy there are other, similar cudaMemcpys that do not return any errors. At the time of the error there are almost 4 GB of memory available on the device and 5 GB on the host. Is there any limitation with cudaMemcpy? There is no kernel call immediately before the cudaMemcpy that fails.
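To pinpoint which copy (and which pointer) actually fails, it helps to wrap every runtime call in a check that reports the file and line. This is a common pattern, not code from the thread; the macro name is my own:

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with file/line info on any CUDA runtime error.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t e = (call);                                         \
        if (e != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",             \
                    cudaGetErrorString(e), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage: each copy now reports exactly where it failed.
CUDA_CHECK(cudaMemcpy(h_out, d_out, nBytes, cudaMemcpyDeviceToHost));
```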

Please help!


You will probably get more responses if you post in the programming development forum rather than here, but I doubt you will get any useful replies until you post either the key parts of the code in question or a small test case that reproduces your problem. It is very hard to provide constructive suggestions otherwise.