Starting from the fastest kernel in the SDK reduction example, I tried to run several parallel reductions at once by stacking them along the vertical (y) dimension of the grid. Each reduction is carried out by 64 blocks in the x dimension, and I launch multiple such rows across the y dimension.
So the grid is 64 blocks in x, with 128 threads per block. The y dimension equals the number of reductions I want to perform. The reductions are independent of each other.
The number of elements in each reduction instance is 60000.
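For concreteness, the layout described above can be sketched roughly as follows. This is only an illustrative sketch, not the SDK kernel and not my actual code: `reduceRows`, `partialCount`, `kThreads`, `kBlocksX`, and the value 24 for y are hypothetical, and it assumes the inputs are stored back to back with one partial sum written per block. Under those assumptions, both the input offset and the output buffer size must scale with y.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // threads per block (from the post)
constexpr int kBlocksX = 64;   // blocks per reduction (from the post)

// Hypothetical helper: number of partial sums the kernel writes.
size_t partialCount(int blocksX, int numReductions)
{
    return (size_t)blocksX * (size_t)numReductions;
}

// Hypothetical kernel: grid row blockIdx.y reduces its own array of n floats.
__global__ void reduceRows(const float* in, float* partial, int n)
{
    __shared__ float sdata[kThreads];
    const float* row = in + (size_t)blockIdx.y * n;  // start of this reduction's data

    // Grid-stride loop along x only; y selects the reduction.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        sum += row[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    // One partial sum per block, indexed by (row, block in x).
    if (threadIdx.x == 0)
        partial[blockIdx.y * gridDim.x + blockIdx.x] = sdata[0];
}

int main()
{
    const int n = 60000;           // elements per reduction (from the post)
    const int numReductions = 24;  // y; hypothetical value above 20

    std::vector<float> h_in((size_t)numReductions * n, 1.0f);
    std::vector<float> h_partial(partialCount(kBlocksX, numReductions));

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_partial, h_partial.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 grid(kBlocksX, numReductions);
    reduceRows<<<grid, kThreads>>>(d_in, d_partial, n);

    cudaError_t err = cudaMemcpy(h_partial.data(), d_partial,
                                 h_partial.size() * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "copy back failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Finish each reduction on the host by summing its 64 partial sums.
    for (int r = 0; r < numReductions; ++r) {
        float total = 0.0f;
        for (int b = 0; b < kBlocksX; ++b) total += h_partial[(size_t)r * kBlocksX + b];
        if (total != (float)n) printf("row %d: unexpected sum %f\n", r, total);
    }

    cudaFree(d_in);
    cudaFree(d_partial);
    return 0;
}
```

If any buffer in the real code is sized for a fixed number of rows rather than for y, a layout like this would run past it once y grows.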
If I increase the number of reductions (y) up to 20, everything works and performance is great. With more than 20, I get an “invalid device pointer” error.
What could be the cause of this? Am I exceeding the number of blocks I can launch? I checked device memory usage and I am far from running out of memory.
What causes “invalid device pointer”?
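One way to narrow this down is to check every runtime call and the kernel launch separately, since CUDA runtime calls can also return error codes left over from earlier asynchronous launches, so the call that reports the error is not necessarily the one that caused it. The sketch below is a generic pattern under stated assumptions, not my actual code: `CUDA_CHECK`, `gridFits`, and `dummyKernel` are hypothetical names, device 0 is assumed, and the 64 x 24 grid is just an example beyond the failing threshold. It also prints the device's grid limits, which answers the "too many blocks" question directly.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: abort with file, line, and error string on any failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__, \
                    #call, cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Hypothetical helper: true if the grid is within the device's reported limits.
bool gridFits(const cudaDeviceProp& p, dim3 grid)
{
    return grid.x <= (unsigned)p.maxGridSize[0] &&
           grid.y <= (unsigned)p.maxGridSize[1] &&
           grid.z <= (unsigned)p.maxGridSize[2];
}

// Placeholder kernel standing in for the real reduction kernel.
__global__ void dummyKernel(float* out)
{
    if (threadIdx.x == 0)
        out[blockIdx.y * gridDim.x + blockIdx.x] = 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));  // device 0 assumed
    printf("max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    dim3 grid(64, 24);  // 64 blocks in x, 24 reductions in y (example values)
    if (!gridFits(prop, grid)) {
        fprintf(stderr, "grid exceeds device limits\n");
        return 1;
    }

    float* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, (size_t)grid.x * grid.y * sizeof(float)));

    dummyKernel<<<grid, 128>>>(d_out);
    CUDA_CHECK(cudaGetLastError());       // reports launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

With checks like these on every `cudaMalloc`, `cudaMemcpy`, and `cudaFree`, the first failing call is reported with its line number, which should show whether the pointer problem comes from an allocation, a copy, or the launch itself.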