I have some code that performs many parallel prefix sums (scans). It takes in a matrix, then performs a scan on each row. The implementation is pretty simple: the y-dimension of the grid is mapped to the rows of the matrix.
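To make the intended result concrete, here is a minimal host-side sketch of a per-row scan that the GPU output could be checked against. It is not the actual kernel. It assumes float data, a row-major layout, and an inclusive scan, none of which are stated above, and the names `scan_rows_reference` and `first_bad_row` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference: inclusive prefix sum of every row of a row-major matrix.
// Assumptions (hypothetical, not from the real code): float elements,
// row-major storage, inclusive scan.
std::vector<float> scan_rows_reference(const std::vector<float>& in,
                                       std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<float> out(in.size());
    for (std::size_t r = 0; r < rows; ++r) {
        float running = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            running += in[r * cols + c];
            out[r * cols + c] = running;
        }
    }
    return out;
}

// Returns the index of the first row in which any element of `gpu` differs
// from `ref` by more than `tol`, or `rows` if every row matches.
std::size_t first_bad_row(const std::vector<float>& gpu,
                          const std::vector<float>& ref,
                          std::size_t rows, std::size_t cols, float tol) {
    assert(gpu.size() == rows * cols && ref.size() == rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            if (std::fabs(gpu[r * cols + c] - ref[r * cols + c]) > tol) {
                return r;
            }
        }
    }
    return rows;
}
```

Copying the device result back and passing it to `first_bad_row` would show whether the bad values start at a particular row or are scattered across the matrix.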
I was trying to scale this up to very large matrices, and I observed the following curious behavior: When the number of rows exceeds a particular number, some values are not set properly. That particular number is
42 on a Tesla C2050
60 on a GTX 285.
No errors are thrown and I'm well under the max grid dimensions. Even so, the fact that the number is different on different pieces of hardware makes me think I'm hitting some hardware limitation. I'm thinking it might be related to the max number of resident warps (or threads), my reasoning being that
on the Tesla, there are 48 (max warps/MP) * 14 (MP) = 672 total max resident warps
on the GTX, there are 32 (max warps/MP) * 30 (MP) = 960 total max resident warps.
These numbers give the exact ratio at which my code is breaking! 672/960 * 60 = 42.
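The arithmetic behind that ratio can be written out as a small host-side sketch. The `Device` struct and the function names are hypothetical, and the per-device figures are just the ones quoted above, not values queried from the hardware.

```cpp
#include <cassert>

// Hypothetical description of a device, using the figures quoted in the post.
struct Device {
    int max_warps_per_mp;  // max resident warps per multiprocessor
    int mp_count;          // number of multiprocessors
};

// Total number of warps that can be resident on the device at once.
int max_resident_warps(Device d) {
    return d.max_warps_per_mp * d.mp_count;
}

// Scales a row threshold observed on `from` by the ratio of max resident
// warps, giving the threshold this hypothesis would predict on `to`.
int predicted_row_threshold(int observed_rows, Device from, Device to) {
    assert(max_resident_warps(from) > 0);
    return observed_rows * max_resident_warps(to) / max_resident_warps(from);
}
```

With `Device{32, 30}` for the GTX 285 and `Device{48, 14}` for the Tesla C2050, `predicted_row_threshold(60, gtx, tesla)` gives 60 * 672 / 960 = 42, which matches what I observe.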
Any ideas? Is there some limitation on the total number of non-resident warps/threads? Thanks!
PS: A bit more info in case it helps: the block dimensions are 512 by 1 and the grid dimensions are 1026 by r, where r is the number of rows.
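For reference, a rough host-side sketch of how much work that launch shape amounts to as r grows. The helper names are hypothetical, and it assumes a warp size of 32 and a limit of 65535 per grid dimension, which is the limit I am relying on for both cards.

```cpp
#include <cassert>

// Totals for one launch; 64-bit because the thread count grows quickly.
struct LaunchShape {
    long long blocks;
    long long warps;
    long long threads;
};

// Computes total blocks, warps and threads for a (block_x by 1) block and a
// (grid_x by rows) grid. Assumes a warp size of 32 unless told otherwise.
LaunchShape launch_shape(int block_x, int grid_x, int rows, int warp_size = 32) {
    assert(block_x > 0 && grid_x > 0 && rows > 0 && warp_size > 0);
    long long blocks = static_cast<long long>(grid_x) * rows;
    long long warps_per_block = (block_x + warp_size - 1) / warp_size;  // round up
    return LaunchShape{blocks, blocks * warps_per_block, blocks * block_x};
}

// True if both grid dimensions fit under the assumed per-dimension limit.
bool grid_within_limits(int grid_x, int grid_y, int limit = 65535) {
    return grid_x <= limit && grid_y <= limit;
}
```

At r = 42 this gives 43092 blocks of 16 warps each, and at r = 60 it gives 61560 blocks. Both grids pass `grid_within_limits`, which is why I don't think the grid dimensions themselves are the problem.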