I have a simple for loop that runs from 0 to N, where N is between 7,000 and 10,000. The code works on a CSR-packed matrix: it multiplies the elements in each column by the corresponding vector element, and on each iteration it increments tid (where tid = threadIdx.x + offset) by the number of elements in that column (offset). The loop seems to work fine in general. However, when the CSR-packed matrix holds more than 1,000,000 elements, it fails towards the end of the matrix: it starts returning zeros instead of the correct results.
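To make the description concrete, here is a minimal CPU-side sketch in C++ of what the loop is meant to compute. It is not my actual kernel. The names `scale_columns`, `values`, `col_counts` and `vec` are made up for illustration, and it assumes the stored values are laid out column by column, with `col_counts[j]` playing the role of the per-column offset described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the real kernel.
// values     : packed nonzero elements, assumed to be grouped column by column
// col_counts : number of stored elements in each column (the "offset" per column)
// vec        : one vector element per column
std::vector<float> scale_columns(const std::vector<float>& values,
                                 const std::vector<std::size_t>& col_counts,
                                 const std::vector<float>& vec)
{
    // There must be one vector element for every column.
    assert(vec.size() >= col_counts.size());

    std::vector<float> out(values.size(), 0.0f);
    std::size_t start = 0;  // index of the first element of the current column

    for (std::size_t j = 0; j < col_counts.size(); ++j) {
        // The running offset must never walk past the end of the packed data.
        assert(start + col_counts[j] <= values.size());

        for (std::size_t k = 0; k < col_counts[j]; ++k) {
            out[start + k] = values[start + k] * vec[j];
        }
        start += col_counts[j];  // advance by the number of elements in column j
    }
    return out;
}
```

Running the same input through something like this on the host and comparing it with the GPU output should at least show the index at which the zeros start.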
I’m using a GTX 570 with 1.28 GB of global memory. In this particular case, shared memory is not used for the calculations, though I plan to move what I can to shared memory once this issue is solved.
Any help or insight would be greatly appreciated.