I’m working on a sparse-matrix multiply kernel, where I give the card a sparse matrix and a block of dense vectors, and multiply the vectors by the sparse matrix. I’m running into issues when the block of dense vectors exceeds a certain width.
AxB=C, where A is sparse, 102400x102400, with 100 nonzero elements per row, and B and C are dense, 102400 elements long and N vectors wide.
If B and C are 64 vectors wide then the code works. If they are 128 vectors wide, I get incorrect results and sometimes hard Linux crashes.
If I increase the characteristic dimension to 819200, then I can only submit 8 vectors at a time. 16 leads to incorrect results and crashes.
Is there a limit to the amount of contiguous device memory which can be allocated with cudaMalloc and copied into with cudaMemcpy? My threshold seems to be around 25 megabytes.
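The 25 megabyte figure can be reproduced with a short calculation. This is only a sketch: it assumes 4-byte single-precision elements (the element type is not stated above), and the helper name dense_block_bytes is made up for illustration, not taken from the kernel code.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: each element of B and C is a 4-byte float.
constexpr std::size_t kElemBytes = 4;
constexpr std::size_t kMiB = 1024 * 1024;

// Bytes needed for one dense block (B or C): rows x vectors elements.
constexpr std::size_t dense_block_bytes(std::size_t rows, std::size_t vectors,
                                        std::size_t elem_bytes = kElemBytes) {
    return rows * vectors * elem_bytes;
}

// Largest widths that work: both come to exactly 25 MiB per block.
static_assert(dense_block_bytes(102400, 64) == 25 * kMiB, "102400 x 64");
static_assert(dense_block_bytes(819200, 8) == 25 * kMiB, "819200 x 8");

// Smallest widths that fail: both come to 50 MiB per block.
static_assert(dense_block_bytes(102400, 128) == 50 * kMiB, "102400 x 128");
static_assert(dense_block_bytes(819200, 16) == 50 * kMiB, "819200 x 16");
```

Under that assumption, both working cases top out at the same size per dense block and both failing cases are the next doubling, which is why the problem looks like a size threshold rather than something tied to the matrix dimension or the vector count.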
I’m using CUDA 1.0.