SpMV code from the guide does not seem to work for large matrices

Hello there,

First of all, let me point out that I'm relatively new to the world of GPU programming, so please bear with me.

This is an SpMV kernel for the CSR format, as given in the programming guide.

__global__ void SpMV(
    const float * csrNz_d, const int * csrCols_d,
    const int * csrRowStart_d, const float * x_d, float * y_d,
    const int num_rows
)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0; //or float ?!?!
        int row_start = csrRowStart_d[row];
        int row_end   = csrRowStart_d[row + 1];
        for (int jj = row_start; jj < row_end; jj++)
            dot += csrNz_d[jj] * x_d[csrCols_d[jj]];
        y_d[row] += dot;
    }
}

This is the relevant part of main():
int block_size = 512;
int n_blocks = dim.M/block_size + (dim.M%block_size == 0 ? 0:1);
SpMV <<< n_blocks, block_size >>> (csrNz_d, csrCols_d, csrRowStart_d, x_d, y_d, dim.M);

Using the terms from the guide, that would be:
SpMV <<< n_blocks, block_size >>> (data, indices, ptr, x, y, num_rows);
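
For reference, the device arrays are set up roughly like this before the launch (this is a simplified sketch; nnz, dim.N and the *_h names stand in for my actual host-side variables):

[codebox]// Rough sketch of the host-side setup (names are placeholders).
float *csrNz_d, *x_d, *y_d;
int *csrCols_d, *csrRowStart_d;
cudaMalloc((void**)&csrNz_d, nnz * sizeof(float));
cudaMalloc((void**)&csrCols_d, nnz * sizeof(int));
cudaMalloc((void**)&csrRowStart_d, (dim.M + 1) * sizeof(int)); // num_rows + 1 offsets
cudaMalloc((void**)&x_d, dim.N * sizeof(float));               // one entry per column
cudaMalloc((void**)&y_d, dim.M * sizeof(float));               // one entry per row
cudaMemcpy(csrNz_d, csrNz_h, nnz * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(csrCols_d, csrCols_h, nnz * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(csrRowStart_d, csrRowStart_h, (dim.M + 1) * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(x_d, x_h, dim.N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemset(y_d, 0, dim.M * sizeof(float)); // the kernel does y[row] += dot, so y must start at 0
[/codebox]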

Well, this seems to work perfectly for relatively small matrices of around 100,000 non-zero elements and sizes of around 50,000 x 50,000.
However, when I input a bigger matrix, e.g. 480,000 x 171,000 with approximately 6 million non-zero elements, the returned vector is all zeros. I have tried a lot of different matrices, and it only works for the smaller ones. I have placed error-checking statements after each CUDA call, but they do not report anything. It simply returns the y_d vector with all elements equal to 0.

I'm using an 8600M GT GPU.

Any suggestions why this could be happening ?

Cheers,

The display driver has a 5 second watchdog timer for CUDA kernels running on a GPU with an attached display. It is quite likely that the larger matrices take more than 5 seconds to execute and are getting killed by the driver. If you add some error handling after the kernel launch and check the return status things will become clearer.

It seems to take less than 5 seconds, because I have printf() calls before and after the SpMV <<< n_blocks, block_size >>> (csrNz_d, csrCols_d, csrRowStart_d, x_d, y_d, dim.M); call, and they print almost instantly (a lot less than 5 s).

Error handling like the following, placed after it, does not catch anything:

err = cudaGetLastError();
if (err != cudaSuccess){
    printf("!ERROR: %s\n", cudaGetErrorString( err ) );
    system("pause");
    exit(EXIT_FAILURE);
}

Kernel launches are asynchronous, so your “happen almost instantly” is because the kernel has only been queued for launch, not executed. As a result, your error checking probably won’t catch the error, because it is checking too early. Try this:

err = cudaThreadSynchronize();
if (err != cudaSuccess){
    printf("!ERROR: %s\n", cudaGetErrorString( err ) );
    system("pause");
    exit(EXIT_FAILURE);
}

and I suspect you might see something different.
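
For completeness, the launch itself and the kernel execution report their errors through different calls, so a pattern along these lines checks both (a sketch based on your existing snippet):

[codebox]SpMV<<<n_blocks, block_size>>>(csrNz_d, csrCols_d, csrRowStart_d, x_d, y_d, dim.M);

cudaError_t err = cudaGetLastError();   // catches launch/configuration errors
if (err == cudaSuccess)
    err = cudaThreadSynchronize();      // waits for the kernel; catches execution errors
if (err != cudaSuccess){
    printf("!ERROR: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
[/codebox]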

Thanks avidday!

However, the output says “Unknown Error”, which is not very definitive.

Does that mean the kernel is queued but not executed because it would take more than 5s ?

Initially I thought it might be because of memory or block limits, but I have checked everything: the variables to be copied to the device come to about 82 MB, and the launch uses 799 blocks of 512 threads each.

Cheers

I would have thought you would get a cudaErrorLaunchTimeout for a watchdog timer problem, but I am not 100% sure about that. There certainly isn't any sort of speculative or predictive protection against execution timeouts: every kernel that can be launched (i.e. all execution parameters are valid) will be launched, and if it then takes longer than 5 seconds, the watchdog timer will intervene. The only workarounds are either to reduce the problem size or to use a GPU without an active display attached (i.e. a dedicated compute GPU). Invalid execution parameters or resource exhaustion should (at least in my experience) generate something other than cudaErrorUnknown.
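
If your CUDA version is recent enough, you can also check whether the timeout applies to the device you are using; the device properties include a kernelExecTimeoutEnabled field, so something like this should tell you:

[codebox]cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // device 0; change if you run on another GPU
printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);
[/codebox]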

All I can suggest is to wrap the kernel call and error checking code in a host side timer and see what the actual execution time is. Everything points to the kernel execution time exceeding 5 seconds (the watchdog timer certainly is 5 seconds on Linux, I don’t use other CUDA platforms much so it could be different) and being a watchdog timer problem, but there could be something else too.

I am working on Windows. When I run the bigger matrices, my screen flashes very quickly during the execution stage; maybe that is an indication of it overrunning the 5 seconds. Unfortunately, I could not work out how to implement a timer. I have also tried the warp-per-row (vectorized) version from the CUDA programming guide, which is:

[codebox]__global__ void
spmv_csr_vector_kernel(
    const float * data,
    const int * indices,
    const int * ptr,
    const float * x,
    float * y,
    const int num_rows
)
{
    __shared__ float vals[512];

    int thread_id = blockDim.x * blockIdx.x + threadIdx.x; // global thread index
    int warp_id = thread_id / 32;                          // global warp index
    int lane = thread_id & (32 - 1);                       // thread index within the warp

    // one warp per row
    int row = warp_id;

    if (row < num_rows) {
        int row_start = ptr[row];
        int row_end   = ptr[row + 1];

        // compute running sum per thread
        vals[threadIdx.x] = 0;
        for (int jj = row_start + lane; jj < row_end; jj += 32)
            vals[threadIdx.x] += data[jj] * x[indices[jj]];

        // parallel reduction in shared memory
        if (lane < 16) vals[threadIdx.x] += vals[threadIdx.x + 16];
        if (lane <  8) vals[threadIdx.x] += vals[threadIdx.x +  8];
        if (lane <  4) vals[threadIdx.x] += vals[threadIdx.x +  4];
        if (lane <  2) vals[threadIdx.x] += vals[threadIdx.x +  2];
        if (lane <  1) vals[threadIdx.x] += vals[threadIdx.x +  1];

        // first thread writes the result
        if (lane == 0)
            y[row] += vals[threadIdx.x];
    }
}[/codebox]

It behaves the same way as the previous one. Now I would really like to time the execution on the GPU properly, but I assume the clock() function is not the appropriate way to do it (it always returns 0.000 s).

“Screen flashes” could also potentially indicate an out-of-bounds memory access (basically, you could be getting the GPU equivalent of a segfault). I really can’t help with the Windows host code, I am afraid, and I have minimal experience running CUDA on anything other than Linux, so I don’t really know what the various symptoms “look” like on other OSes. But obviously something pretty fatal is going on that is resulting in execution failure. Beyond that diagnosis I am not sure what else to suggest.
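
One platform-independent thing you could try is timing just the kernel with CUDA events, which don't rely on the host clock at all; a minimal sketch around your existing launch:

[codebox]cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
SpMV<<<n_blocks, block_size>>>(csrNz_d, csrCols_d, csrRowStart_d, x_d, y_d, dim.M);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);            // blocks until the kernel has finished (or been killed)
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("kernel time: %.3f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
[/codebox]

That will tell you directly whether the kernel really runs for more than 5 seconds.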

I think this may be the problem I am seeing. I call cudaThreadSynchronize() after the kernel launch and get the error 6 = cudaErrorLaunchTimeout. This indeed only happens when execution times exceed 5s - but then not always. One particular test case sometimes completes correctly having taken 12s on the GPU. However no single thread should be taking more than a few ms to complete. Perhaps the display driver is managing to schedule enough time on the device at the same time as my code (This is a GeForce 260, under CentOS 5, 2.3 driver), and the watchdog only triggers if it fails to do so for 5s? If this is right then the problem should go away when I run on a Tesla.

If that is 12s for a single kernel launch I would be questioning your timing. The 5 second timer in Linux is pretty reliable in my experience.

It doesn’t work like that on current hardware. When a compute context has the hardware, it has all the hardware until it either yields or gets terminated by the watchdog timer. There is no timesharing. With a long running kernel, you should notice the screen freeze until the kernel is finished or terminated. There will be no screen refresh during that time.

Yes. If you use a dedicated compute card without an active display manager using it, the watchdog timer issue disappears.

‘Unknown error’ usually means a segfault. You are writing beyond limits, either in shared memory or in global memory…
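
One quick way to rule that in or out before touching the kernel is to sanity-check the CSR arrays on the host; a rough sketch (num_cols and the *_h names are placeholders for your host-side arrays):

[codebox]// Hypothetical host-side validation of the CSR structure.
int ok = 1;
if (csrRowStart_h[0] != 0 || csrRowStart_h[num_rows] != nnz) ok = 0;
for (int r = 0; r < num_rows && ok; ++r)
    if (csrRowStart_h[r] > csrRowStart_h[r + 1]) ok = 0;        // offsets must be non-decreasing
for (int j = 0; j < nnz && ok; ++j)
    if (csrCols_h[j] < 0 || csrCols_h[j] >= num_cols) ok = 0;   // column indices must address x
printf(ok ? "CSR structure looks valid\n" : "CSR structure is broken\n");
[/codebox]

If that passes, the out-of-bounds access would have to come from the kernel itself rather than from the input data.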