CUSP loses performance in CUDA 3.2 RC

Hi,

I was trying to run some performance comparisons between CUSP and CUSPARSE, so I installed the 3.2 RC version of the driver and CUDA toolkit. The problem is that the performance I now get from CUSP is worse than with version 3.0. Is anyone else seeing this? Is it possible to use CUSPARSE with the 3.0 version of the driver and toolkit?

Can you provide some details about what became slower and by how much? Specifically, what steps do I need to follow to observe the performance degradation?

Which version of Cusp are you using? Have you tried the most recent development version?

Oh, sorry, I completely forgot about this thread.

I’m using the latest version of Cusp (Cusp 0.1.2 with Thrust 1.3.0). I was trying to measure Cusp's performance on sparse matrix-vector (SpMV) multiplications. This is more or less the code:

      // initialize matrix on the host
      cusp::csr_matrix<int, float, cusp::host_memory> matrix_mm(N, N, NNZ);  // allocate memory
      cusp::io::read_matrix_market_file(matrix_mm, argv[1]);                 // load matrix

      // transfer to the device + matrix format conversions
      cusp::csr_matrix<int, float, cusp::device_memory> d_matrix_csr = matrix_mm;
      cusp::coo_matrix<int, float, cusp::device_memory> d_matrix_coo = matrix_mm;
      cusp::hyb_matrix<int, float, cusp::device_memory> d_matrix_hyb = matrix_mm;

      // compute y = A * x repeatedly and time it (iteraciones = number of iterations)
      gettimeofday (&start, NULL);
      for (i = 0; i < iteraciones; i++)
        cusp::multiply(d_matrix_csr, d_x, d_y);
      cudaThreadSynchronize();   // wait for the queued kernels before reading the clock
      gettimeofday (&end, NULL);

      // tiempo_csr = average time per SpMV in microseconds
      tiempo_csr = ((end.tv_sec - start.tv_sec)*1000000 + (end.tv_usec - start.tv_usec)) / iteraciones;

      // num_filas_no_vacias = number of non-empty rows, so FLOPs per SpMV = 2*NNZ - non-empty rows
      printf ("CSR: %.2f GFLOPs\n", (2.*NNZ - num_filas_no_vacias) / (tiempo_csr*1e3));

And so on for the rest of the formats.
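
For instance, the HYB case uses the same timing loop, just with the HYB matrix and its own timer variable (tiempo_hyb below is only an illustrative name, not from my actual code):

      // same measurement pattern, applied to the HYB copy of the matrix
      gettimeofday (&start, NULL);
      for (i = 0; i < iteraciones; i++)
        cusp::multiply(d_matrix_hyb, d_x, d_y);   // y = A * x with the HYB format
      cudaThreadSynchronize();                    // drain the kernel queue before stopping the clock
      gettimeofday (&end, NULL);
      tiempo_hyb = ((end.tv_sec - start.tv_sec)*1000000 + (end.tv_usec - start.tv_usec)) / iteraciones;
      printf ("HYB: %.2f GFLOPs\n", (2.*NNZ - num_filas_no_vacias) / (tiempo_hyb*1e3));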

I can't tell you the exact performance loss right now, as I've gone back to version 3.0, but it was quite serious, maybe around 30% or so. I tried both running the binary previously compiled with 3.0 and recompiling with 3.2, and the results were similar in both cases.
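
In case it helps reproduce this, a quick standalone check like the one below (just a sketch, not part of my benchmark) can confirm which runtime and driver versions a binary actually ends up using when both toolkits are installed:

      #include <cstdio>
      #include <cuda_runtime.h>

      int main()
      {
          int runtimeVersion = 0, driverVersion = 0;
          cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 3000 for CUDA 3.0, 3020 for 3.2
          cudaDriverGetVersion(&driverVersion);    // highest CUDA version the installed driver supports
          printf("runtime %d, driver %d\n", runtimeVersion, driverVersion);
          return 0;
      }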
