I have a code that evaluates Nth degree polynominal. The major work is performed in a kernel that performs sparse matrix-vector multiplication (SpMV). After having implemented the GPU version of the algorithm, which is 20-40x faster than the CPU counterpart, I tried to improve it using streams. Suprisingly, this led to a dramatical performance drop, this version is as much slow as the CPU one. I tried nvprof to see what is going on, and here I saw something strange that is my actual question here.

The profiling result for the version with 2 streams:

==12188== NVPROF is profiling process 12188, command: Release/SpMV_streams.2
     80.5 MBytes, speed up: 24.24x
==12188== Profiling application: Release/SpMV_streams.2
==12188== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 96.66%  68.614ms        76  902.82us  87.353us  8.4707ms  [CUDA memcpy HtoD]
  2.69%  1.9123ms        64  29.880us  29.403us  31.037us  incVec(float*, float*, unsigned long)
  0.42%  294.60us         2  147.30us  147.28us  147.31us  copyVecFromTex(float*, unsigned long, unsigned long)
  0.23%  163.47us         2  81.737us  81.657us  81.817us  [CUDA memcpy DtoH]

==12188== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 55.41%  115.42ms        10  11.542ms  5.7960us  114.22ms  cudaMalloc
 31.50%  65.611ms         8  8.2013ms  7.4460ms  8.9504ms  cudaMemcpyAsync
 10.09%  21.023ms        70  300.33us  206.03us  535.68us  cudaMemcpy

The identical code using 4 streams:

==12397== NVPROF is profiling process 12397, command: Release/SpMV_streams.4
     80.5 MBytes, speed up: 1.92x
==12397== Profiling application: Release/SpMV_streams.4
==12397== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 92.93%  854.13ms       256  3.3364ms  3.3184ms  3.3414ms  SpMV(float*, float*, int*, int*, float*, unsigned long)
  6.82%  62.646ms        84  745.79us  87.129us  4.0884ms  [CUDA memcpy HtoD]
  0.21%  1.9119ms        64  29.873us  29.503us  30.548us  incVec(float*, float*, unsigned long)
  0.03%  296.53us         2  148.26us  147.15us  149.38us  copyVecFromTex(float*, unsigned long, unsigned long)
  0.02%  163.83us         2  81.913us  81.849us  81.977us  [CUDA memcpy DtoH]

==12397== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 81.43%  846.22ms        70  12.089ms  184.80us  413.98ms  cudaMemcpy
 11.87%  123.33ms        10  12.333ms  5.7970us  122.19ms  cudaMalloc
  5.81%  60.329ms        16  3.7706ms  3.3210ms  4.4120ms  cudaMemcpyAsync

The remarkable thing here is that I cannot see profiling statistics for the SpMV kernel in the first case (2 streams)!!!
Like it was never called…

I also cannot see calls statistics for the version with no streams. Correctness of the programs is confirmed by tests, so the kernel is definitely called.

So, my question is: WTH is this? Maybe, some optimizations of nvcc? Or nvprof simply does not capture something?

Does your code check the status return of every CUDA API call, and the status of every kernel launch? If not, I suspect that what is happening that kernels do not get launched due to an uncaught error of one kind or other. What happens if you run the code under control of cuda-memcheck?

If you run the “working” case just before the “failing” case, the test framework may erroneously believe that the correct result was produced since it simply copies back GPU data previously deposited by the preceding “good” run. To avoid this effect, test frameworks should invalidate the data in all allocated result data structure, for example by using cudaMemset() with fill of 0x00 or 0xff (where the latter corresponds to a floating-point NaN pattern).

After adding checks of the last CUDA error, I found that for the 2-streams case kernel was no launched at all. Thank you for the hint.