I have code that evaluates an Nth-degree polynomial. The major work is performed in a kernel that does sparse matrix-vector multiplication (SpMV). After implementing the GPU version of the algorithm, which is 20-40x faster than the CPU counterpart, I tried to improve it further using streams. Surprisingly, this led to a dramatic performance drop: the streamed version is about as slow as the CPU one. I ran nvprof to see what was going on, and there I saw something strange, which is my actual question here.
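For context, the streamed version follows the usual chunked pattern: asynchronously copy a slice of the matrix to the device on one stream while a kernel for the previous slice runs on another. A simplified sketch of that pattern (not my actual code; the SpMV signature mirrors the one in the profile below, but the kernel body, chunking, and names are illustrative):

```cuda
#include <cuda_runtime.h>

// Standard CSR row-per-thread SpMV; the parameter list mirrors the
// profiled signature SpMV(float*, float*, int*, int*, float*, unsigned long),
// but the exact meaning of each argument here is an assumption.
__global__ void SpMV(float *val, float *x, int *colIdx, int *rowPtr,
                     float *y, unsigned long nRows)
{
    unsigned long row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    float sum = 0.0f;
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        sum += val[j] * x[colIdx[j]];
    y[row] = sum;
}

// Hypothetical launch loop: split rows into one chunk per stream and
// interleave async uploads with kernel launches.
// NOTE: cudaMemcpyAsync only overlaps with kernels when the host buffer
// is pinned (cudaMallocHost); with pageable memory the copy is
// effectively synchronous, which serializes everything.
void spmvStreamed(float *d_val, float *d_x, int *d_colIdx, int *d_rowPtr,
                  float *d_y, const float *h_val_pinned,
                  const int *h_rowPtr, unsigned long nRows, int nStreams)
{
    cudaStream_t streams[4];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    unsigned long rowsPerChunk = (nRows + nStreams - 1) / nStreams;
    for (int i = 0; i < nStreams; ++i) {
        unsigned long r0 = i * rowsPerChunk;
        unsigned long r1 = min(r0 + rowsPerChunk, nRows);
        if (r0 >= r1) break;
        // nnz range for this chunk comes from the CSR row pointers
        int e0 = h_rowPtr[r0], e1 = h_rowPtr[r1];

        cudaMemcpyAsync(d_val + e0, h_val_pinned + e0,
                        (e1 - e0) * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);

        unsigned long chunkRows = r1 - r0;
        dim3 block(256), grid((chunkRows + 255) / 256);
        SpMV<<<grid, block, 0, streams[i]>>>(d_val, d_x, d_colIdx,
                                             d_rowPtr + r0, d_y + r0,
                                             chunkRows);
    }
    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
}
```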

The profiling result for the version with 2 streams:

==12188== NVPROF is profiling process 12188, command: Release/SpMV_streams.2
     80.5 MBytes, speed up: 24.24x
==12188== Profiling application: Release/SpMV_streams.2
==12188== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 96.66%  68.614ms        76  902.82us  87.353us  8.4707ms  [CUDA memcpy HtoD]
  2.69%  1.9123ms        64  29.880us  29.403us  31.037us  incVec(float*, float*, unsigned long)
  0.42%  294.60us         2  147.30us  147.28us  147.31us  copyVecFromTex(float*, unsigned long, unsigned long)
  0.23%  163.47us         2  81.737us  81.657us  81.817us  [CUDA memcpy DtoH]

==12188== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 55.41%  115.42ms        10  11.542ms  5.7960us  114.22ms  cudaMalloc
 31.50%  65.611ms         8  8.2013ms  7.4460ms  8.9504ms  cudaMemcpyAsync
 10.09%  21.023ms        70  300.33us  206.03us  535.68us  cudaMemcpy

And for the version with 4 streams:

==12397== NVPROF is profiling process 12397, command: Release/SpMV_streams.4
     80.5 MBytes, speed up: 1.92x
==12397== Profiling application: Release/SpMV_streams.4
==12397== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 92.93%  854.13ms       256  3.3364ms  3.3184ms  3.3414ms  SpMV(float*, float*, int*, int*, float*, unsigned long)
  6.82%  62.646ms        84  745.79us  87.129us  4.0884ms  [CUDA memcpy HtoD]
  0.21%  1.9119ms        64  29.873us  29.503us  30.548us  incVec(float*, float*, unsigned long)
  0.03%  296.53us         2  148.26us  147.15us  149.38us  copyVecFromTex(float*, unsigned long, unsigned long)
  0.02%  163.83us         2  81.913us  81.849us  81.977us  [CUDA memcpy DtoH]

==12397== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 81.43%  846.22ms        70  12.089ms  184.80us  413.98ms  cudaMemcpy
 11.87%  123.33ms        10  12.333ms  5.7970us  122.19ms  cudaMalloc
  5.81%  60.329ms        16  3.7706ms  3.3210ms  4.4120ms  cudaMemcpyAsync

The remarkable thing here is that there are no profiling statistics for the SpMV kernel at all in the first case (2 streams), as if it was never called.

The kernel statistics are also missing for the version with no streams. The correctness of both programs is confirmed by tests, so the kernel is definitely being called.

So, my question is: what is going on here? Could it be some nvcc optimization? Or does nvprof simply fail to capture something?