I have code that evaluates an Nth-degree polynomial. Most of the work is done in a kernel that performs sparse matrix-vector multiplication (SpMV). After implementing the GPU version of the algorithm, which is 20-40x faster than its CPU counterpart, I tried to improve it using streams. Surprisingly, this led to a dramatic performance drop: the streamed version is about as slow as the CPU one. I ran nvprof to see what is going on, and what I saw there is strange and is my actual question.
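For context, here is a rough CPU-side sketch of the kind of computation described above. It is not the actual code: it assumes the matrix is square and stored in CSR format, and that the polynomial is evaluated with Horner's scheme. The names CsrMatrix, spmv and evalPolynomial are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CSR container; the field names are illustrative assumptions.
struct CsrMatrix {
    std::size_t rows;           // number of rows (matrix assumed square)
    std::vector<int> rowPtr;    // size rows + 1, offsets into colIdx/vals
    std::vector<int> colIdx;    // column index of each stored value
    std::vector<float> vals;    // stored nonzero values
};

// y = A * x for a CSR matrix (the CPU analogue of an SpMV kernel).
std::vector<float> spmv(const CsrMatrix& A, const std::vector<float>& x) {
    std::vector<float> y(A.rows, 0.0f);
    for (std::size_t r = 0; r < A.rows; ++r) {
        float sum = 0.0f;
        for (int k = A.rowPtr[r]; k < A.rowPtr[r + 1]; ++k) {
            sum += A.vals[k] * x[A.colIdx[k]];
        }
        y[r] = sum;
    }
    return y;
}

// Computes p(A) * x, where p(t) = coeffs[0] + coeffs[1]*t + ... + coeffs[N]*t^N.
// Horner's scheme: one SpMV and one vector update per coefficient.
std::vector<float> evalPolynomial(const CsrMatrix& A,
                                  const std::vector<float>& coeffs,
                                  const std::vector<float>& x) {
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t i = coeffs.size(); i-- > 0;) {
        y = spmv(A, y);                      // y <- A * y
        for (std::size_t j = 0; j < y.size(); ++j) {
            y[j] += coeffs[i] * x[j];        // y <- y + c_i * x
        }
    }
    return y;
}
```

In a scheme like this the SpMV step runs once per coefficient, which is why it would be expected to dominate the kernel time in the profile.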
The profiling result for the version with 2 streams:
==12188== NVPROF is profiling process 12188, command: Release/SpMV_streams.2
80.5 MBytes, speed up: 24.24x
==12188== Profiling application: Release/SpMV_streams.2
==12188== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 96.66%  68.614ms     76  902.82us  87.353us  8.4707ms  [CUDA memcpy HtoD]
  2.69%  1.9123ms     64  29.880us  29.403us  31.037us  incVec(float*, float*, unsigned long)
  0.42%  294.60us      2  147.30us  147.28us  147.31us  copyVecFromTex(float*, unsigned long, unsigned long)
  0.23%  163.47us      2  81.737us  81.657us  81.817us  [CUDA memcpy DtoH]
==12188== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 55.41%  115.42ms     10  11.542ms  5.7960us  114.22ms  cudaMalloc
 31.50%  65.611ms      8  8.2013ms  7.4460ms  8.9504ms  cudaMemcpyAsync
 10.09%  21.023ms     70  300.33us  206.03us  535.68us  cudaMemcpy
...
The identical code using 4 streams:
==12397== NVPROF is profiling process 12397, command: Release/SpMV_streams.4
80.5 MBytes, speed up: 1.92x
==12397== Profiling application: Release/SpMV_streams.4
==12397== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 92.93%  854.13ms    256  3.3364ms  3.3184ms  3.3414ms  SpMV(float*, float*, int*, int*, float*, unsigned long)
  6.82%  62.646ms     84  745.79us  87.129us  4.0884ms  [CUDA memcpy HtoD]
  0.21%  1.9119ms     64  29.873us  29.503us  30.548us  incVec(float*, float*, unsigned long)
  0.03%  296.53us      2  148.26us  147.15us  149.38us  copyVecFromTex(float*, unsigned long, unsigned long)
  0.02%  163.83us      2  81.913us  81.849us  81.977us  [CUDA memcpy DtoH]
==12397== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 81.43%  846.22ms     70  12.089ms  184.80us  413.98ms  cudaMemcpy
 11.87%  123.33ms     10  12.333ms  5.7970us  122.19ms  cudaMalloc
  5.81%  60.329ms     16  3.7706ms  3.3210ms  4.4120ms  cudaMemcpyAsync
...
The remarkable thing is that there are no profiling statistics for the SpMV kernel in the first case (2 streams), as if it were never called.
The kernel's statistics are also missing for the version with no streams. Tests confirm that the programs produce correct results, so the kernel is definitely being called.
So my question is: what is going on here? Is nvcc optimizing something away, or does nvprof simply fail to capture some kernel launches?