Hi,
I’m having performance issues with the cuSPARSE routine cusparseDcsrgeam.
After switching from CUDA 8.0 to CUDA 9.2, cusparseDcsrgeam is about 8 times slower, using the exact same input. I have no idea why, so I’m hoping to get some help and find out if this is due to recent changes in the library or if there’s anything I can do on my side to work around the problem.
The matrices I’m adding together are 2,500,000 × 2,500,000 or 7,560,000 × 2,500,000 and have at most 17,380,000 non-zero elements. Their size difference doesn’t seem to affect the runtime much.
You can see the runtimes of the csrgeam_windowBased_core kernels measured with nvprof below:
CUDA 9.2
Time(%) Time Calls Avg Min Max Name
28.97% 5.17855s 120 43.155ms 36.771ms 49.312ms void csrgeam_windowBased_core<double, bool=0>(cusparseCsrgeamParams<double>)
21.62% 3.86527s 120 32.211ms 27.838ms 39.385ms void csrgeam_windowBased_core<float, bool=1>(cusparseCsrgeamParams<float>)
CUDA 8.0
6.90% 756.84ms 120 6.3070ms 4.9622ms 8.5097ms void csrgeam_windowBased_core<double, bool=0>(cusparseCsrgeamParams<double>)
3.88% 425.22ms 120 3.5435ms 2.7723ms 4.7569ms void csrgeam_windowBased_core<float, bool=1>(cusparseCsrgeamParams<float>)
Analyzing with the visual profiler also shows that when using CUDA 9.2, the kernels are launched with a grid size of (1890000, 1, 1) and a block size of (32, 4, 1). It also says that the kernels are limited by shared memory bandwidth, with 429329352 transactions for csrgeam_windowBased_core<double, bool=0> alone. None of the other memory types show any issues, and their utilization is between Idle and Low.
The first thing I notice when I run the same analysis on the CUDA 8.0 version is that the grid size is exactly one eighth, at (236250, 1, 1), and the block size has the same dimensions with x and y swapped: (4, 32, 1). Also, device memory utilization is at Medium, whereas for CUDA 9.2 it was between Idle and Low.
All of this is running on a GTX 1060 in a laptop, with Ubuntu 16.04. I compile with the arch option set to sm_61. For additional info, here’s the deviceQuery output:
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 6078 MBytes (6373572608 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1671 MHz (1.67 GHz)
Memory Clock rate: 4004 MHz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS
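For completeness, the build step is essentially the following (the source and output file names are placeholders, not my actual file names):

```shell
# Target the GTX 1060 (compute capability 6.1) and link against cuSPARSE.
nvcc -arch=sm_61 -O2 main.cu -o geam_test -lcusparse
```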
Please help!
Best Regards