I’m having performance issues with the cuSPARSE routine cusparseDcsrgeam.
After switching from CUDA 8.0 to CUDA 9.2, cusparseDcsrgeam is about 8 times slower, using the exact same input. I have no idea why, so I’m hoping to get some help and find out if this is due to recent changes in the library or if there’s anything I can do on my side to work around the problem.
The matrices I’m adding together are 2,500,000 × 2,500,000 or 7,560,000 × 2,500,000 and have a maximum of 17,380,000 non-zero elements. Their size difference doesn’t seem to affect the runtime much.
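For reference, my call sequence is roughly the usual two-phase csrgeam pattern (this is a trimmed sketch, not my exact code: error checking is omitted and the `d_*` pointer names are placeholders for my actual device arrays):

```cpp
#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch: C = alpha*A + beta*B for CSR matrices, using the pre-10.1
// csrgeam API. All d_* pointers are device arrays; the output arrays
// for C are allocated here once nnzC is known.
void addCsr(cusparseHandle_t handle, int m, int n,
            double alpha, cusparseMatDescr_t descrA, int nnzA,
            const double *d_csrValA, const int *d_csrRowPtrA, const int *d_csrColIndA,
            double beta, cusparseMatDescr_t descrB, int nnzB,
            const double *d_csrValB, const int *d_csrRowPtrB, const int *d_csrColIndB,
            cusparseMatDescr_t descrC,
            double **d_csrValC, int **d_csrRowPtrC, int **d_csrColIndC, int *nnzC)
{
    cudaMalloc((void **)d_csrRowPtrC, (m + 1) * sizeof(int));

    // Phase 1: compute the row pointer and total nnz of the result.
    cusparseXcsrgeamNnz(handle, m, n,
                        descrA, nnzA, d_csrRowPtrA, d_csrColIndA,
                        descrB, nnzB, d_csrRowPtrB, d_csrColIndB,
                        descrC, *d_csrRowPtrC, nnzC);

    cudaMalloc((void **)d_csrColIndC, *nnzC * sizeof(int));
    cudaMalloc((void **)d_csrValC, *nnzC * sizeof(double));

    // Phase 2: the numeric addition -- this is the call that launches
    // the csrgeam_windowBased_core kernel visible in nvprof.
    cusparseDcsrgeam(handle, m, n,
                     &alpha, descrA, nnzA, d_csrValA, d_csrRowPtrA, d_csrColIndA,
                     &beta,  descrB, nnzB, d_csrValB, d_csrRowPtrB, d_csrColIndB,
                     descrC, *d_csrValC, *d_csrRowPtrC, *d_csrColIndC);
}
```

The slowdown shows up entirely in the phase-2 kernel; the setup around it is identical for both CUDA versions.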
You can see the runtimes of the csrgeam_windowBased_core kernel measured with nvprof below:
CUDA 9.2:

Time(%)      Time  Calls       Avg       Min       Max  Name
 28.97%  5.17855s    120  43.155ms  36.771ms  49.312ms  void csrgeam_windowBased_core<double, bool=0>(cusparseCsrgeamParams<double>)
 21.62%  3.86527s    120  32.211ms  27.838ms  39.385ms  void csrgeam_windowBased_core<float, bool=1>(cusparseCsrgeamParams<float>)

CUDA 8.0:

Time(%)      Time  Calls       Avg       Min       Max  Name
  6.90%  756.84ms    120  6.3070ms  4.9622ms  8.5097ms  void csrgeam_windowBased_core<double, bool=0>(cusparseCsrgeamParams<double>)
  3.88%  425.22ms    120  3.5435ms  2.7723ms  4.7569ms  void csrgeam_windowBased_core<float, bool=1>(cusparseCsrgeamParams<float>)
Analyzing with the Visual Profiler also shows that, when using CUDA 9.2, the kernels are launched with a grid size of (1890000, 1, 1) and a block size of (32, 4, 1). It also reports that the kernels are limited by shared memory bandwidth, with 429,329,352 transactions for csrgeam_windowBased_core<double, bool=0> alone. None of the other memory types show any issues; their utilization is between Idle and Low.
The first thing I notice when I run the same analysis on the CUDA 8.0 version is that the grid size is exactly one eighth, at (236250, 1, 1), and the block size has the same dimensions with x and y swapped: (4, 32, 1). Also, device memory utilization is at Medium, whereas for CUDA 9.2 it was between Idle and Low.
All of this is running on a GTX 1060 in a laptop, with Ubuntu 16.04. I compile with the arch option set to sm_61. For additional info, here’s the deviceQuery output:
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6078 MBytes (6373572608 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS
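And for completeness, the build and profiling commands I mentioned above look essentially like this (the source file name is a placeholder):

```shell
# Build with compute capability 6.1 (GTX 1060) and link against cuSPARSE
nvcc -O3 -arch=sm_61 -o csrgeam_test csrgeam_test.cu -lcusparse

# Collect the kernel timings shown above
nvprof ./csrgeam_test
```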