cusparse gtsv_nopivot overhead from cudaFree and cudaMalloc in each call

Hello,

I’m trying to solve the 2D diffusion equation with an alternating-direction implicit (ADI) scheme, which requires solving tridiagonal linear systems many times.
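For context, and assuming a standard Peaceman–Rachford-type half step on a uniform grid with $\Delta x = \Delta y$ and $r = \alpha\,\Delta t / (2\,\Delta x^2)$ (my own notation, not from my exact code), the x-sweep looks like

$-r\,u^{*}_{i-1,j} + (1+2r)\,u^{*}_{i,j} - r\,u^{*}_{i+1,j} = r\,u^{n}_{i,j-1} + (1-2r)\,u^{n}_{i,j} + r\,u^{n}_{i,j+1}$

which is tridiagonal in i for every fixed j, so each sweep is one tridiagonal solve with as many right-hand sides as there are grid lines in the other direction.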

To solve the tridiagonal systems, I call cusparse gtsv_nopivot with multiple right-hand sides.
Each matrix is 200x200, with 200 different right-hand sides.
I solve this problem twice per time step, once for the x direction and once for the y direction.
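For reference, one sweep looks roughly like the sketch below (a minimal sketch, not my exact code: it assumes double precision, column-major B, and variable names of my own choosing, with all arrays already filled on the device):

#include <cusparse_v2.h>

// One ADI sweep: 200x200 system, 200 right-hand sides.
// d_dl, d_d, d_du: lower/main/upper diagonals (length m each),
// d_B: m x n right-hand sides, overwritten in place with the solution.
void solve_sweep(cusparseHandle_t handle,
                 const double *d_dl, const double *d_d, const double *d_du,
                 double *d_B)
{
    const int m = 200, n = 200, ldb = 200;
    cusparseDgtsv_nopivot(handle, m, n, d_dl, d_d, d_du, d_B, ldb);
}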

My concern was that performance would be bound by data transfers between the host and the device, but the nvprof output suggests (if I understand it correctly) that it is actually bound by the allocation and deallocation of temporary workspace memory inside gtsv_nopivot:

==1539== Profiling result:
Time(%) Time Calls Avg Min Max Name
57.06% 5.64601s 39600 142.58us 134.37us 149.77us void pcrGlobalMemKernel_manyRhs(pcrGlobalMemParams_t)
28.50% 2.82014s 6600 427.29us 340.49us 565.30us void pcrLastStageKernel_manyRhs(pcrLastStageParams_t)
7.70% 761.87ms 6600 115.44us 112.32us 161.54us void pcrGlobalMemKernelFirstPass_manyRhs(pcrGlobalMemFirstPassParams_t)
3.41% 336.97ms 16500 20.422us 928ns 83.075us [CUDA memcpy HtoD]
3.33% 329.14ms 6600 49.869us 49.473us 92.227us [CUDA memcpy DtoH]

==1539== API calls:
Time(%) Time Calls Avg Min Max Name
76.33% 10.3866s 6609 1.5716ms 126ns 237.89ms cudaFree
13.53% 1.84118s 6604 278.80us 5.4580us 1.5086ms cudaMalloc
7.56% 1.02925s 23100 44.556us 3.4550us 2.1669ms cudaMemcpy
2.36% 320.84ms 52800 6.0760us 3.0790us 337.34us cudaLaunch
<…>

Is it somehow possible to preallocate memory for the gtsv solver, so that it reuses the same workspace in every call? I’m afraid there is no way to do this through the high-level cuSPARSE solver interface?
Or does anyone have other ideas for getting around this overhead?
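To make the ask concrete, below is a sketch of the kind of pattern I am hoping for. The two gtsvNopivot_* functions are hypothetical, written by me only to illustrate a query-once / allocate-once / reuse workflow, not an API I know to exist in cuSPARSE:

#include <cuda_runtime.h>
#include <cusparse_v2.h>

// Hypothetical prototypes, purely illustrative of the interface I am asking about.
cusparseStatus_t gtsvNopivot_bufferSize(cusparseHandle_t handle, int m, int n,
                                        size_t *bufferSizeInBytes);          // hypothetical
cusparseStatus_t gtsvNopivot_withBuffer(cusparseHandle_t handle, int m, int n,
                                        const double *dl, const double *d,
                                        const double *du, double *B, int ldb,
                                        void *pBuffer);                      // hypothetical

void adi_time_loop(cusparseHandle_t handle, int nSteps,
                   const double *d_dl, const double *d_d, const double *d_du,
                   double *d_B)
{
    const int m = 200, n = 200, ldb = 200;

    size_t bufferSize = 0;
    gtsvNopivot_bufferSize(handle, m, n, &bufferSize);   // query workspace size once

    void *d_workspace = NULL;
    cudaMalloc(&d_workspace, bufferSize);                // allocate once up front

    for (int step = 0; step < nSteps; ++step) {
        // x and y sweeps would both pass the same preallocated workspace
        // (the real code rebuilds the right-hand sides between sweeps)
        gtsvNopivot_withBuffer(handle, m, n, d_dl, d_d, d_du, d_B, ldb, d_workspace);
    }

    cudaFree(d_workspace);                               // free once at the end
}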

Thanks in advance!
Dmitry