cusparse gtsv_nopivot overhead from cudaFree and cudaMalloc in each call


I’m trying to solve the 2D diffusion equation with an alternating-direction implicit (ADI) scheme, which requires solving tridiagonal linear systems many times.

To solve the tridiagonal systems, I call cusparse gtsv_nopivot with multiple right-hand sides.
The matrix size is 200x200, with 200 different right-hand sides.
I solve this problem twice per time step, once in the x direction and once in the y direction.

My concern was that performance would be bounded by data transfers between host and device, but the nvprof output suggests it is actually bounded by the allocation and deallocation of temporary workspace inside gtsv_nopivot (if I understand it correctly):

==1539== Profiling result:
Time(%) Time Calls Avg Min Max Name
57.06% 5.64601s 39600 142.58us 134.37us 149.77us void pcrGlobalMemKernel_manyRhs(pcrGlobalMemParams_t)
28.50% 2.82014s 6600 427.29us 340.49us 565.30us void pcrLastStageKernel_manyRhs(pcrLastStageParams_t)
7.70% 761.87ms 6600 115.44us 112.32us 161.54us void pcrGlobalMemKernelFirstPass_manyRhs(pcrGlobalMemFirstPassParams_t)
3.41% 336.97ms 16500 20.422us 928ns 83.075us [CUDA memcpy HtoD]
3.33% 329.14ms 6600 49.869us 49.473us 92.227us [CUDA memcpy DtoH]

==1539== API calls:
Time(%) Time Calls Avg Min Max Name
76.33% 10.3866s 6609 1.5716ms 126ns 237.89ms cudaFree
13.53% 1.84118s 6604 278.80us 5.4580us 1.5086ms cudaMalloc
7.56% 1.02925s 23100 44.556us 3.4550us 2.1669ms cudaMemcpy
2.36% 320.84ms 52800 6.0760us 3.0790us 337.34us cudaLaunch

Is it somehow possible to preallocate memory for the gtsv solver, so that it reuses the same buffer in each call? I’m afraid there is no way to do this through the high-level cusparse solver interface.
Or are there any other ideas to overcome this issue?

Thanks in advance!