cudaFree time linearly depends on cublas call

RDMiles · March 26, 2013, 3:41pm

I’ve been trying to get familiar with CUDA and cuBLAS for my application, so I added some timing to a test program and found that the time it takes to execute cudaFree() depends significantly on the number of preceding calls to the cublasDgemm() routine. The relevant sections of my code are:

//... allocation and initialization of host memory is omitted ...
    cudaMallocPitch((void **)&d_A, &pchSzA, msz, p);
    cudaMallocPitch((void **)&d_B, &pchSzB, psz, n);
    cudaMallocPitch((void **)&d_C, &pchSzC, msz, n);
    pchNumA=pchSzA/sizeof(double);
    pchNumB=pchSzB/sizeof(double);
    pchNumC=pchSzC/sizeof(double);

    cudaMemcpy2D(d_A,pchSzA,A,msz,msz,p,cudaMemcpyHostToDevice);
    cudaMemcpy2D(d_B,pchSzB,B,psz,psz,n,cudaMemcpyHostToDevice);

    for (r = 0; r < LOOP_COUNT; r++) {
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, p, &alpha, d_A, pchNumA, d_B, pchNumB, &beta, d_C, pchNumC);
    }

    s_initial = std::chrono::high_resolution_clock::now();
    if (cudaFree(d_A) != cudaSuccess)
    {
        fprintf(stderr, "!!!! memory free error (A): %d",cudaGetLastError());
    }
    s_final = std::chrono::high_resolution_clock::now();
    s_elapsed = (double)std::chrono::duration_cast<std::chrono::milliseconds>
                            (s_final-s_initial).count();
    printf (" == d_A freed, elapsed time is %.5f milliseconds == ", (s_elapsed));
    s_initial = dsecnd();

Similar cudaFree() calls follow for d_B and d_C. (I included additional error trapping, but I left it out here to condense the code.)

I get these results:
LOOP_COUNT=1: cudaFree time = 38 milliseconds
LOOP_COUNT=100: cudaFree time = 1864 milliseconds
LOOP_COUNT=10000: cudaFree time = 106580 milliseconds

Has anyone else seen this behavior?
I am using CUDA 5.0 with an NVS 4200M chip.

mfatica · March 26, 2013, 5:55pm

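CUDA kernel launches are asynchronous: cublasDgemm returns to the host before the GPU has finished the work.
You are measuring the DGEMM execution time, not the cudaFree time. If you want to measure the cudaFree time, you should call cudaDeviceSynchronize() before taking s_initial.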
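
A minimal sketch of this advice, assuming a small helper named timeAfterDrain (hypothetical, not part of CUDA or cuBLAS). It first runs a caller-supplied "drain" step, which in the CUDA case would be cudaDeviceSynchronize(), and only then starts the clock around the operation being measured. It is written as plain C++ so the pattern is visible without a GPU; the CUDA calls appear only in comments.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (an assumption for illustration, not a CUDA API):
// runs `drain` so that pending asynchronous work is finished before the
// clock starts, then times `op` alone and returns the elapsed milliseconds.
template <typename Drain, typename Op>
double timeAfterDrain(Drain&& drain, Op&& op) {
    std::forward<Drain>(drain)();   // e.g. cudaDeviceSynchronize()
    const auto t0 = std::chrono::steady_clock::now();
    std::forward<Op>(op)();         // e.g. cudaFree(d_A)
    const auto t1 = std::chrono::steady_clock::now();
    // Fractional milliseconds, not truncated to a whole number.
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

With <cuda_runtime.h> included, the timing in the question might become timeAfterDrain([] { cudaDeviceSynchronize(); }, [&] { cudaFree(d_A); }). Returning a std::chrono::duration<double, std::milli> also avoids the truncation to whole milliseconds that duration_cast<std::chrono::milliseconds> performs.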

RDMiles · March 26, 2013, 6:30pm

Of course! Thanks!

Glupol · March 26, 2013, 9:01pm

Just to clarify.
cudaFree synchronizes on the 0th (default) stream, which makes your CPU thread wait until all scheduled jobs on the GPU are finished.
G.
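
The effect described here can be imitated on the host without a GPU. In the sketch below, std::async stands in for the queued DGEMM kernels and blockingFree stands in for cudaFree. All function names are hypothetical, and this is only an analogy for the observed behavior, not a description of how the CUDA runtime is implemented.

```cpp
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed since t0, as a fractional value.
static double msSince(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

// Stand-in for launching kernels: returns at once, the work runs in the background.
std::future<void> launchAsyncWork(int workMs) {
    return std::async(std::launch::async, [workMs] {
        std::this_thread::sleep_for(std::chrono::milliseconds(workMs));
    });
}

// Stand-in for cudaFree: cheap in itself, but blocks until pending work is done.
void blockingFree(std::future<void>& pending) {
    if (pending.valid()) pending.wait();
}

// Times blockingFree. With syncFirst set, the pending work is drained before
// the clock starts (the analogue of calling cudaDeviceSynchronize first).
double timeFree(int workMs, bool syncFirst) {
    std::future<void> pending = launchAsyncWork(workMs);
    if (syncFirst) pending.wait();
    const auto t0 = Clock::now();
    blockingFree(pending);
    return msSince(t0);
}
```

Calling timeFree(200, false) should report roughly the 200 ms of background work, while timeFree(200, true) should report a value close to zero. That mirrors the difference between the timings in the question and timings taken after cudaDeviceSynchronize().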
