CUBLAS calling sgemm() multiple times - Change in CPU time


I’ve created a for loop such that inside the loop, memory is allocated/transferred, sgemm is called, and then memory is freed. At each iteration of the loop, different values of input matrices A and B are used (random values). I noticed that after a couple of iterations (the number depends on the size of the matrix), the CPU time of sgemm decreases to about 20 to 30% of its original time.

Why is this happening given that the two input matrices A and B are completely different (I assume so but haven’t checked the value) at each iteration?

Can you post a repro case, because it is impossible to say anything without some code.

Here it is.

[codebox] for (j = 0; j < total_iter; j++) {

status = cublasInit();

    if (status != CUBLAS_STATUS_SUCCESS)


            fprintf(stderr, "!!! CUBLAS initialization error\n");

            return EXIT_FAILURE;


/* Allocate host memory for matrices */

    h_A = (float*) malloc(n2 * sizeof(h_A[0]));

    h_B = (float*) malloc(n2 * sizeof(h_B[0]));

    h_C = (float*) malloc(n2 * sizeof(h_C[0]));

/* Fill the matrices with test data */

    for (i = 0; i < n2; i++)


            h_A[i] = rand() / (float)RAND_MAX;

            h_B[i] = rand() / (float)RAND_MAX;


/* Allocate device memory for the matrices */

    float p1 = clock();

    cublasAlloc(n2, sizeof(d_A[0]), (void**)&d_A);

    cublasAlloc(n2, sizeof(d_B[0]), (void**)&d_B);

    cublasAlloc(n2, sizeof(d_C[0]), (void**)&d_C);


    float p2 = clock();

    printf("total 1 = %lf\n", (p2 - p1) / CLOCKS_PER_SEC);

/* Copy A and B from host to device*/

    float p3 = clock();

    cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1); 

    cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1); 


    float p4 = clock();

    printf("total 2 = %lf\n", (p4 - p3) / CLOCKS_PER_SEC);

/* Perform matrix multiplication using sgemm */

    float p5 = clock();

    cublasSgemm('n', 'n', M, M, N, alpha, d_A, M, d_B, N, beta, d_C, M); 


    float p6 = clock();

    printf("total 3 = %lf\n", (p6 - p5) / CLOCKS_PER_SEC);

/* Allocate host memory for reading back the result from device memory */

    float p7 = clock();

    h_C = (float*) malloc (n2 * sizeof (h_C[0]));


    float p8 = clock();

    printf("total 4 = %lf\n", (p8 - p7) / CLOCKS_PER_SEC);

/* Read the result back */

    float p9 = clock();

    cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);


    float p10 = clock();

    printf("total 5 = %lf\n", (p10 - p9) / CLOCKS_PER_SEC);

/* memory cleanup */

    free (h_A);

    free (h_B);

    free (h_C);






Notice the difference in total 3 time

total 1 = 0.000400

total 2 = 0.058846

total 3 = 0.542948

total 4 = 0.000026

total 5 = 0.019047

total 1 = 0.000381

total 2 = 0.058186

total 3 = 0.398009

total 4 = 0.000018

total 5 = 0.019063

total 1 = 0.000344

total 2 = 0.059284

total 3 = 0.398278

total 4 = 0.000017

total 5 = 0.019045

total 1 = 0.000348

total 2 = 0.058342

total 3 = 0.395848

total 4 = 0.000015

total 5 = 0.019466

There are a lot of things that could be or probably are going wrong in that code.

  1. Try only calling cublasInit once, outside the loop. You are putting context establishment and warmup overheads in your timing, which you don’t want
  2. Unless I am mistaken, that code has a huge host side memory leak. You allocate h_C twice per iteration and only free it once. I don’t know what size matrices you are using, but if they are reasonable size, you could be generating a lot of host side paging and swapping at different times during the run as free memory leaks away
  3. Use cudaThreadSynchronize sparingly. You only need it around non-blocking cublas function calls (the blas functions themselves). Copies and device memory allocations do not require it. Host memory allocation most certainly doesn’t.
  4. What is the accuracy and repeatability of the clock() call relative to the expected cublas execution time? For short kernels, using cudaEvents for timing gives roughly 0.5 microsecond accurate device side timers, which are far better than most host sources for timing short duration device side operations.

Benchmarking is a surprisingly exacting programming exercise. This is a lot to look out for if you are going to get it right…

i have a similar problem… I’d like to call matrixMul multiple times:

void matrixMul( float* A, float* B, float* C, int m, int k, int n, const float alpha, const float betha){
	//C=al*op(A)op(B) + bet*C,	  A: m x k,		B: k x n,	 C: m x n

	cudaError_t cudaStatus; 
	cublasStatus_t status; 
	cublasHandle_t handle;	

	const float al = alpha; 
	const float bet = betha; 

        float* d_a;
        float* d_b;
        float* d_c;

	cudaStatus = cudaMalloc( (void**) & d_a, sizeof(float) * m * k );
	cudaStatus = cudaMalloc( (void**) & d_b, sizeof(float) * k * n );
	cudaStatus = cudaMalloc( (void**) & d_c, sizeof(float) * m * n );

	status = cublasCreate(&handle);

	status = cublasSetMatrix( m, k, sizeof(float), A, m, d_a, m );		
	status = cublasSetMatrix( k, n, sizeof(float), B, k, d_b, k );

        status = cublasSgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &al, d_a, m, d_b, k, &bet, d_c, m );

	status = cublasGetMatrix( m, n, sizeof(float), d_c, m, C, m);


What do I have to change so that the calculations are correct each time?