Optimizing cuBLAS in kernels

I am developing an algorithm on the GPU that involves some calls to cuBLAS on the device. I am using a Tesla K20c (compute capability 3.5) with CUDA 7.0. The K20c has a theoretical peak performance of 3.52 TFLOPS. I worked out the FLOP count for my algorithm, which consists mainly of matrix-vector multiplication and vector-vector addition, to be 3,900,109. My back-of-the-envelope calculation gives 3900109 / 3.52e12 = 1.108e-6 s, or about 1.1 usec.

I have written the kernel to use clock64() to time a portion of it. I have been calling my kernel through PyCUDA in a for loop during a simulation and reading back the elapsed clock ticks. Over the 400 or so iterations of the loop, the average is 94451 ticks, and 94451 / 706 MHz is ~133 usec. I realize I will never get the theoretical peak performance from my GPU, but I feel like my algorithm should be closer to ~1 usec than ~133 usec. I post my kernel code below; if you have any bright ideas on how to make this faster, please let me know. I am kind of new to GPU computing. Thanks.
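For reference, here is that back-of-the-envelope arithmetic as a small host-side check (the constants are exactly the numbers quoted above):

#include <stdio.h>

int main(void) {
    //numbers quoted above for the Tesla K20c
    double flop_count = 3900109.0; //FLOPs per iteration of the algorithm
    double peak       = 3.52e12;   //theoretical peak, FLOP/s
    double ticks      = 94451.0;   //average elapsed clock64() ticks
    double clock_hz   = 706e6;     //GPU core clock

    printf("compute-bound estimate: %.3f usec\n", flop_count / peak * 1e6); //~1.108
    printf("measured:               %.1f usec\n", ticks / clock_hz * 1e6);  //~133.8
    return 0;
}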

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <helper_cuda.h>

extern "C" {

#define MATRIX_SIZE 1384
#define G_WIDTH 3
#define C_HEIGHT 22
#define BLOCK_SIZE 1384

#define LLCV2_STATUS_TLATCH 6
#define NCHAN 96
#define TLATCH_OFFSET ((NCHAN/2) + LLCV2_STATUS_TLATCH)

#define COUNTSPERVOLT_SCALED ((1. / 30.) * (32767. / 10.))  //converts to counts and then scales by audio amp gain; fully parenthesized so the macro expands safely inside expressions
#define VOLTSPERCOUNT (10. / 32767.)
#define AO32MAX 0x8000                                    //max output of AO32 in counts 0x8000 = 10V

typedef unsigned int u32;
typedef unsigned short u16;
typedef signed short s16;

__global__ void valiant(const float *Phi,      //closed-loop dynamics matrix
                        const float *Ko,       //observer gain
                        const float *Kc,       //controller gain
                        float *xhatp,          //past state estimate
                        float *xhatf,          //future state estimate
                        u32 *d_hb,             //device inbuffer <- ACQ196
                        s16 *d_ao_hb,          //device outbuffer -> AO32
                        float *u,              //system input
                        float *y,              //measurement vector (must be a global buffer:
                                               //pointers to local arrays cannot be passed to
                                               //the child kernels device-side cuBLAS launches)
                        float *tmp,            //temp array holder
                        uint64_t *clk,         //time measurement
                        const u32 tlatch_go,   //start feedback when d_hb[TLATCH_OFFSET] = tlatch_go
                        const u32 tlatch_stop  //stop feedback when d_hb[TLATCH_OFFSET] = tlatch_stop
                        ) {

  //y is the measurement vector, filled below; it is passed in as a global
  //buffer rather than declared locally (see parameter comment above)
  //uint tx = threadIdx.x;
  //s16 ao[3]; //AO32 output
  //set memory locations
  //build measurement vector
  //y[2*tx]  = (float)((s16)d_hb[tx]);  y[2*tx + 1]  = (float)((s16)(d_hb[tx] >> 0x10));

  //unpack two packed s16 samples from each u32 word of the inbuffer
  for (int i = 0; i < C_HEIGHT/2; ++i) {
    y[2*i]     = (float)((s16)d_hb[i]);
    y[2*i + 1] = (float)((s16)(d_hb[i] >> 0x10));
  }

  const float alpha = 1.;
  const float nalpha = -1.;
  const float zero = 0.;
  const float *d_alpha = &alpha;
  const float *neg_alpha = &nalpha;
  const float *d_zero = &zero;
  cublasHandle_t cnpHandle;
  cublasStatus_t status = cublasCreate(&cnpHandle);

  if (status != CUBLAS_STATUS_SUCCESS) {
    return; //bail out if the handle could not be created
  }
  uint64_t start_time = clock64(); //start kernel timer

  //if (tx == 0) { 
    if (d_hb[TLATCH_OFFSET] >=  tlatch_go && d_hb[TLATCH_OFFSET] <= tlatch_stop) {
      //LQG in 3 steps
      //xhatf = Phi*xhatp + Ko*y
      status = cublasSgemv(cnpHandle,
			   CUBLAS_OP_N,
			   MATRIX_SIZE, MATRIX_SIZE,
			   d_alpha,
			   Phi, MATRIX_SIZE,
			   xhatp, 1,
			   d_zero,
			   xhatf, 1);

      status = cublasSgemv(cnpHandle,
			   CUBLAS_OP_N,
			   MATRIX_SIZE, C_HEIGHT,
			   d_alpha,
			   Ko, MATRIX_SIZE,
			   y, 1,
			   d_zero,
			   tmp, 1);

      status = cublasSaxpy(cnpHandle,
			   MATRIX_SIZE,
			   d_alpha,
			   tmp, 1,
			   xhatf, 1);

      //u = -Kc*xhatf
      status = cublasSgemv(cnpHandle,
			   CUBLAS_OP_N,
			   G_WIDTH, MATRIX_SIZE,
			   neg_alpha,
			   Kc, G_WIDTH,
			   xhatf, 1,
			   d_zero,
			   u, 1);
      //do time advancement
      status = cublasScopy(cnpHandle,
			   MATRIX_SIZE,
			   xhatf, 1,
			   xhatp, 1);

      //}
      //__syncthreads();

    //build output vector, check for overloading AO32

    //if (tx <= 2) {
    //ao[tx] = (s16)(u[tx] * COUNTSPERVOLT_SCALED);

    //clamp in float before the narrowing cast: an s16 is always < AO32MAX
    //(0x8000), so the original post-cast check could never fire, and the
    //cast itself would wrap out-of-range values
    float v0 = u[0] * COUNTSPERVOLT_SCALED;
    float v1 = u[1] * COUNTSPERVOLT_SCALED;
    float v2 = u[2] * COUNTSPERVOLT_SCALED;

    //saturate at AO32MAX - 1 = 0x7fff, the largest value an s16 can hold
    d_ao_hb[0] = (s16)(fabsf(v0) < AO32MAX ? v0 : copysignf(AO32MAX - 1, v0));
    d_ao_hb[1] = (s16)(fabsf(v1) < AO32MAX ? v1 : copysignf(AO32MAX - 1, v1));
    d_ao_hb[2] = (s16)(fabsf(v2) < AO32MAX ? v2 : copysignf(AO32MAX - 1, v2));
    }
  uint64_t stop_time = clock64(); //stop kernel timer
  //clock64() is a 64-bit counter, so no wraparound correction is needed:
  //clk[0] = stop_time > start_time ? stop_time - start_time : stop_time + (0xffffffff - start_time);
  clk[0] = stop_time - start_time;
  cublasDestroy(cnpHandle); //release the handle created at the top of the kernel
}

} //closes the extern "C" block opened at the top

Keep in mind that matrix-vector and vector-vector operations are bandwidth-limited, not compute-limited. Try counting up the bytes touched by those operations and dividing the total by the memory bandwidth of the GPU. You should also add a few microseconds of overhead for each kernel you launch.
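To make that concrete, here is a rough byte count for the operations in the posted kernel (my sketch, not from the thread), using the dimensions from the code; Sgemv is called with beta = 0, so each call reads the matrix and x and writes y:

#include <stdio.h>

// Approximate global-memory traffic of y = alpha*A*x (beta == 0, as in the
// kernel): read the m x n matrix and the n-vector x, write the m-vector y.
static double gemv_bytes(double m, double n) {
    return 4.0 * (m * n + n + m); // 4 bytes per float
}

int main(void) {
    double N = 1384, C = 22, G = 3; // MATRIX_SIZE, C_HEIGHT, G_WIDTH

    double bytes = gemv_bytes(N, N)  // xhatf = Phi*xhatp (Phi alone is ~7.7 MB)
                 + gemv_bytes(N, C)  // tmp   = Ko*y
                 + gemv_bytes(G, N)  // u     = -Kc*xhatf
                 + 4 * 3 * N         // Saxpy: read tmp and xhatf, write xhatf
                 + 4 * 2 * N;        // Scopy: read xhatf, write xhatp

    printf("traffic per iteration: %.2f MB\n", bytes / 1e6);         // ~7.85 MB
    printf("at 208 GB/s:           %.1f usec\n", bytes / 208e9 * 1e6); // ~37.7
    return 0;
}

That works out to roughly 7.9 MB per iteration, almost all of it the Phi matrix, or about 38 usec at the theoretical 208 GB/sec, before any of the per-kernel launch overhead mentioned above.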

@Gregory Diamos, thanks for your response. So if my algorithm's FLOP count is 3,900,109, with 4 bytes per float, and the Tesla K20's memory bandwidth is quoted at 208 GB/sec, then (3,900,109 floats * 4 B/float) / 208e9 B/sec * 1e6 usec/sec = 75 usec. Does that sound right? 75 us out of the 133 us is spent just transferring data?

Keep in mind that 208 GB/sec is the theoretical bandwidth of a K20. With ECC enabled (the default), the achievable bandwidth is more like 155 GB/sec. This is from memory, but I don't think I am far off. You can run the bandwidthTest app that ships with the CUDA samples to measure it on your card.

I do not think you can compute bandwidth directly from FLOPs; some operations may be on register values. You would want to determine the total amount of data read from, and written to, global memory. E.g., in a vector addition you read two source elements and write one result element for each floating-point addition, making the total memory traffic 3 * N * sizeof(array_element) bytes, where N is the number of elements per vector.
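For instance (my arithmetic, not from the thread), applying that rule to the Saxpy in the posted kernel, with N = 1384:

#include <stdio.h>

int main(void) {
    size_t N = 1384;                        // vector length in the posted kernel
    double bytes = 3.0 * N * sizeof(float); // read x, read y, write y

    printf("Saxpy traffic: %.1f KB\n", bytes / 1e3);           // ~16.6 KB
    printf("at 155 GB/s:   %.3f usec\n", bytes / 155e9 * 1e6); // ~0.107 usec
    return 0;
}

At ~0.1 usec of traffic, a vector operation of this size is dominated entirely by the few microseconds of launch overhead mentioned above, not by bandwidth.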