performance issue


I am trying to implement a large chunk of computation involved in the least squares fit of a circle on the GPU. I had posted my code earlier as well and got a couple of suggestions which I implemented and got a significant performance gain. But there still seem to be some issues which cause my implementation to be much slower than the CPU. One of them I think is this

1)I am multiplying a matrix with its transpose ((3xn)*(nx3)). I have a block size of 50x3 and my threadIdx.x (tid) goes from 0-50 and threadIdx.y (wid) from 0-3. The result obviously is a 3x3 but the problem is that the index tid goes beyond 3 which would result in unnecessary computation.

Here is my code…

Host code

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

// includes, project

#include <cutil.h>

// includes, kernels

#include <>



//! Entry point for Cuda functionality on host side

//! @param data  data to process on the device

//! @param estimates   initial estimates of parameters


extern "C" 

void runTest(float *data, float *estimates, int n )




	unsigned int hTimer;



        unsigned int size_A = n*3;    

        unsigned int mem_size_A = sizeof(float)* size_A;

	unsigned int size_B = 3;

	unsigned int mem_size_B = sizeof(float)*size_B;

	unsigned int size_updates = 3;

	unsigned int mem_size_updates = sizeof(float)*size_updates;	

	// setup execution parameters

	dim3 threads(50,3,1);

	dim3 grid(n/threads.x,1,1);	

       // allocate device memory

        float* d_A;

        CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size_A));

	float* d_B;

        CUDA_SAFE_CALL(cudaMalloc((void**) &d_B, mem_size_B));

	float* d_updates;

        CUDA_SAFE_CALL(cudaMalloc((void**) &d_updates, mem_size_updates));	

	// allocate mem for the result on host side

	float* h_D = (float*) malloc(mem_size_updates);

	float err0;

	float eps = 0.00001;


	// copy host memory to device

        CUDA_SAFE_CALL(cudaMemcpy(d_A, data, mem_size_A,

                              cudaMemcpyHostToDevice) );




        CUDA_SAFE_CALL(cudaMemcpy(d_B, estimates, mem_size_B,

                              cudaMemcpyHostToDevice) );


  // execute the kernel

  kernel<<< grid, threads>>>( d_A, d_B, d_updates, n);


  // check if kernel execution generated an error

  CUT_CHECK_ERROR("Kernel execution failed");



  // copy result from device to host

  CUDA_SAFE_CALL(cudaMemcpy(h_D, d_updates, mem_size_updates,





        err0=(fabs(h_D[0]) + fabs(h_D[1]) + fabs(h_D[2]));



	printf("Total GPU time: %f msecs.\n", cutGetTimerValue(hTimer));

	printf("\n D\n");

	for(unsigned int i = 0; i<3; i++)


  printf("%f \n", estimates[i]);


	// cleanup memory












#define AS(i,j)   CUT_BANK_CHECKER(((float*)&As[0][0]))

#define BS(i)     CUT_BANK_CHECKER(((float*)&Bs[0]))

#define CS(i, j)  CUT_BANK_CHECKER(((float*)&Cs[0][0]))

#define DS(i)     CUT_BANK_CHECKER(((float*)&Ds[0]))

#define PS(i)     CUT_BANK_CHECKER(((float*)&Ps[0]))

#define RS(i)     CUT_BANK_CHECKER(((float*)&Rs[0]))

#define LS(i)   CUT_BANK_CHECKER(((float*)&Ls[0]))

#define YS(i)     CUT_BANK_CHECKER(((float*)&Ys[0]))

#define XS(i)     CUT_BANK_CHECKER(((float*)&Xs[0]))


#define AS(i, j)   As[i][j]

#define BS(i)      Bs[i]

#define CS(i, j)   Cs[i][j]

#define DS(i)      Ds[i]

#define PS(i)      Ps[i]

#define RS(i)      Rs[i]

#define LS(i)      Ls[i]

#define YS(i)      Ys[i]

#define XS(i)      Xs[i]



//! Kernel for minimization of parameters for least squares fit of a circle//


__global__ void kernel(float* A,  float* B, float* D, unsigned int n)



	// Block index

    const unsigned int bx = blockIdx.x;

    const unsigned int by = blockIdx.y;

   // Thread index

    const unsigned int tid = threadIdx.x;

    const unsigned int wid = threadIdx.y;

	const unsigned int htid = (tid * blockDim.y) + wid;

   // Index of the first sub-matrix of A processed by the block

    int aBegin = n * blockDim.x * by;

   // Index of the last sub-matrix of A processed by the block

    int aEnd   = aBegin + n*3;

   // Step size used to iterate through the sub-matrices of A

    int aStep  = blockDim.x*blockDim.y;


	__shared__ float As[50][3];

    __shared__ float Bs[3];

	__shared__ float Cs[50][3];

	__shared__ float Ds[50];

	__shared__ float Ps[9];

	__shared__ float Rs[3];

	__shared__ float Ls[9];

	__shared__ float Ys[3];

	__shared__ float Xs[3];

	float Csub = 0.0;

	float Dsub = 0.0;

	float sum = 0.0;

	for (int a = aBegin;a < aEnd;a += aStep)


        //loading data onto shared memory  

        AS(tid, wid) = A[a+htid];


       BS(0) = B[0];

        BS(1) = B[1];

        BS(2) = B[2];

       float r=0.0;

       //Populating the Jacobian

        r = sqrt((AS(tid, 0)-BS(0))*(AS(tid, 0)-BS(0))+(AS(tid, 1)-BS(1))*(AS(tid, 1)-BS(1)));

  CS(tid, 0) = -(AS(tid, 0)-BS(0))/r;

        CS(tid, 1) = -(AS(tid, 1)-BS(1))/r;

        CS(tid, 2) = -(AS(tid, 2));

        DS(tid) = -(r - BS(2));


 //multiplying the Jacobian with its transpose

  for(int k = 0; k < blockDim.x; k++)

  	Csub += CS(k,wid)* CS(k,tid);


  PS(htid) = Csub;

 //the right-hand side vector

  for(int k = 0; k < blockDim.x; ++k)

  	Dsub += CS(k, wid) * DS(k);


  RS(wid) = Dsub;



	//Solving the matrix using Cholesky decomposition to obtain the lower triangular matrix

	for(int i=0; i<blockDim.y;++i)


  LS(i*(blockDim.y+1)) = sqrt(PS(i*(blockDim.y+1)));

  for(int j=0;j<blockDim.y;++j)



  	for(int k=0; k <= i-1; ++k)


    sum += LS(i*blockDim.y+k)*LS(j*blockDim.y+k);


  	LS(j*blockDim.y+i) = (PS(i*blockDim.y+j)-sum)/LS((blockDim.y+1)*i);




	//forward substitution

	YS(0)= RS(0)/LS(0);  	

	for (int i=1; i<blockDim.y; i++)


  sum = 0.0;

  for (int j=0; j<=i-1; j++)


  	sum += LS(i*blockDim.y+j)*YS(j);


  YS(i) = (RS(i)-sum)/LS((blockDim.y+1)*i);


	XS(blockDim.y-1) = YS(blockDim.y-1)/LS(blockDim.y*(blockDim.y-1)+(blockDim.y-1));

	for (int i=blockDim.y-2; i>=0; i--)


  sum = 0.0;

  for (int j=i+1; j<blockDim.y; j++)


  	sum += LS(j*blockDim.y+i)*XS(j);


  XS(i) = (YS(i)-sum)/LS((blockDim.y+1)*(i));


	D[wid] = XS(wid);



#endif // #ifndef _FITCIRCLE_KERNEL_H_

Here n is the size which is any multiple of 50 as my block size is 50.

The results are alright but I am just not able to figure out why I am not getting the performance I need. I would greatly appreciate any suggestions about how Icould make this more parallel as I really need to get this working quickly.

Thanks in advance.


Try compiling your kernel with nvcc using the option --cubin and from the resulting cubin file read off the register count and amount of shared memory. Armed with this information use the occupancy calculator to figure our your occupancy, maybe it is very low.

See this for more details:


I’m by no means an expert, so take my suggestions with a grain of salt.

  1. Isn’t this __syncthreads unnecessary?
    PS(htid) = Csub;

  2. Why are you using 50 for the width of your block? If I understand correctly this guarantees that some of your threads will be idle; use a multiple of 16 (or is it 32?)

  3. This looks wrong; CS is 50x3 but the second CS is using the tid in the second index
    Csub += CS(k,wid)* CS(k,tid);

  4. The code after “//Solving the matrix” is essentially single threaded… the entire execution is done by all 50x3 threads, but they are all computing exactly the same thing. This is clear from the fact that the only line that is dependent on the thread id is D[wid] = XS(wid)
    If you can’t improve the code, try running it only if ( wid == 0 and tid == 0 ). Sure the other threads in the warp will run it but other warps will all skip it.

Same problem with the //Populating the Jacobian code; put an if wid == 0 so that the warps with wid > 0 will all branch over the code.

Experts please call my bluff

Another thing you can try is the following (in several places). For this code

//forward substitution

YS(0)= RS(0)/LS(0);   

for (int i=1; i<blockDim.y; i++)


 sum = 0.0;

 for (int j=0; j<=i-1; j++)


  sum += LS(i*blockDim.y+j)*YS(j);


 YS(i) = (RS(i)-sum)/LS((blockDim.y+1)*i);


As mentioned, all threads will compute the exact same thing.

But you know that blockDim.x > blockDim.y, so instead of using i as the index, use tid, and that way different threads will compute different portions simultaneously:

if ( wid == 0 && tid < blockDim.y )


   sum = 0.0;

   for ( int j = 0; j<tid; j++ )

       sum += LS(tid*blockDim.y+j)*YS(j);

   YS(tid) = (RS(tid)-sum)/LS((blockDim.y+1)*tid);



  • the code runs for tid == 0, i.e., i==0 in your code; no point in all threads initializing YS(0) before the loop, not saving anything there since all the threads are doing the math anyway!

  • All the threads of the warp will run all the iterations, there is still a lot of divergence for the j loop.

The warp size on the G80 is 32, so you want to aim for a multiple of that number to keep all the pipelines filled.

Hi Stewie,

Thanks a lot for the suggestions, I got rid of the unnecessary __syncthreads and it works better now. I should’ve tried the multiple of 32 threads earlier and the third issue that you have pointed is what I’m struggling to get around. I basically need an intermediate index that goes from 0-3 and it needs to get incremented before wid. The tid gives me the right result because I copy back only the part I need (3x3). But this is very inefficient as there is unnecessary computation.

From what I see, this is the only way I can multiply the (3x50)*(50x3), could you please tell me if i can get around this?


Just another thing, I did not quite follow the 4th point that you have mentioned. The D[wid] = XS(wid) is to copy back the result after computation from shared memory to device memory. The reason I did not use __syncthreads is because I am just trying to operate on a 3x3 matrix and a 3x1 vector. Also, this is outside the loop for the blocks. I am curious to know if it is possible to carry out such a computation at all…after multiplying a 3xn by an nx3 I end up with a 3x3…the cholesky and forward substitution on this small a matrix seems very dicey to me…as you have mentioned it is single threaded. I tried the if(wid==0 && tid == 0) as suggested by you but that did not work…or I think it is my lack of understanding…:)

Thanks & regards,



I observed another thing about my CUDA implementation…from my understanding of the Debug and Release execution configuration of any code, the debug mode writes out information at runtime while the Release does not and this is the reason release execution time is 3-4 times quicker. But I do not see any change in execution times when the I execute the CUDA code. Is there something that I need to change??

Thanks in advance,


You should always measure the kernel execution times using the CUDA profiling tool. See the documentation. It will write a file which holds kernel runtimes and time spent for other CUDA operations like cudaMemcpy. Helps to see that the bottleneck is not actually your host program. :)


I executed the CUDA example scalarprod in Debug and Release modes…This includes time profiling the kernel execution time and the CPU time…from what I observe…there is no difference in the kernel execution time for both configurations…but the CPU time is about 4.5 times quicker in Release than in Debug. Does this mean the nvcc compiler is designed to execute the kernel in one configuration??



The code executed on the GPU is the same in both debug and release mode, so it makes sense that the GPU time should be the same. In debug mode, the host (CPU) code is compiled without optimizations and with debugging support, so it makes sense that the CPU time will be higher in debug mode.


As I understand it, with a 50x3 thread block, you will have 150/32 warps = 5 warps, all doing those for loops and executing the same thing.

Since they are all computing the XS values in that loop, you don’t really need all five warps computing exactly the same thing.

So the idea is to wrap it with the if to ensure that only one of the warps does the work; since all the other warps will consistently branch they will not execute this code at all.

You are right though, if you don’t have a syncthreads before the D = XS assignment, then it cannot work. This should however:

if ( wid == 0 && tid < 3 )


   // the 3 for loops

   // Now do the final assignment with the tid instead

   D[tid] = XS(tid);


Let me know if this works; if it doesn’t then I’ve misunderstood something!

Note it will mean less work for the GPU to do, but since your code is overall so single threaded, it may not give substantial gains.


Please correct me if I’m wrong. I just realized from what you have said above that my understanding of parallel computation of not only the cholesky and forward substitution but also while populating the Jacobian matrix was incorrect. I assumed that while doing the Jacobian… the fact that there is a different formula operating on each of the rows of the initial 3xn matrix passed into the kernel could be parallelized but evidently only a single instruction is parallelized while working with blocks. I see that I am trying to assign different instructions to different warps within a block which is causing the entire computation to be serialized. Is there a way I could make this part parallel??
Also, I changed my block size to a multiple of 32 in fact to 64x3 and saw significant gain in performance …but the condition if(wid==0 && tid<3) for the last part of the computation did not work the results are some undefined values…thanks a lot for the inputs and help.



I’m not sure why it isn’t working. What am I missing?

The cholesky and forward substitution parts are completely independent of the threadid, it is constant for the entire block.

So does at least this work?

if ( wid == 0 && tid == 0 )


    // cholesky code

    // forward substitution code



D[wid] = XS(wid);

Hi Stewie,

I tried what you have mentioned above but I’m unable to get the right result by doing so. I am really confused about this. I have done everything within my reach to optimize it…but I am just not able to figure out the reason for the latency. :( . I have attached the changed kernel and the host part of the code…Could you please take a look again and let me know if there is anything I can do…I would greatly appreciate your help.



Hi Stewie,

I tried what you have mentioned above but I’m unable to get the right result by doing so. I am really confused about this. I have done everything within my reach to optimize it…but I am just not able to figure out the reason for the latency. :( . I have attached the changed kernel and the host part of the code…Could you please take a look again and let me know if there is anything I can do…I would greatly appreciate your help.



I have been trying to find the documentation that talks about the CUDA profiling tool but I am unable to. Could someone please tell me where to find something which talks about the profiling tool and how to use it?


There should be a document CUDA_Profiler_0.8.txt under /usr/local/cuda/doc.

Anyway to enable the profiler, set a variable CUDA_PROFILE=1 in your shell and run the code as usual. It will generate a trace called cuda_profile.log.

You can change the name of the file using the variable CUDA_PROFILE_LOG.


I am trying to use this profiling tool in a windows environment (Visual Studio 2003). Could you please let me know where I would have to set this variable?



In the terminal from which you start your application.
It is the driver that is instrumented, you don’t need to change your compiler flags.