I have been working on implementing a minimization technique for the least squares fitting of a circle on the GPU. I pass in an nx3 array along with a 3x1 vector and obtain a 3x1vector as the result. My implementation is right but I have two issues

  1. I am using just one block onto which I load all data which limits my array size

  2. Also the implementation is very slow.

I tried using blocks but the kernel does not allow me to pass in an array size more than 256x3 even if I use about 200 threads per block and increase the number of blocks in the grid. I’m not too sure how i can get around this. Here is the code which I know is very inefficient.

host code

void runTest(float *data, float *estimates, int n )



	unsigned int hTimer;




        unsigned int size_A = n*3;    

        unsigned int mem_size_A = sizeof(float)* size_A;

	unsigned int size_B = 3;

	unsigned int mem_size_B = sizeof(float)*size_B;

	unsigned int size_updates = 3;

	unsigned int mem_size_updates = sizeof(float)*size_updates;  

       // allocate device memory

        float* d_A;

        CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size_A));

	float* d_B;

        CUDA_SAFE_CALL(cudaMalloc((void**) &d_B, mem_size_B));

	float* d_updates;

        CUDA_SAFE_CALL(cudaMalloc((void**) &d_updates, mem_size_updates));	


	//allocate device memory to store results

	//unsigned int mem_size_C = sizeof(float)* 3 * 3;


	//CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size_C));

	// allocate mem for the result on host side

	float* h_D = (float*) malloc(mem_size_updates);

	// setup execution parameters

        dim3 grid(1,1,1);

	dim3 threads(3,n,1);


        double err0;

        double eps = 0.000001;	


	// copy host memory to device

        CUDA_SAFE_CALL(cudaMemcpy(d_A, data, mem_size_A,

                              cudaMemcpyHostToDevice) );



            CUDA_SAFE_CALL(cudaMemcpy(d_B, estimates, mem_size_B,

                              cudaMemcpyHostToDevice) );              


  	// execute the kernel

            kernel<<< grid, threads, mem_size_A*4>>>(d_A, d_B,d_updates, n);


           // check if kernel execution generated an error

            CUT_CHECK_ERROR("Kernel execution failed");

           // copy result from device to host

            CUDA_SAFE_CALL(cudaMemcpy(h_D, d_updates, mem_size_updates,





            err0=(double)(fabs(h_D[0]) + fabs(h_D[1]) + fabs(h_D[2]));




        // cleanup memory






        printf("GPU time: %f msecs.\n", cutGetTimerValue(hTimer));


	for(unsigned int i = 0; i<3; i++)


  printf("%f \n", estimates[i]);





#define AS(i) data[ASindex+(i)]

#define BS(i) data[BSindex+(i)]

#define CS(i) data[CSindex+(i)]

#define DS(i) data[DSindex+(i)]

#define PS(i) data[PSindex+(i)]

#define RS(i) data[RSindex+(i)]

#define LS(i) data[LSindex+(i)]

#define YS(i) data[YSindex+(i)]

#define XS(i) data[XSindex+(i)]


//! Kernel for minimization of parameters for least squares fit of a circle


__global__ void kernel(float* A,  float* B, float* D, unsigned int n)



	extern __shared__ float data[];

	unsigned int ASindex=0;

	unsigned int BSindex=3*n;

	unsigned int CSindex=0;

	unsigned int DSindex=3*n+3;

	unsigned int PSindex=4*n+3;

	unsigned int RSindex=4*n+12;

	unsigned int LSindex=4*n+15;

	unsigned int YSindex=4*n+24;

	unsigned int XSindex=4*n+27;

	// access thread id

	const unsigned int tid = threadIdx.x; // 0-3

	// warp id

	const unsigned int wid = threadIdx.y; // 0-n

	const unsigned int ltid = (wid * blockDim.x) + tid;


	//loading data onto shared memory  




	BS(0) = B[0];

	BS(1) = B[1];

	BS(2) = B[2];


	float r=0.0;

	//Populating the Jacobian




  CS(ltid) = -(AS(ltid)-BS(0))/r;

  CS(ltid+1) = -(AS(ltid+1)-BS(1))/r;

  CS(ltid+2) = -(AS(ltid+2));

  DS(wid)= -(r-BS(2));  



	float Csub = 0.0;

	//multiplying the transpose with the Jacobian	

	for (int k = 0; k < n; ++k)

	Csub += CS(k*blockDim.x + tid)* CS( blockDim.x * k + wid );



	PS(ltid) = Csub;

	//C[ltid]= PS(ltid);

	float Dsub = 0.0;

	//multiplying the Jacobian by the right hand side vector

	for (int k = 0; k < n; ++k)

	Dsub += CS(k*blockDim.x + tid)* DS(k);



	//D[tid] = RS(tid);

	float sum;

	//Solving the matrix using Cholesky decomposition

	for(int i=0; i<blockDim.x;i++)



  for(int j=0;j<blockDim.x;j++)



  	for(int k=0; k <= i-1; k++)


    sum += LS(i*blockDim.x+k)*LS(j*blockDim.x+k);


  	LS(j*blockDim.x+i) = (PS(i*blockDim.x+j)-sum)/LS((blockDim.x+1)*i);





	//Forward Substitution

	YS(0)= RS(0)/LS(0);  	

	for (int i=1; i<blockDim.x; i++)


  sum = 0.0;

  for (int j=0; j<=i-1; j++)


  	sum += LS(i*blockDim.x+j)*YS(j);


  YS(i) = (RS(i)-sum)/LS((blockDim.x+1)*i);



	for (int i=blockDim.x-2; i>=0; i--)


  sum = 0.0;

  for (int j=i+1; j<blockDim.x; j++)


  	sum += LS(j*blockDim.x+i)*XS(j);


  XS(i) = (YS(i)-sum)/LS((blockDim.x+1)*(i));



	D[tid] = XS(tid);



Thanks in advance for any help.


There are a few of things that are affecting your performance negatively:

  1. Using only one block, as you mentioned, which means that you’re only using one multiprocessor (out of 16 available on a g80).

  2. There are quite a few shared memory references that will cause bank-conflicts. Any memory references that use tid as part of indexing will conflict, since the range for your tid is 0 through 2, inclusively. Thus, only 1/3rd of a warp’s threads are actively reading shared memory.

  3. Compute-intensive portions of your code are not parallelized - all threads are performing redundant work since they operate on exactly the same data. This happens in your Cholesky decomposition as well as forward substitution. In such cases you are effectively using only one core of a multiprocessor (1/128th of g80 capability).

As far as passing data to the device goes, a single block needs 48*n bytes of shared memory. When n=256, that amounts to 12KB. Since g80 provide 16KB shared memory per block, your n cannot exceed 341.

I’d suggest looking up parallel algorithms and implementing those with cuda, rather than porting a sequential formulation.

Thanks for the reply. I guess I am not too clear about how the data maps to memory. In my case, as you have mentioned, it is hard to determine which part of the memory is accessed by the threads due to the redundancy in my computations. Also, I tried to change the memory reference range for [i] tid[i] to be 0 through (n-1) instead of 2. The first part of the computation was fine but there are issues when I try to index into the array for multiplication/decomposition.

Pardon me if I am wrong, but the n I have specified here is the number of rows in the array which can vary while the number of columns is a constant (3). Which means if my n is 341 then my array would have 3413 floats and 3413*4 (4092) bytes of memory would be required which is well within the limits. Then, do I get the kernel launch failure error due to the fact that I try to allocate about 3-5 more shared memory arrays of the same size(including all the variables in the code) as the array I pass in during runtime? Also, could you please let me know if using constant memory would help overcome the memory access problems. I have not tried it but would it be possible to use constant memory to parallelize these computations?

Thanks again.


Whenever you use tid as part of a shared memory address, your active threads will be accessing 3 banks, causing serialization (see Section of the Programming Guide).

Yes. If you look at the call to your kernel, you request mem_size_A4 bytes, where mem_size_A = n3*sizeof(float). That’s how you end up requesting 48n bytes.

Constant memory is likely to help, since it is cached (Section However, your main problem is that the majority of your computation is not parallelized at all (all threads do the same thing with identical data) - you need to partition the work among the threads.


Not only that, but a thread block that is 3xn will be made up of n warps that only have 3 active threads each. That means you’ll be using only 3/16 of the GPU!

It’s better to transpose your thread block to be nx3 threads, then if n is a multiple of 16, you will not waste any GPU processors. Even if it is not a multiple of 16, you will have at most 3 non-full warps.

BTW, what is the range of sizes of n in this computation?


Hi Mark,

There is no range for n. It could be any number. That is where I am quite confused about how I could segment the data into blocks by which I could parallelize most part of the computation.

Could you please give me any idea about this?