I have written a code in C and I would like to rewrite it in CUDA to speed up the calculation. Below are the C code and the CUDA code. The program runs, but the results are not correct, and I think the problem is with the iteration. A couple of pointers on the iteration should put me back on the right track.

C code:

for (i = 1; i <= nodecount; ++i) {        /* outer loop: one result per node */
    xx = x_h[i-1];
    yy = y_h[i-1];
    sum = 0.;
    for (p = 1; p <= nodecount; ++p) {    /* inner loop: accumulate the RBF sum */
        d1 = xx - x_h[p-1];
        d2 = yy - y_h[p-1];
        factor = alpha * (pow(d1,2) + pow(d2,2));
        rbf[p-1] = exp(-factor);
        sum += rbf[p-1] * rhs_h[p-1];
    }
    numer[i-1] = sum;
}

CUDA code:

void gpuGaussian_solve(int N, float *num, float *soln, float2 *node, float alpha)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    kernel_solve<<<blocksPerGrid, threadsPerBlock>>>(N, num, soln, node, alpha);
}

__global__ void kernel_solve(int N, float* num, float* soln, float2 *node, float alpha)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;

    if (j < N)
    {
        num[j] = computeSoln(node[j], node[i], alpha, soln[i]);
    }
}

You seem to be passing host pointers directly to a kernel on the device. You need to copy the data from host to device memory first (and copy the result back to host memory) using cudaMemcpy().
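Inside gpuGaussian_solve, that would look roughly like this (a minimal sketch; the d_ pointer names are placeholders, not taken from your code):

float *d_num, *d_soln;
float2 *d_node;
size_t fBytes  = N * sizeof(float);
size_t f2Bytes = N * sizeof(float2);

cudaMalloc((void**)&d_num,  fBytes);
cudaMalloc((void**)&d_soln, fBytes);
cudaMalloc((void**)&d_node, f2Bytes);

/* host -> device before the launch */
cudaMemcpy(d_soln, soln, fBytes,  cudaMemcpyHostToDevice);
cudaMemcpy(d_node, node, f2Bytes, cudaMemcpyHostToDevice);

kernel_solve<<<blocksPerGrid, threadsPerBlock>>>(N, d_num, d_soln, d_node, alpha);

/* device -> host after the launch */
cudaMemcpy(num, d_num, fBytes, cudaMemcpyDeviceToHost);

cudaFree(d_num); cudaFree(d_soln); cudaFree(d_node);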

In the CPU version, you accumulate the value of sum in a for loop and assign the final value to numer[i-1]. In the CUDA version, you directly assign the value with num[j] = computeSoln(node[j], node[i], alpha, soln[i]);
This means that no sum is calculated; instead, all threads with an equal value of j will write their individual value to this position.

No. I have already copied the data from host to device memory. I am not getting compilation or execution errors, just incorrect results. The function gpuGaussian_solve(int N, float *num, float *soln, float2 *node, float alpha) is a C wrapper for the kernel (the .cu function) and it is called from a C program. The problem is how to get the inner iteration to produce the result of the calculation for every single outer iteration, as I have shown in the C program. Thanks.

The easiest solution would be to let one thread take one node and perform the entire inner loop by itself, but this limits the number of threads you can use to nodecount.
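Roughly along these lines (a minimal sketch, assuming num, soln, and node are already device pointers; kernel_solve_1d is just an illustrative name, launched with the same blocksPerGrid/threadsPerBlock configuration as your wrapper):

__global__ void kernel_solve_1d(int N, float *num, float *soln, float2 *node, float alpha)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;   /* one thread per node */
    if (i < N)
    {
        float sum = 0.f;
        for (int p = 0; p < N; ++p)                  /* entire inner loop in this thread */
        {
            float d1 = node[i].x - node[p].x;
            float d2 = node[i].y - node[p].y;
            sum += expf(-alpha * (d1 * d1 + d2 * d2)) * soln[p];
        }
        num[i] = sum;                                /* only this thread writes num[i] */
    }
}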

A third way would be to use atomicAdd() (which requires compute capability 2.0 for float) to perform the summation. You can of course combine this with the solution above if you want multiple blocks handling the same node, as communication through shared memory is usually faster than a large number of atomic additions in global memory.
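For illustration, a hedged sketch of the atomicAdd variant (kernel_solve_atomic is an invented name; num must be zeroed, e.g. with cudaMemset(), before the launch, and the kernel would be launched with a 2D grid such as dim3 block(16,16); dim3 grid((N+15)/16, (N+15)/16);):

__global__ void kernel_solve_atomic(int N, float *num, float *soln, float2 *node, float alpha)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* outer-loop index (node) */
    int p = blockIdx.x * blockDim.x + threadIdx.x;   /* inner-loop index */
    if (i < N && p < N)
    {
        float d1 = node[i].x - node[p].x;
        float d2 = node[i].y - node[p].y;
        /* each thread atomically contributes one term of the sum for node i */
        atomicAdd(&num[i], expf(-alpha * (d1 * d1 + d2 * d2)) * soln[p]);
    }
}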

You are right, but I am not sure how to fix it. In the CUDA code, I want to put the final value of sum in num[j]. I changed the iteration, but the result is not the same as numer[i-1] from the CPU. For example, where the CPU program gives me a result like 1.4E-14, the GPU gives me 1.4E-7. The disparity gets bigger as I increase the number of iterations (nodecount). Below is my new code.

__global__ void kernel_solve(int N, float* num, float* soln, float2 *node, float alpha)
{
    int i = blockIdx.x;   // Outer loop divided into blocks
    int j = threadIdx.x;  // Inner loop divided into threads
    if (i < N && j < N)
    {
        num[i] = computeSoln(node[i], node[j], alpha, soln[j]);
    }
}

__device__ float computeSoln(float2 row, float2 col, float alpha, float soln)
{
    float sum, d1, d2, d3, d4, xx, yy;
    xx = row.x;
    yy = row.y;
    sum = 0.f;
    d1 = xx - col.x;
    d2 = yy - col.y;
    d3 = alpha * (pow(d1,2) + pow(d2,2));
    d4 = exp(-d3);
    sum = sum + d4 * soln; // This is where the problem might be. I want sum to carry the final value at the end of the inner iteration.
    return sum;
}

Thanks for the reply. I took the VecAdd code from the CUDA SDK and modified it to emulate a simplified version of my code. Can you tell me what I am missing? The outer loop is not iterating as I expect.

Here is a slightly simpler version (not the most efficient, but it will do the job). Each i is calculated by one block.

__global__ void VecAdd( float* A, float* B, float* C, int N, float *D)
{
    __shared__ float cache[256];
    int i = blockIdx.x;
    int cacheIndex = threadIdx.x;
    float sum = 0.f;
    if (i < N)
    {
        for (int jmore = threadIdx.x; jmore < N; jmore += 256) // This loop is in case N > 256.
        {
            sum += B[jmore] - A[i];
        }
    }
    // set the cache values
    cache[cacheIndex] = sum;
    // synchronize threads in this block (every thread must reach this point)
    __syncthreads();
    // parallel reduction in shared memory
    int k = blockDim.x / 2;
    while (k != 0)
    {
        if (cacheIndex < k)
            cache[cacheIndex] += cache[cacheIndex + k];
        __syncthreads();
        k /= 2;
    }
    if (cacheIndex == 0)
        C[blockIdx.x] = cache[0];
}

On the host you launch it with

VecAdd<<<N, 256>>>(arguments);
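Carried back to the Gaussian problem from the start of the thread, the same block-per-i pattern would look roughly like this (a sketch only, not tested against your data; kernel_solve_reduce is an invented name, and num, soln, node are assumed to be device pointers):

__global__ void kernel_solve_reduce(int N, float *num, float *soln, float2 *node, float alpha)
{
    __shared__ float cache[256];
    int i = blockIdx.x;                          /* one block per node */
    float sum = 0.f;
    if (i < N)
    {
        for (int p = threadIdx.x; p < N; p += 256)   /* threads stride over the inner loop */
        {
            float d1 = node[i].x - node[p].x;
            float d2 = node[i].y - node[p].y;
            sum += expf(-alpha * (d1 * d1 + d2 * d2)) * soln[p];
        }
    }
    cache[threadIdx.x] = sum;
    __syncthreads();
    for (int k = blockDim.x / 2; k != 0; k /= 2)     /* reduce partial sums in shared memory */
    {
        if (threadIdx.x < k)
            cache[threadIdx.x] += cache[threadIdx.x + k];
        __syncthreads();
    }
    if (threadIdx.x == 0 && i < N)
        num[i] = cache[0];                           /* final sum for node i */
}

launched as kernel_solve_reduce<<<N, 256>>>(N, d_num, d_soln, d_node, alpha);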

Please put your code between [ code ] [ /code ] tags. It will make the reading easier.

I think the issue is with the two-dimensional thread indexing. You set cacheIndex to threadIdx.x, but in the two-dimensional case several threads will have the same threadIdx.x, causing the values in cache to be overwritten. You need to change cacheIndex to make it unique for each thread, or make cache two-dimensional as well: __shared__ float cache[16][16]; and use cache[threadIdx.y][cacheIndex] += cache[threadIdx.y][cacheIndex + k];
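A hedged skeleton of that two-dimensional variant, assuming 16x16 thread blocks (only the caching and reduction are the point here; the accumulation of sum and the output indexing are placeholders):

__global__ void reduce2d(float* C, int N)
{
    __shared__ float cache[16][16];
    int cacheIndex = threadIdx.x;
    float sum = 0.f;   /* assume each thread has accumulated its partial sum here */

    cache[threadIdx.y][cacheIndex] = sum;   /* one cache row per threadIdx.y */
    __syncthreads();

    int k = blockDim.x / 2;                 /* reduce each row along x */
    while (k != 0)
    {
        if (cacheIndex < k)
            cache[threadIdx.y][cacheIndex] += cache[threadIdx.y][cacheIndex + k];
        __syncthreads();
        k /= 2;
    }
    if (cacheIndex == 0)
        C[blockIdx.x * blockDim.y + threadIdx.y] = cache[threadIdx.y][0]; /* illustrative output indexing */
}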

When you finally update your coefficients with C[blockIdx.x] = cache[0], you are only writing one element, which is why the rest of them are zero.

As a side remark, I think the code will be faster if you transpose your matrix and make the reduction sum over blockIdx.y instead, which will avoid unnecessary warp divergence.

Thanks for all the help. The problem is fixed now. I will look into transposing the matrix and making the reduction sum over blockIdx.y instead. Once again, thank you so much.