Iteration help in CUDA

I have written some code in C and I would like to rewrite it in CUDA to speed up the calculation. Below are the C code and the CUDA code. The program runs, but the results are not correct, and I think the problem is with the iteration. A couple of pointers on how to handle the iteration should put me back on the right track.

C code:


for (i = 1; i <= nodecount; ++i) {
    xx = x_h[i-1];
    yy = y_h[i-1];
    sum = 0.;
    for (p = 1; p <= nodecount; ++p) {
        d1 = xx - x_h[p-1];
        d2 = yy - y_h[p-1];
        factor = alpha * (pow(d1,2) + pow(d2,2));
        rbf[p-1] = exp(-factor);
        sum += rbf[p-1] * rhs_h[p-1];
    }
    numer[i-1] = sum;
}

CUDA code:


void gpuGaussian_solve(int N, float *num, float *soln, float2 *node, float alpha)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    kernel_solve<<<blocksPerGrid, threadsPerBlock>>>(N, num, soln, node, alpha);

    cudaThreadSynchronize();
    cudaError_t errorStatus = cudaGetLastError();
    if (errorStatus != cudaSuccess)
    {
        printf("Matrix solve: %s \n", cudaGetErrorString(errorStatus));
    }
}

__global__ void kernel_solve(int N, float* num, float* soln, float2 *node, float alpha)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;

    if (j < N)
    {
        num[j] = computeSoln(node[j], node[i], alpha, soln[i]);
    }
}

__device__ float computeSoln(float2 row, float2 col, float alpha, float soln)
{
    float sum, d1, d2, d3, d4;
    sum = 0.;
    d1 = row.x - col.x;
    d2 = row.y - col.y;
    d3 = alpha * (pow(d1,2) + pow(d2,2));
    d4 = exp(-d3);
    sum = sum + d4 * soln;

    return sum;
}

You seem to be passing host pointers directly to a kernel on the device. You need to copy the data from host to device memory first (and copy the result back to host memory) using cudaMemcpy().
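For reference, a minimal sketch of that host-side setup could look like the following. The names (num_d, soln_d, node_d, and the host arrays num_h, soln_h, node_h) are placeholders for illustration, not variables from the code above:

// Hypothetical host-side setup: allocate device buffers, copy inputs in,
// launch the wrapper, then copy the result back.
float  *num_d, *soln_d;
float2 *node_d;
cudaMalloc((void**)&num_d,  N * sizeof(float));
cudaMalloc((void**)&soln_d, N * sizeof(float));
cudaMalloc((void**)&node_d, N * sizeof(float2));

cudaMemcpy(soln_d, soln_h, N * sizeof(float),  cudaMemcpyHostToDevice);
cudaMemcpy(node_d, node_h, N * sizeof(float2), cudaMemcpyHostToDevice);

gpuGaussian_solve(N, num_d, soln_d, node_d, alpha);

cudaMemcpy(num_h, num_d, N * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(num_d); cudaFree(soln_d); cudaFree(node_d);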

In the CPU version, you accumulate the value of sum in a for loop and assign the final value to numer[i-1]. In the CUDA version, you directly assign the value with num[j]=computeSoln(node[j],node[i],alpha,soln[i]);
This means that no sum is calculated; instead, all threads with the same value of j will write their individual value to that position.

No. I have already copied the data from host to device memory. I am not getting any compilation or execution errors; it is just that the results are not correct. The function gpuGaussian_solve(int N, float *num, float *soln, float2 *node, float alpha) is a C wrapper for the kernel (the .cu function) and it is called from a C program. The problem is how to get the inner iteration to produce the result of the calculation for every single outer iteration, as shown in the C program. Thanks.

The easiest solution would be to let one thread take one node and perform the entire inner loop by itself, but this would limit the number of threads that you can use to nodecount.
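As a rough illustration of that first approach, a kernel along these lines should reproduce the CPU loop exactly; one thread handles one value of i and runs the whole inner loop serially (this is only a sketch, reusing the parameter names from the wrapper above):

// Sketch: one thread per node, each thread performs the entire inner sum.
__global__ void kernel_solve_1thread(int N, float* num, float* soln, float2* node, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;

    float xx = node[i].x;
    float yy = node[i].y;
    float sum = 0.f;
    for (int p = 0; p < N; ++p)   // inner loop done serially by this thread
    {
        float d1 = xx - node[p].x;
        float d2 = yy - node[p].y;
        sum += expf(-alpha * (d1 * d1 + d2 * d2)) * soln[p];
    }
    num[i] = sum;                 // corresponds to numer[i-1] = sum in the C code
}

This fits the original launch configuration in the wrapper (blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock with threadsPerBlock = 256).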

Another solution would be to have one block handle one (or more) nodes and let each thread calculate one part of the sum, store it in shared memory, perform the summation in shared memory, and finally write the result. Look for example at the "CUDA Tutorial 3 - Thread Communication" article on The Supercomputing Blog (http://supercomputingblog.com/cuda/cuda-tutorial-3-thread-communication/) for an example of doing the reduction.

A third way would be to use atomicAdd() (which requires compute capability 2.0 for float) to perform the summation. You can of course combine this with the previous solution if you want multiple blocks handling the same node, as communication through shared memory is usually faster than issuing many atomic additions in global memory.
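A minimal sketch of that atomicAdd variant, assuming num has been zeroed on the device beforehand (the kernel name and launch shape are just placeholders):

// Sketch: one thread per (i, p) pair, partial terms accumulated with atomicAdd.
// Assumes num[] was cleared (e.g. with cudaMemset) before the launch.
__global__ void kernel_solve_atomic(int N, float* num, float* soln, float2* node, float alpha)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // outer index (node)
    int p = blockIdx.x * blockDim.x + threadIdx.x;   // inner index (term of the sum)
    if (i >= N || p >= N) return;

    float d1 = node[i].x - node[p].x;
    float d2 = node[i].y - node[p].y;
    float term = expf(-alpha * (d1 * d1 + d2 * d2)) * soln[p];

    atomicAdd(&num[i], term);    // float atomicAdd needs compute capability 2.0+
}

A 2D launch such as kernel_solve_atomic<<<dim3((N+15)/16,(N+15)/16), dim3(16,16)>>>(N, num, soln, node, alpha); would cover all (i, p) pairs.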

You are right, but I am not sure how to fix it. In the CUDA code I want to put the final value of sum in num[j]. I changed the iteration, but the result is not the same as numer[i-1] from the CPU. For example, if the CPU program gives me a result like 1.4E-14, the GPU gives me 1.4E-7, and the disparity gets bigger as I increase the number of iterations (nodecount). Below is my new code.

CUDA code:


void gpuGaussian_solve(int N, float *num, float *soln, float2 *node, float alpha)
{
    //int threadsPerBlock = 256;
    //int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    dim3 blocks(N,1);
    dim3 grids(N,1);

    kernel_solve<<<grids, blocks>>>(N, num, soln, node, alpha);

    cudaThreadSynchronize();
    cudaError_t errorStatus = cudaGetLastError();
    if (errorStatus != cudaSuccess)
    {
        printf("Matrix solve: %s \n", cudaGetErrorString(errorStatus));
    }
}

__global__ void kernel_solve(int N, float* num, float* soln, float2 *node, float alpha)
{
    int i = blockIdx.x;   // Outer loop divided into blocks
    int j = threadIdx.x;  // Inner loop divided into threads

    if (i < N && j < N)
    {
        num[i] = computeSoln(node[i], node[j], alpha, soln[j]);
    }
}

__device__ float computeSoln(float2 row, float2 col, float alpha, float soln)
{
    float sum, d1, d2, d3, d4, xx, yy;

    xx = row.x;
    yy = row.y;
    sum = 0.f;
    d1 = xx - col.x;
    d2 = yy - col.y;
    d3 = alpha * (pow(d1,2) + pow(d2,2));
    d4 = exp(-d3);
    sum = sum + d4 * soln;  // This is where the problem might be. I want sum to carry the final value at the end of the inner iteration.

    return sum;
}

Side remark: You would want to replace the use of pow() to square d1 and d2 with simple multiplications (see CUDA 4.1 Best Practices Guide, page 56).
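Applied to the computeSoln() above, that would be, for example:

// plain multiplications instead of pow(d1,2) + pow(d2,2)
d3 = alpha * (d1 * d1 + d2 * d2);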

Hello,

Each thread computes one element of the sum, but you have to reduce all of these to get the final sum. There are several ways to do this. The most direct one is to save everything in an array and use the Thrust library to compute the sum with its reduction function. If you want to program the sum yourself, I suggest taking a look at the book CUDA by Example, this article http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html, or the reduction and scan examples from the SDK.

In addition, a quick Google search for "cuda sum" turns up many tutorials online.
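For illustration, a minimal sketch of the Thrust approach could look like this. Here fill_terms is a hypothetical kernel, and soln_d, node_d, num_h are placeholder names, not code from this thread:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// terms[] holds the N partial products for one fixed node i;
// fill_terms is a hypothetical kernel that writes
// exp(-alpha*((x_i - x_p)^2 + (y_i - y_p)^2)) * soln[p] into terms[p].
thrust::device_vector<float> terms(N);
for (int i = 0; i < N; ++i)
{
    fill_terms<<<(N + 255) / 256, 256>>>(N, i, thrust::raw_pointer_cast(terms.data()),
                                         soln_d, node_d, alpha);
    num_h[i] = thrust::reduce(terms.begin(), terms.end(), 0.f);  // summed on the device
}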

Thanks for the reply. I took the CUDA SDK code VecAdd and modified it to emulate a simplified version of my code. Can you tell me what I am missing? The outer loop is not iterating as I expect.

In C:

Matrix A: 1 2 3 4
Matrix B: 1 4 9 16
N = 4;

for (i = 0; i < N; i++)
{
    sum = 0;
    for (j = 0; j < N; j++)
    {
        sum += B[i] - A[j];
    }
    C[i] = sum;
}

Print Matrix C: -6 6 26 54 (Correct)
___________________________________________________________

In CUDA:

__global__ void VecAdd( float* A, float* B, float* C, int N, float *D)
{
    __shared__ float cache[256];
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    int cacheIndex = threadIdx.x;
    float sum;

    sum = 0.f;
    if (i < N && j < N)
    {
        sum += B[j] - A[i];
        // i += blockDim.x*gridDim.x;

        // set the cache values
        cache[cacheIndex] = sum;
        // synchronize threads in this block
        __syncthreads();

        int k = blockDim.x/2;
        while (k != 0)
        {
            if (cacheIndex < k)
                cache[cacheIndex] += cache[cacheIndex + k];
            __syncthreads();
            k /= 2;
        }

        if (cacheIndex == 0)
            C[blockIdx.x] = cache[0];
        // D[blockIdx.x] = blockIdx.x;
        // i += blockDim.x*gridDim.x;
        // j += blockDim.y*gridDim.y;
    }
}

Print Matrix C: -6 0 0 0 (Incomplete)

Hello,

How are you calling the VecAdd kernel?

Here is a slightly simpler version (not the most efficient, but it will do the job). Each i is calculated by one block.

__global__ void VecAdd( float* A, float* B, float* C, int N, float *D)
{
    __shared__ float cache[256];
    int i = blockIdx.x;
    int j = threadIdx.x;
    int cacheIndex = threadIdx.x;
    float sum = 0.f;

    if (i < N && j < N)
    {
        for (int jmore = j; jmore < N; jmore += 256)   // this loop is in case N > 256
        {
            sum += B[jmore] - A[i];
        }
    }

    // set the cache values (threads with j >= N contribute 0)
    cache[cacheIndex] = sum;

    // synchronize threads in this block; every thread must reach this point
    __syncthreads();

    int k = blockDim.x / 2;
    while (k != 0)
    {
        if (cacheIndex < k)
            cache[cacheIndex] += cache[cacheIndex + k];
        __syncthreads();
        k /= 2;
    }

    if (cacheIndex == 0)
        C[blockIdx.x] = cache[0];
}

On the host you launch it with

VecAdd<<<N,256>>>(arguments);

Please put your code between [code] [/code] tags. It will make it easier to read.

I think the issue is with the two-dimensional thread indexing. You set cacheIndex to threadIdx.x, but in the two-dimensional case several threads will have the same threadIdx.x, causing the values in cache to be overwritten. You need to change cacheIndex to make it unique for each thread, or make cache two-dimensional as well: __shared__ float cache[16][16]; and use cache[threadIdx.y][cacheIndex] += cache[threadIdx.y][cacheIndex + k];

When you finally update your coefficients with C[blockIdx.x]=cache[0], you are only writing one element, which is why the rest of them are zero.

As a side remark, I think the code will be faster if you transpose your matrix and make the reduction sum over blockIdx.y instead, which will avoid having unnecessary warp divergences.
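For what it's worth, a rough sketch of the two-dimensional-cache variant described above, assuming a 16x16 block and leaving the transpose idea aside; this is only an illustration, not tested code:

__global__ void VecAdd2D(float* A, float* B, float* C, int N)
{
    // One 16x16 tile of (i, j) pairs per block; each row of the shared cache
    // belongs to one threadIdx.y, so threads no longer overwrite each other.
    __shared__ float cache[16][16];
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    float sum = 0.f;
    if (i < N && j < N)
        sum = B[j] - A[i];

    cache[threadIdx.y][threadIdx.x] = sum;
    __syncthreads();

    // reduce along x within each row of the cache
    for (int k = blockDim.x / 2; k != 0; k /= 2)
    {
        if (threadIdx.x < k)
            cache[threadIdx.y][threadIdx.x] += cache[threadIdx.y][threadIdx.x + k];
        __syncthreads();
    }

    // with a single block in x this is the full sum for row j;
    // multiple blocks in x would need a further combine (e.g. atomicAdd)
    if (threadIdx.x == 0 && j < N)
        C[j] = cache[threadIdx.y][0];
}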

Thanks for all the help. The problem is fixed now. I will look into transposing the matrix and making the reduction sum over blockIdx.y instead. Once again, thank you so much.