I have written a code in C and I would like to rewrite it in CUDA to speed up the calculation. Below are the C code and the CUDA code. The program runs, but the results are not correct, and I think the problem is with the iteration. A couple of pointers on the iteration should put me back on the right track.

C code:

for (i = 1; i <= nodecount; ++i) {        /* outer loop: one result per node */
    xx = x_h[i-1];
    yy = y_h[i-1];
    sum = 0.;
    for (p = 1; p <= nodecount; ++p) {    /* inner loop: accumulate the RBF sum */
        d1 = xx - x_h[p-1];
        d2 = yy - y_h[p-1];
        factor = alpha * (pow(d1,2) + pow(d2,2));
        rbf[p-1] = exp(-factor);
        sum += rbf[p-1] * rhs_h[p-1];
    }
    numer[i-1] = sum;
}

CUDA code:

void gpuGaussian_solve(int N, float *num, float *soln, float2 *node, float alpha)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    kernel_solve<<<blocksPerGrid, threadsPerBlock>>>(N, num, soln, node, alpha);
}

__global__ void kernel_solve(int N, float* num, float* soln, float2 *node, float alpha)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;

    if (j < N)
    {
        num[j] = computeSoln(node[j], node[i], alpha, soln[i]);
    }
}

You seem to be passing host pointers directly to a kernel on the device. You need to copy the data from host to device memory first (and copy the result back to host memory) using cudaMemcpy().
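Inside gpuGaussian_solve, that would look roughly like this (a minimal sketch; the d_ pointer names are placeholders, not taken from your code):

float *d_num, *d_soln;
float2 *d_node;
size_t fBytes  = N * sizeof(float);
size_t f2Bytes = N * sizeof(float2);

cudaMalloc((void**)&d_num,  fBytes);
cudaMalloc((void**)&d_soln, fBytes);
cudaMalloc((void**)&d_node, f2Bytes);

/* host -> device before the launch */
cudaMemcpy(d_soln, soln, fBytes,  cudaMemcpyHostToDevice);
cudaMemcpy(d_node, node, f2Bytes, cudaMemcpyHostToDevice);

kernel_solve<<<blocksPerGrid, threadsPerBlock>>>(N, d_num, d_soln, d_node, alpha);

/* device -> host after the launch */
cudaMemcpy(num, d_num, fBytes, cudaMemcpyDeviceToHost);

cudaFree(d_num); cudaFree(d_soln); cudaFree(d_node);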

In the CPU version, you accumulate the value of sum in a for loop and assign the final value to numer[i-1]. In the CUDA version, you directly assign the value with num[j] = computeSoln(node[j], node[i], alpha, soln[i]);
This means that no sum is calculated; instead, all threads with an equal value of j will write their individual value to this position.

No. I have already copied the data from host to device memory. I am not getting compilation or execution errors, just incorrect results. The function gpuGaussian_solve(int N, float *num, float *soln, float2 *node, float alpha) is a C wrapper for the kernel (the .cu function) and it is called from a C program. The problem is how to get the inner iteration to produce the result of the calculation for every single outer iteration, as I have shown in the C program. Thanks.

The easiest solution would be to let one thread take one node and perform the entire inner loop by itself, but this limits the number of threads you can use to nodecount.
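Roughly along these lines (a minimal sketch, assuming num, soln, and node are already device pointers; kernel_solve_1d is just an illustrative name, launched with the same blocksPerGrid/threadsPerBlock configuration as your wrapper):

__global__ void kernel_solve_1d(int N, float *num, float *soln, float2 *node, float alpha)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;   /* one thread per node */
    if (i < N)
    {
        float sum = 0.f;
        for (int p = 0; p < N; ++p)                  /* entire inner loop in this thread */
        {
            float d1 = node[i].x - node[p].x;
            float d2 = node[i].y - node[p].y;
            sum += expf(-alpha * (d1 * d1 + d2 * d2)) * soln[p];
        }
        num[i] = sum;                                /* only this thread writes num[i] */
    }
}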

A third way would be to use atomicAdd() (which requires compute capability 2.0 for float) to perform the summation. You can of course combine this with the solution above if you want multiple blocks handling the same node, as communication through shared memory is usually faster than a large number of atomic additions in global memory.
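For illustration, a hedged sketch of the atomicAdd variant (kernel_solve_atomic is an invented name; num must be zeroed, e.g. with cudaMemset(), before the launch, and the kernel would be launched with a 2D grid such as dim3 block(16,16); dim3 grid((N+15)/16, (N+15)/16);):

__global__ void kernel_solve_atomic(int N, float *num, float *soln, float2 *node, float alpha)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* outer-loop index (node) */
    int p = blockIdx.x * blockDim.x + threadIdx.x;   /* inner-loop index */
    if (i < N && p < N)
    {
        float d1 = node[i].x - node[p].x;
        float d2 = node[i].y - node[p].y;
        /* each thread atomically contributes one term of the sum for node i */
        atomicAdd(&num[i], expf(-alpha * (d1 * d1 + d2 * d2)) * soln[p]);
    }
}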

You are right, but I am not sure how to fix it. In the CUDA code, I want to put the final value of sum in num[j]. I changed the iteration, but the result is not the same as numer[i-1] from the CPU. For example, where the CPU program gives me a result like 1.4E-14, the GPU gives me 1.4E-7. The disparity gets bigger as I increase the number of iterations (nodecount). Below is my new code.

__global__ void kernel_solve(int N, float* num, float* soln, float2 *node, float alpha)
{
    int i = blockIdx.x;   // Outer loop divided into blocks
    int j = threadIdx.x;  // Inner loop divided into threads
    if (i < N && j < N)
    {
        num[i] = computeSoln(node[i], node[j], alpha, soln[j]);
    }
}

__device__ float computeSoln(float2 row, float2 col, float alpha, float soln)
{
    float sum, d1, d2, d3, d4, xx, yy;
    xx = row.x;
    yy = row.y;
    sum = 0.f;
    d1 = xx - col.x;
    d2 = yy - col.y;
    d3 = alpha * (pow(d1,2) + pow(d2,2));
    d4 = exp(-d3);
    sum = sum + d4 * soln; // This is where the problem might be. I want sum to carry the final value at the end of the inner iteration.
    return sum;
}

Thanks for the reply. I took the VecAdd code from the CUDA SDK and modified it to emulate a simplified version of my code. Can you tell me what I am missing? The outer loop is not iterating as I expect.

Here is a slightly simpler version (not the most efficient, but it will do the job). Each i is calculated by one block.

__global__ void VecAdd( float* A, float* B, float* C, int N, float *D)
{
    __shared__ float cache[256];
    int i = blockIdx.x;
    int cacheIndex = threadIdx.x;
    float sum = 0.f;
    if (i < N)
    {
        for (int jmore = threadIdx.x; jmore < N; jmore += 256) // This loop is in case N > 256.
        {
            sum += B[jmore] - A[i];
        }
    }
    // set the cache values
    cache[cacheIndex] = sum;
    // synchronize threads in this block (every thread must reach this point)
    __syncthreads();
    // parallel reduction in shared memory
    int k = blockDim.x / 2;
    while (k != 0)
    {
        if (cacheIndex < k)
            cache[cacheIndex] += cache[cacheIndex + k];
        __syncthreads();
        k /= 2;
    }
    if (cacheIndex == 0)
        C[blockIdx.x] = cache[0];
}

On the host you launch it with

VecAdd<<<N, 256>>>(arguments);
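Carried back to the Gaussian problem from the start of the thread, the same block-per-i pattern would look roughly like this (a sketch only, not tested against your data; kernel_solve_reduce is an invented name, and num, soln, node are assumed to be device pointers):

__global__ void kernel_solve_reduce(int N, float *num, float *soln, float2 *node, float alpha)
{
    __shared__ float cache[256];
    int i = blockIdx.x;                          /* one block per node */
    float sum = 0.f;
    if (i < N)
    {
        for (int p = threadIdx.x; p < N; p += 256)   /* threads stride over the inner loop */
        {
            float d1 = node[i].x - node[p].x;
            float d2 = node[i].y - node[p].y;
            sum += expf(-alpha * (d1 * d1 + d2 * d2)) * soln[p];
        }
    }
    cache[threadIdx.x] = sum;
    __syncthreads();
    for (int k = blockDim.x / 2; k != 0; k /= 2)     /* reduce partial sums in shared memory */
    {
        if (threadIdx.x < k)
            cache[threadIdx.x] += cache[threadIdx.x + k];
        __syncthreads();
    }
    if (threadIdx.x == 0 && i < N)
        num[i] = cache[0];                           /* final sum for node i */
}

launched as kernel_solve_reduce<<<N, 256>>>(N, d_num, d_soln, d_node, alpha);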

Please put your code between [ code ] [ /code ] tags. It will make the reading easier.

I think the issue is with the two-dimensional thread indexing. You set cacheIndex to threadIdx.x, but in the two-dimensional case several threads will have the same threadIdx.x, causing the values in cache to be overwritten. You need to change cacheIndex to make it unique for each thread, or make cache two-dimensional as well: __shared__ float cache[16][16]; and use cache[threadIdx.y][cacheIndex] += cache[threadIdx.y][cacheIndex + k];
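A hedged skeleton of that two-dimensional variant, assuming 16x16 thread blocks (only the caching and reduction are the point here; the accumulation of sum and the output indexing are placeholders):

__global__ void reduce2d(float* C, int N)
{
    __shared__ float cache[16][16];
    int cacheIndex = threadIdx.x;
    float sum = 0.f;   /* assume each thread has accumulated its partial sum here */

    cache[threadIdx.y][cacheIndex] = sum;   /* one cache row per threadIdx.y */
    __syncthreads();

    int k = blockDim.x / 2;                 /* reduce each row along x */
    while (k != 0)
    {
        if (cacheIndex < k)
            cache[threadIdx.y][cacheIndex] += cache[threadIdx.y][cacheIndex + k];
        __syncthreads();
        k /= 2;
    }
    if (cacheIndex == 0)
        C[blockIdx.x * blockDim.y + threadIdx.y] = cache[threadIdx.y][0]; /* illustrative output indexing */
}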

When you finally update your coefficients with C[blockIdx.x] = cache[0], you are only writing one element, which is why the rest of them are zero.

As a side remark, I think the code will be faster if you transpose your matrix and make the reduction sum over blockIdx.y instead, which will avoid unnecessary warp divergence.

Thanks for all the help. The problem is fixed now. I will look into transposing the matrix and making the reduction sum over blockIdx.y instead. Once again, thank you so much.