OK, I’ve been pulling my hair out for a while now trying to get a small CUDA program going. I’ve run into more than a couple of roadblocks, not the least of which is my lack of C programming ability. What I’m trying to do in CUDA is multiply two equal-sized arrays element-wise and reduce the result. As plainly and simply as I can put it: I have two arrays of floats with the same number of elements; I need to multiply the elements with the same index from the two arrays together (x[i] * a[i]) and then sum/reduce the products. I hope that’s clear. I’ve been all through this message board and every other CUDA resource I could find, and I’ve been all over the reduction example provided in the SDK, but I still have only a tenuous grasp of the concepts (more than likely due to my ineptitude with C). I’ve tried to work off the source of the reduction example and mold it to my uses, to no avail…
Details:
H/W- octo Xeon setup @ 2 GHz, 4 GB RAM, GTX260 192
S/W- WinXP SP2, Visual Studio 2005, 178.28 drivers
Code- This is what I have, boiled down to the CUDA parts only. This code is changing practically by the minute now as I try different things almost at random to see what works (yeah… I am now that desperate to get this simple program to work right).
void CUDAexec(xmlNodePtr node, int sets, float data1[], float data2[], xmlNodeSetPtr nodeList){
int N = sets;
printf("syns: %i \n", N);
float nullResult=0.00f;
float* nullResultPtr= &nullResult;
size_t size = N*sizeof(float);
size_t size2 = sizeof(float);
float* d_data1;
cudaMalloc( (void**) &d_data1, size);
float* d_data2;
cudaMalloc( (void**) &d_data2, size);
float* d_temp;
cudaMalloc( (void**) &d_temp, size);
float* d_result;
cudaMalloc( (void**) &d_result, size2);
cudaMemcpy(d_data1, data1, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_data2, data2, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_result, nullResultPtr, size2, cudaMemcpyHostToDevice); // d_result holds a single float, so copy only sizeof(float)
int blockSize = 4;
int nBlocks = N/blockSize + (N%blockSize == 0?0:1);
testKernel <<< nBlocks, blockSize >>> ( d_data1, d_data2, d_temp, d_result, N);
float* h_result = (float*) malloc(sizeof(float));
cudaThreadSynchronize(); // wait for the kernel to finish before reading back
cudaMemcpy(h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
printf("result: %f \n", h_result[0]);
free(h_result);
cudaFree(d_data1); cudaFree(d_data2); cudaFree(d_temp); cudaFree(d_result);
}
// the kernel (the closest to working I've gotten, anyway)...
__global__ void
testKernel(float* g_data1, float* g_data2, float* g_temp, float* g_result, int N)
{
    const unsigned int tid = threadIdx.x;
    if (tid == 0) {
        // thread 0 tries to accumulate into element 0 and write out the result...
        g_data1[0] += g_data1[tid] * g_data2[tid];
        g_result[blockIdx.x] = g_data1[0];
    }
    else {
        // ...while the other threads shift their products down one slot
        g_data1[tid-1] = g_data1[tid] * g_data2[tid];
    }
    __syncthreads();
}
I loathe having to post to a message board to ask for help, but I have read everything available and really am out of ideas here. Any help would be greatly appreciated!!
P.S. I almost get the sequential-addressing version in the reduction example; it makes a lot of sense, but how exactly does this loop work?
for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
I’m struggling with that bitwise adjustment there (the s>>=1) — how does that work exactly?? More of a basic C question, I know, but still…