Hi all! I'm starting my Master's thesis on CUDA programming, with the goal of achieving faster acquisition times in a tomosynthesis machine. The problem is that I don't come from a computer science background (I'm a biomedical engineer), so my knowledge of programming languages is not that wide, and I think I'll be spending a lot of time in this forum.
I'm writing my first CUDA programs and I'm having a hard time getting good results. The program should receive a vector on the GPU and add up its elements (it is already optimized to avoid branch divergence):
__global__ void vecsum(float *vec){
    __shared__ float temp[16];
    int i;
    int idx=threadIdx.x+blockIdx.x*blockDim.x;
    temp[idx]=vec[idx];
    __syncthreads();
    for(i=blockDim.x;i>0;i=i/2)
        if(idx<blockDim.x)
            temp[idx]+=temp[idx+i];
    __syncthreads();
    if(idx == 0)
        vec[0] = temp[0];
}
int main(){
    float N[16]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
    float *vec_h=(float*)malloc(sizeof(float)*16);
    float *vec_d;
    cudaMalloc((void**)&vec_d,16*sizeof(float));
    cudaMemcpy(vec_d,N,16*sizeof(float),cudaMemcpyHostToDevice);
    vecsum<<<2,8>>>(vec_d);
    cudaMemcpy(vec_h,vec_d,16*sizeof(float),cudaMemcpyDeviceToHost);
    cudaFree(vec_d);
    printVec(vec_h,16);
}
The weird thing is that, besides being wrong, the result changes from one execution to the next:
{506.242615 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 }
{520.242615 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 }
I know that a zero-filled vector is a silly test, but I had already tried a vector [1-15], and all the elements came back right except, of course, the important one: the first.
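For comparison, here is a minimal sketch of how a single-block shared-memory tree reduction is commonly written. It is not meant as a definitive fix for the code above: it assumes one block of 16 threads and a power-of-two vector length, and the names `vecsumSketch`, `sumOnGpu` and `N_ELEMS` are made up for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define N_ELEMS 16  // assumed vector length; must be a power of two in this sketch

// Hypothetical single-block variant of vecsum.
// Each thread loads one element into shared memory, then on every pass the
// lower half of the active range adds in the upper half.
__global__ void vecsumSketch(float *vec) {
    __shared__ float temp[N_ELEMS];
    int tid = threadIdx.x;  // index inside the block (shared memory is per block)
    temp[tid] = vec[tid];
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            temp[tid] += temp[tid + stride];
        }
        __syncthreads();  // inside the loop body, reached by every thread each pass
    }

    if (tid == 0) {
        vec[0] = temp[0];  // thread 0 writes the total back to element 0
    }
}

// Hypothetical host helper: copies N_ELEMS floats to the GPU, launches the
// kernel as a single block, and returns the sum read back from element 0.
float sumOnGpu(const float *host) {
    float *vec_d = nullptr;
    float result = 0.0f;

    cudaError_t err = cudaMalloc((void **)&vec_d, N_ELEMS * sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMemcpy(vec_d, host, N_ELEMS * sizeof(float), cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    vecsumSketch<<<1, N_ELEMS>>>(vec_d);
    err = cudaGetLastError();  // reports launch configuration errors
    assert(err == cudaSuccess);

    // This blocking copy also waits for the kernel to finish.
    err = cudaMemcpy(&result, vec_d, sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;  // avoids an unused-variable warning when asserts are compiled out

    cudaFree(vec_d);
    return result;
}
```

With this sketch, calling `sumOnGpu` on a host array holding 0, 1, ..., 15 should return 120. The main differences from my version are that the shared array is indexed with `threadIdx.x` only, the loop starts at half the block size, and `__syncthreads()` sits inside the loop body.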
Thanks in advance.