Hi there!!!
I have this piece of code that calculates the maximum value among 16 threads. The problem is that I’m launching the kernel inside a “for” loop, and I think that with a little tweaking of the kernel I would not need to launch it this way. I just need help doing it.
The code goes like this:
DEVICE CODE
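__global__ void ReductionKernel(float* pVector, int N, float* bigger)
{
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float sh[16];

    // Load one element per thread; out-of-range slots are padded with 0.
    if (i < N)
        sh[tid] = pVector[i];
    else
        sh[tid] = 0;
    __syncthreads();

    // Tree reduction in shared memory: keep the larger value of each pair.
    for (int s = blockDim.x / 2; s > 0; s = s / 2) {
        if (tid < s)
            if (sh[tid + s] > sh[tid])
                sh[tid] = sh[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the block's maximum.
    if (tid == 0)
        bigger[0] = sh[tid];
}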
HOST CODE
I’m using this to launch it:
for (int i = 0; i < numvertex; i++)
{
    // Function that loads into vTemp the values for the kernel to calculate the maximum.
    radiusTemp(vTempHost, radiusHost, i);
    cudaMemcpy(vTempDevice, vTempHost, sizeTemp, cudaMemcpyHostToDevice);
    ReductionKernel<<<1, BLOCKSIZE>>>(vTempDevice, 16, biggestDevice);
    cudaMemcpy(vTempHost, vTempDevice, sizeTemp, cudaMemcpyDeviceToHost);
    cudaMemcpy(biggestHost, biggestDevice, sizeof(float), cudaMemcpyDeviceToHost);
    vectorRadiusTemp[i] = biggestHost[0];
}
If someone knows how to change this so that I do not need the “for” loop, please let me know. The loop is taking too long, and I think it is possible to optimize it, but I can’t figure out how.
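To make the question concrete, here is a minimal sketch of the kind of change being asked about: a single launch with one block per vertex, so that block b reduces its own 16 values and writes bigger[b]. It assumes the values produced by radiusTemp for every vertex can be packed into one contiguous host array before the launch, which the code above does not confirm. The names BatchedReductionKernel and batchedMax are hypothetical, BLOCKSIZE is assumed to be 16, and padding uses -FLT_MAX instead of 0 so that negative inputs would also work.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cfloat>
#include <cstddef>

#define BLOCKSIZE 16  // assumption: 16 values per vertex, one thread per value

// One block per vertex: block b reduces elements [b*N, (b+1)*N) of pVectors
// and writes the maximum of that slice to bigger[b].
__global__ void BatchedReductionKernel(const float* pVectors, int N, float* bigger)
{
    __shared__ float sh[BLOCKSIZE];
    unsigned int tid = threadIdx.x;
    const float* pVector = pVectors + (size_t)blockIdx.x * N;

    // Pad unused slots with -FLT_MAX so padding never wins the comparison.
    sh[tid] = (tid < (unsigned int)N) ? pVector[tid] : -FLT_MAX;
    __syncthreads();

    // Same tree reduction as the original kernel.
    for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s && sh[tid + s] > sh[tid])
            sh[tid] = sh[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        bigger[blockIdx.x] = sh[0];
}

// Hypothetical host helper: hostIn holds numVectors * N floats (all vertices
// packed back to back), hostOut receives numVectors maxima.
cudaError_t batchedMax(const float* hostIn, int numVectors, int N, float* hostOut)
{
    assert(N <= BLOCKSIZE);  // each block handles at most BLOCKSIZE values

    float* dIn = nullptr;
    float* dOut = nullptr;
    size_t inBytes = (size_t)numVectors * N * sizeof(float);
    size_t outBytes = (size_t)numVectors * sizeof(float);

    cudaError_t err = cudaMalloc(&dIn, inBytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&dOut, outBytes);
    if (err != cudaSuccess) { cudaFree(dIn); return err; }

    // One upload, one launch, one download for all vertices.
    cudaMemcpy(dIn, hostIn, inBytes, cudaMemcpyHostToDevice);
    BatchedReductionKernel<<<numVectors, BLOCKSIZE>>>(dIn, N, dOut);
    err = cudaMemcpy(hostOut, dOut, outBytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return err;
}
```

With something like this, the host would fill one array of numvertex * 16 floats, call batchedMax once, and read vectorRadiusTemp directly from the output array, replacing the per-vertex copies and launches.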