Hi,
I want to generalise a kernel that operates on a single block so that it works across multiple blocks. The original kernel can handle at most 1024 elements: as the code below shows, each thread accesses two elements, so 512 threads cover 1024 elements. I now need it to work for any number of elements. My idea is to calculate thid from blockIdx.x. I tried this, but I am not getting the correct result. I have written the program assuming n is a power of 2 (e.g. 2, 4, 8, 16).
Consider g_idata = [1 2 3 4 5 6 7 8], a block size of 4 (the number of elements per block), and blockDim.x = 2 (the number of threads per block; 2 threads are sufficient for 4 elements because each thread handles two). The output should be g_odata = [1 3 3 10 5 11 7 26]: block 0 should return 1 3 3 10, and block 1 should return 5 11 7 26.
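To make the expected result concrete, here is a minimal host-side sketch that reproduces it on the CPU. It is plain C++ that also compiles inside a .cu file. The function name referenceBlockUpSweep and its elemsPerBlock parameter are assumptions made for illustration, and the sketch assumes elemsPerBlock is a power of 2; a trailing partial segment is left unchanged.

```cuda
#include <cstddef>
#include <vector>

// Host-side reference (hypothetical helper, not part of the kernels in this post).
// Applies the "build sum in place up the tree" step independently to each
// consecutive segment of elemsPerBlock elements. Assumes elemsPerBlock is a
// power of 2; a trailing partial segment is left untouched.
std::vector<float> referenceBlockUpSweep(std::vector<float> data, std::size_t elemsPerBlock)
{
    for (std::size_t base = 0; base + elemsPerBlock <= data.size(); base += elemsPerBlock) {
        // At each tree level, add the left partner (stride positions back)
        // into the right partner, exactly like temp[bi] += temp[ai].
        for (std::size_t stride = 1; stride < elemsPerBlock; stride *= 2) {
            for (std::size_t i = 2 * stride - 1; i < elemsPerBlock; i += 2 * stride) {
                data[base + i] += data[base + i - stride];
            }
        }
    }
    return data;
}
```

Calling referenceBlockUpSweep({1, 2, 3, 4, 5, 6, 7, 8}, 4) gives 1 3 3 10 5 11 7 26, which can be compared element by element against g_odata copied back from the device.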
I introduced int thid = blockIdx.x * blockDim.x + threadIdx.x; in my modified kernel, but I am getting the wrong output: with n = 8 elements in total, only the first 3 indices are correct. I would be extremely thankful for any hints.
Thank you very much for your precious time.
Original Kernel
__global__ void prescan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int offset = 1;
    // load the data; each thread accesses two elements
    int ai = thid;
    int bi = thid + (n/2);
    temp[ai] = g_idata[ai];
    temp[bi] = g_idata[bi];
    for (int d = n>>1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    __syncthreads();
    g_odata[ai] = temp[ai];
    g_odata[bi] = temp[bi];
}
I changed the code as shown below, but it gives wrong values.
Modified Kernel
__global__ void Modifiedprescan(float *g_odata, float *g_idata, unsigned int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = blockIdx.x * blockDim.x + threadIdx.x;
    int offset = 1;
    // load data into shared memory
    int ai = thid;
    int bi = thid + n/2;
    temp[ai] = g_idata[ai];
    temp[bi] = g_idata[bi];
    int thid1 = threadIdx.x; // Note this; I introduced thid1
    for (int d = 2*blockDim.x >>1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (thid1 < d)
        {
            int ai = offset*(2*thid1+1)-1;
            int bi = offset*(2*thid1+2)-1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    __syncthreads();
    g_odata[ai] = temp[ai];
    g_odata[bi] = temp[bi];
}
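For comparison, here is one possible way to index per block, offered as a sketch under assumptions rather than a definitive fix. The key point it illustrates is that shared memory is private to each block, so temp is indexed with block-local positions (threadIdx.x and threadIdx.x + blockDim.x), while only the global arrays use a per-block base offset of blockIdx.x * 2 * blockDim.x. The names blockUpSweep and runBlockUpSweep are hypothetical; the sketch assumes blockDim.x is a power of 2, that the launch passes 2 * blockDim.x * sizeof(float) bytes of dynamic shared memory, and it omits CUDA error checking for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: runs the "build sum in place up the tree" step on each
// block's own segment of 2 * blockDim.x elements. Assumes blockDim.x is a power of 2.
__global__ void blockUpSweep(float *g_odata, const float *g_idata, unsigned int n)
{
    extern __shared__ float temp[]; // 2 * blockDim.x floats, sized at launch

    const unsigned int tid  = threadIdx.x;                 // block-local thread index
    const unsigned int base = blockIdx.x * 2 * blockDim.x; // first global element of this block

    // Block-local positions of the two elements this thread loads and stores.
    const unsigned int ai = tid;
    const unsigned int bi = tid + blockDim.x;

    // Guarded loads: slots past the end of the input are padded with 0.
    temp[ai] = (base + ai < n) ? g_idata[base + ai] : 0.0f;
    temp[bi] = (base + bi < n) ? g_idata[base + bi] : 0.0f;

    unsigned int offset = 1;
    for (unsigned int d = blockDim.x; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            const unsigned int lo = offset * (2 * tid + 1) - 1;
            const unsigned int hi = offset * (2 * tid + 2) - 1;
            temp[hi] += temp[lo];
        }
        offset *= 2;
    }
    __syncthreads();

    // Guarded stores back to this block's segment of global memory.
    if (base + ai < n) g_odata[base + ai] = temp[ai];
    if (base + bi < n) g_odata[base + bi] = temp[bi];
}

// Hypothetical host helper: copies the input to the device, launches one block
// per 2 * threadsPerBlock elements, and returns the result. No error checking.
std::vector<float> runBlockUpSweep(const std::vector<float> &in, unsigned int threadsPerBlock)
{
    const unsigned int n = static_cast<unsigned int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    const unsigned int elemsPerBlock = 2 * threadsPerBlock;
    const unsigned int blocks = (n + elemsPerBlock - 1) / elemsPerBlock;
    const size_t bytes = n * sizeof(float);

    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    blockUpSweep<<<blocks, threadsPerBlock, elemsPerBlock * sizeof(float)>>>(d_out, d_in, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

With the example from this post, runBlockUpSweep({1, 2, 3, 4, 5, 6, 7, 8}, 2) launches 2 blocks of 2 threads and should return 1 3 3 10 5 11 7 26. Note that this only makes each block process its own segment independently; combining the per-block totals across blocks would need an additional step that is not shown here.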