Help needed in generalising a kernel from a single block to multiple blocks

Hi,

I want to convert a kernel that operates on a single block into one that operates on multiple blocks, i.e. to generalise it. Basically, I am doing this because the original kernel can handle only 1024 elements (as can be seen in the code below, one thread accesses two elements, so 512 threads cover 1024 elements), and my requirement now is to make it valid for any number of elements. The idea is to calculate thid based on blockIdx.x. I tried it, but I am not getting the correct result. I have written this program assuming n to be a power of 2 (e.g. 2, 4, 8, 16, etc.).

If we consider g_idata = [1 2 3 4 5 6 7 8], block size = 4 (number of elements per block) and blockDim.x = 2 (the total number of threads per block; note that 2 threads are sufficient for 4 elements, which is the block size), the output should be g_odata = [1 3 3 10 5 11 7 26]: block 0 should return 1 3 3 10, while block 1 should return 5 11 7 26.

I introduced int thid = blockIdx.x * blockDim.x + threadIdx.x; in my modified kernel, but I am getting wrong output. Moreover, it gives the correct result only for the first 3 indices, assuming n = 8 elements in total. I shall be extremely thankful to you guys if you could give me some hint here.

Thank you very much for your precious time.

Original Kernel

__global__ void prescan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation

    int thid = threadIdx.x;
    int offset = 1;

    // Loading the data; one thread accesses two elements.
    int ai = thid;
    int bi = thid + (n / 2);
    temp[ai] = g_idata[ai];
    temp[bi] = g_idata[bi];

    for (int d = n >> 1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset * (2 * thid + 1) - 1; // these shadow the outer ai/bi
            int bi = offset * (2 * thid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    __syncthreads();

    // Write back using the outer (load) indices.
    g_odata[ai] = temp[ai];
    g_odata[bi] = temp[bi];
}
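For reference, this single-block kernel is launched with one block of n/2 threads and n floats of dynamic shared memory, along these lines (a minimal sketch, assuming g_idata and g_odata are already allocated on the device):

prescan<<<1, n / 2, n * sizeof(float)>>>(g_odata, g_idata, n); // one block covers all n elements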

I changed the code as below, but it is giving wrong values.

Modified Kernel

__global__ void Modifiedprescan(float *g_odata, float *g_idata, unsigned int n)
{
    extern __shared__ float temp[]; // allocated on invocation

    int thid = blockIdx.x * blockDim.x + threadIdx.x;
    int offset = 1;

    // Loading data into shared memory
    int ai = thid;
    int bi = thid + n / 2;
    temp[ai] = g_idata[ai];
    temp[bi] = g_idata[bi];

    int thid1 = threadIdx.x; // note this; I introduced thid1

    for (int d = 2 * blockDim.x >> 1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (thid1 < d)
        {
            int ai = offset * (2 * thid1 + 1) - 1;
            int bi = offset * (2 * thid1 + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    __syncthreads();

    g_odata[ai] = temp[ai];
    g_odata[bi] = temp[bi];
}

One more thing: in the original kernel the first for loop has d = n >> 1. In my modified kernel I changed it to d = 2*blockDim.x >> 1. This is because in the original kernel n is the total number of elements in the single block; along the same lines I concluded d = 2*blockDim.x >> 1, where 2*blockDim.x is the total number of elements in one block.
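(In case the precedence is not obvious: * binds tighter than >>, so this initial value is (2*blockDim.x) >> 1, which is simply blockDim.x. A quick host-side check, purely my own illustration:)

#include <cstdio>

int main()
{
    int blockDim_x = 2;                  // stand-in for blockDim.x
    printf("%d\n", 2 * blockDim_x >> 1); // prints 2, i.e. (2*2) >> 1
    return 0;
}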

“thid” calculation is NOT considering the “2” factor (i.e. a block accesses blockDim.x*2 elements). Is that ok?

Thank you so much, Sarnarth, for your reply.

Well, let us consider n = 8, block size = 4 => blockDim.x = 2. We will need 4 threads in total to cover all 8 elements. Let us apply these values and see if we can do this using the formula:

thid = blockIdx.x * blockDim.x + threadIdx.x

blockIdx.x = 0:

thid = 0 * 2 + 0 = 0

thid = 0 * 2 + 1 = 1

blockIdx.x = 1:

thid = 1 * 2 + 0 = 2

thid = 1 * 2 + 1 = 3

Thus thid = 0, 1, 2, 3, exactly what we had in the original kernel.
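Just to double-check that arithmetic, here is a tiny host-side loop (my own illustration, not part of the kernel) that prints the thid produced by each (blockIdx.x, threadIdx.x) pair:

#include <cstdio>

int main()
{
    const int blockDim_x = 2; // threads per block
    for (int blockIdx_x = 0; blockIdx_x < 2; ++blockIdx_x)
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x)
            printf("block %d, thread %d -> thid = %d\n",
                   blockIdx_x, threadIdx_x,
                   blockIdx_x * blockDim_x + threadIdx_x); // prints thid = 0, 1, 2, 3
    return 0;
}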

Here is the host side of the code:

int main()
{
    unsigned int num_Elements, blockSize; // number of data elements, elements per block

    printf("Please enter the number of elements: ");
    scanf("%u", &num_Elements);
    printf("Please enter the block size: ");
    scanf("%u", &blockSize);

    size_t size = sizeof(float) * num_Elements; // size of memory
    size_t sizeBlockSums = sizeof(float) * 1;   // (declared for later use; unused so far)

    float *g_odata, *g_idata, *m_odata, *m_idata; // device and host pointers
    // g --> global (device) memory, m --> main (host) memory

    m_idata = (float*)malloc(size); // memory allocation on host
    m_odata = (float*)malloc(size);

    cudaMalloc((void**)&g_idata, size); // memory allocation on device
    cudaMalloc((void**)&g_odata, size);

    // Initialisation of elements
    for (unsigned int i = 0; i < num_Elements; i++)
    {
        m_idata[i] = i + 1;
    }

    // Copy m_idata from host memory into g_idata on the device
    cudaMemcpy(g_idata, m_idata, size, cudaMemcpyHostToDevice);

    int nBlocks = num_Elements / blockSize; // number of blocks
    unsigned int nThreads = blockSize / 2;  // blockDim.x; /2 as one thread loads two elements
    printf("\nNumber of blocks is %d\nand number of threads per block (block dimension) is %u\n",
           nBlocks, nThreads);

    Modifiedprescan<<<nBlocks, nThreads, size>>>(g_odata, g_idata, num_Elements); // call the kernel

    // Copy g_odata from device memory back into m_odata on the host
    cudaMemcpy(m_odata, g_odata, size, cudaMemcpyDeviceToHost);

    for (unsigned int i = 0; i < num_Elements; i++)
    {
        printf("m_odata[%u] = %f\n", i, m_odata[i]);
    }

    // Free host and device memory
    cudaFree(g_idata);
    cudaFree(g_odata);
    free(m_idata);
    free(m_odata);
}

Let us analyse the entire program by calculating the value of each variable.

g_idata = [1 2 3 4 5 6 7 8], block size = 4 (number of elements per block) and blockDim.x = 2 (the total number of threads per block; 2 threads are sufficient for 4 elements, which is the block size). The output should be g_odata = [1 3 3 10 5 11 7 26]: block 0 should return 1 3 3 10, while block 1 should return 5 11 7 26.

Let us analyse the portion of the code (my modified kernel) shown below for these values:

__global__ void Modifiedprescan(float *g_odata, float *g_idata, unsigned int n)
{
    extern __shared__ float temp[]; // allocated on invocation

    int thid = blockIdx.x * blockDim.x + threadIdx.x;
    int offset = 1;

    // Loading data into shared memory
    int ai = thid;
    int bi = thid + n / 2;
    temp[ai] = g_idata[ai];
    temp[bi] = g_idata[bi];

As described in the previous reply, I have calculated thid as:

thid = 0, 1, 2, 3

Using these thid values, let us calculate ai, bi, temp[ai] and temp[bi] as per the code above.

thid = 0

ai = 0, bi = 0 + n/2 = 0 + 8/2 = 4

temp[0] = g_idata[0] = 1

temp[4] = g_idata[4] = 5

thid = 1

ai = 1, bi = 1 + 8/2 = 5

temp[1] = g_idata[1] = 2

temp[5] = g_idata[5] = 6

thid = 2

ai = 2, bi = 2 + 8/2 = 6

temp[2] = g_idata[2] = 3

temp[6] = g_idata[6] = 7

thid = 3

ai = 3, bi = 3 + 8/2 = 7

temp[3] = g_idata[3] = 4

temp[7] = g_idata[7] = 8

Thus we see that all the values from global memory have been loaded successfully into shared memory, and my modified kernel is working fine up to this point. Now temp holds this data: temp = [1 2 3 4 5 6 7 8].

Now let us analyse the remaining portion of the code based on the values calculated for temp above:

    int thid1 = threadIdx.x; // note this; I introduced thid1

    for (int d = 2 * blockDim.x >> 1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (thid1 < d)
        {
            int ai = offset * (2 * thid1 + 1) - 1;
            int bi = offset * (2 * thid1 + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    __syncthreads();

    g_odata[ai] = temp[ai];
    g_odata[bi] = temp[bi];
}

The initial value of d is 2 * blockDim.x >> 1 = (2 * 2) >> 1 = 4 >> 1 = 2 (>> 1 is nothing but division by 2), which is greater than 0, so the statements in the for loop will execute as shown below.

Also, thid1 will assume only two values, 0 and 1 (2 threads per block, one thread loading two elements; in the original code it took 4 values: 0, 1, 2, 3).

Result of first iteration

offset = 1

d = 2

thid1 = 0, 1

if (thid1 < d) holds true for both 0 and 1, so threads 0 and 1 will both work on the data:

thid1 = 0

ai = offset * (2*thid1 + 1) - 1 = 1 * (2*0 + 1) - 1 = 0

bi = offset * (2*thid1 + 2) - 1 = 1 * (2*0 + 2) - 1 = 1

temp[bi] = temp[bi] + temp[ai];

temp[1] = temp[1] + temp[0] = 2 + 1 = 3

thid1 = 1

ai = offset * (2*thid1 + 1) - 1 = 1 * (2*1 + 1) - 1 = 2

bi = offset * (2*thid1 + 2) - 1 = 1 * (2*1 + 2) - 1 = 3

temp[bi] = temp[bi] + temp[ai];

temp[3] = temp[3] + temp[2] = 4 + 3 = 7

Thus after the first iteration we have the following values in temp: [1 3 3 7 5 6 7 8] (note that I am concentrating only on block 0 here).

Result of second iteration

offset = 2

d = 1

thid1 = 0, 1

if (thid1 < d) holds true only for 0, so only thread 0 will work on the data:

thid1 = 0

ai = offset * (2*thid1 + 1) - 1 = 2 * (2*0 + 1) - 1 = 1

bi = offset * (2*thid1 + 2) - 1 = 2 * (2*0 + 2) - 1 = 3

temp[bi] = temp[bi] + temp[ai];

Using the updated values of temp from above:

temp[3] = temp[3] + temp[1] = 7 + 3 = 10

The loop will stop here. The first four values in temp, and thus in g_odata, SHOULD be 1 3 3 10.
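To convince myself that this trace is right, here is a small host-only simulation of the up-sweep loop on block 0's chunk (my own sanity check; it runs the kernel's index arithmetic sequentially):

#include <cstdio>

// Sequential host-side simulation of the up-sweep on block 0's data.
int main()
{
    float temp[4] = {1, 2, 3, 4}; // block 0's chunk
    const int blockDim_x = 2;     // threads per block
    int offset = 1;
    for (int d = 2 * blockDim_x >> 1; d > 0; d >>= 1)
    {
        for (int thid1 = 0; thid1 < d; ++thid1) // each "thread" in turn
        {
            int ai = offset * (2 * thid1 + 1) - 1;
            int bi = offset * (2 * thid1 + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    for (int i = 0; i < 4; ++i)
        printf("%g ", temp[i]); // prints: 1 3 3 10
    return 0;
}

So the index arithmetic itself does produce 1 3 3 10 when run sequentially on [1 2 3 4].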

But when I compile and run this I get the values 1 3 3 256; basically, temp[3] is garbage. I am unable to figure out why this discrepancy between the theoretical and the actual values is coming up. For the time being I am not bothering about temp[4], temp[5], temp[6], temp[7], which are in block 1.
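One thing I want to check with you: each block gets its own copy of extern __shared__ temp[], so I now suspect the shared-memory indices must be block-local, with only the global loads and stores using a per-block offset. Here is a sketch of what I mean (my guess at the generalisation, untested; the kernel name is my own):

__global__ void prescanMultiBlock(float *g_odata, float *g_idata)
{
    extern __shared__ float temp[];       // one copy per block

    int thid1 = threadIdx.x;              // block-local thread index
    int nPerBlock = 2 * blockDim.x;       // elements handled by this block
    int blockOffset = blockIdx.x * nPerBlock;
    int offset = 1;

    int ai = thid1;                       // block-local shared-memory indices
    int bi = thid1 + blockDim.x;
    temp[ai] = g_idata[blockOffset + ai]; // only the global indices carry the offset
    temp[bi] = g_idata[blockOffset + bi];

    for (int d = nPerBlock >> 1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (thid1 < d)
        {
            int ai = offset * (2 * thid1 + 1) - 1;
            int bi = offset * (2 * thid1 + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    __syncthreads();

    g_odata[blockOffset + ai] = temp[ai];
    g_odata[blockOffset + bi] = temp[bi];
}

If that indexing is right, the launch would pass blockSize * sizeof(float) as the dynamic shared-memory size (each block only needs its own chunk), not the full array size. Does this look like the right direction?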

I shall be grateful to you for your help here.