numblocks and threads allocation rule?


I have an array of 800*4096 elements. My original C program has a three-level nested loop:

for(i = 0; i < 800; ++i)
{
	for(j = 0; j < 64; ++j)
	{
		for(k = 0; k < 64; ++k)
		{
			x1 = (i-(31*(int)(i/31)))*15 + k + 2048;
			y1 = (i/31)*15 + j + 2048;
			index = (y1-(2048*(int)(y1/2048)))*2048 + x1-(2048*(int)(x1/2048));
			out1[i*4096+j*64+k] = in1[index];
		}
	}
}

CUDA:

I am confused on how to break this into blocks.

I tried to do a calculation using 64 blocks and 64 threads and iterate through them 800 times.

__shared__ int x[64];
__shared__ int y[64];
__shared__ int indx[64];
int val = 0;
int tid = blockIdx.x * blockDim.x + threadIdx.x;

for(int i = 0; i < 800; ++i)
{
	if(threadIdx.x < 64)
	{
		x[threadIdx.x] = (i-(31*(int)(i/31)))*15 + threadIdx.x + 2048;
		y[threadIdx.x] = (i/31)*15 + blockIdx.x + 2048;
		indx[threadIdx.x] = (y[threadIdx.x] - (2048*(int)(y[threadIdx.x]/2048)))*2048 + (x[threadIdx.x] - (2048*(int)(x[threadIdx.x]/2048)));
		val = indx[threadIdx.x];
	}
	__syncthreads();
}

out[tid] = in[val];

Any suggestions?


Using shared memory and the [font=“Courier New”]__syncthreads()[/font] are unnecessary. If you only ever call the kernel with a blocksize of 64, the [font=“Courier New”]if(threadIdx.x < 64)[/font] isn’t needed either.

64 blocks provide only enough parallelism to saturate entry-level GPUs, so it's much better to use blockIdx.x as i.

The CUDA version doesn’t do the same as the serial version, as the latter stores the results for all [font=“Courier New”]i[/font] but the former only for [font=“Courier New”]i=799[/font].

In general I would keep the CUDA version as close to the serial version as possible. No need to rename variables between them.

unsigned int i = blockIdx.x;
unsigned int k = threadIdx.x;
unsigned int x1 = (i%31)*15 + k + 2048;
unsigned int y1 = (i/31)*15 + 2048;

#pragma unroll 8
for(unsigned int j = 0; j < 64; ++j)
{
        unsigned int index = ((y1++ & 0x7ff) << 11) + (x1 & 0x7ff);
        out1[i*4096+j*64+k] = in1[index];
}
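For context, here is a minimal sketch of how this snippet could be wrapped into a kernel and launched with one block per value of i (the kernel name, parameter names, and host-side launch line are illustrative, not prescribed anywhere in this thread):

__global__ void remap(const int *in1, int *out1)
{
        unsigned int i = blockIdx.x;
        unsigned int k = threadIdx.x;
        unsigned int x1 = (i%31)*15 + k + 2048;
        unsigned int y1 = (i/31)*15 + 2048;

        #pragma unroll 8
        for(unsigned int j = 0; j < 64; ++j)
        {
                unsigned int index = ((y1++ & 0x7ff) << 11) + (x1 & 0x7ff);
                out1[i*4096+j*64+k] = in1[index];
        }
}

// Host side: one block per i, one thread per k (d_in1/d_out1 are device pointers).
// remap<<<800, 64>>>(d_in1, d_out1);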

What's in the memory location 0x7ff? How did you do that?

unsigned int index = ((y1++ & 0x7ff) << 11) + (x1 & 0x7ff);

[font=“Courier New”]x1-(2048*(int)(x1/2048))[/font] is the same as [font=“Courier New”]x1 % 2048[/font], which (because 2048 is a power of two) is the same as [font=“Courier New”]x1 & 2047[/font], which I just wrote as [font=“Courier New”]x1 & 0x7ff[/font] to make it more obvious that 2047 is one less than a power of two.
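As a quick sanity check (just a toy host-side snippet, not part of the original posts), all three forms agree for non-negative values:

#include <assert.h>

int main(void)
{
        for (unsigned int x1 = 0; x1 < 10000; ++x1) {
                unsigned int a = x1 - (2048*(x1/2048)); /* original subtract-a-multiple form */
                unsigned int b = x1 % 2048;             /* remainder */
                unsigned int c = x1 & 0x7ff;            /* bit mask; 0x7ff == 2047 */
                assert(a == b && b == c);
        }
        return 0;
}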

@ tera: Thanks for the clarification. Now I am confused about my 800 iterations.
I have 64 blocks and 64 threads, and they should undergo 800 iterations as per variable ‘i’ in the CPU code.

My example assumes it is now called with a gridsize of 800. If the number of blocks is fixed at 64, you should swap back [font=“Courier New”]i[/font] and [font=“Courier New”]j[/font]. But keep in mind that 128 warps per kernel aren't enough for anything but low-end cards.
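For illustration, a minimal sketch of that fixed-grid variant, assuming a <<<64, 64>>> launch (the kernel and parameter names are hypothetical): blockIdx.x takes the role of j, threadIdx.x the role of k, and i becomes the loop inside the kernel.

__global__ void remap_fixed_grid(const int *in1, int *out1)
{
        unsigned int j = blockIdx.x;   // 64 blocks
        unsigned int k = threadIdx.x;  // 64 threads per block

        for(unsigned int i = 0; i < 800; ++i)
        {
                unsigned int x1 = (i%31)*15 + k + 2048;
                unsigned int y1 = (i/31)*15 + j + 2048;
                unsigned int index = ((y1 & 0x7ff) << 11) + (x1 & 0x7ff);
                out1[i*4096+j*64+k] = in1[index];
        }
}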

@ tera:

I get it now :) I am using a Fermi C2050 GPU.

I initialize a grid of 800 blocks with 64 threads each.

The x1 calculation is hoisted out of the loop and y1 is computed per iteration; here's how I did it:

unsigned int i = blockIdx.x;
unsigned int k = threadIdx.x;
unsigned int x1 = (i%31)*15 + k + 2048;

#pragma unroll 8
for(unsigned int j = 0; j < 64; ++j)
{
        unsigned int y1 = (i/31)*15 + j + 2048;
        unsigned int index = ((y1++ & 0x7ff) << 11) + (x1 & 0x7ff);
        out1[i*4096+j*64+k] = in1[index];
}

Question: why #pragma unroll 8?

Loop unrolling helps here because it allows the compiler to move the loads from memory forward, so that more memory transactions are in flight at any time.
You might have to experiment a bit with it to find the optimal value. I've just chosen 8 as a reasonable default, and I'd expect the throughput to change only slightly with small variations.

EDIT: I had moved the calculation of y1 with the expensive integer division out of the loop, as I wasn't sure the compiler would do it. You might want to check the .ptx or .cubin, or time the code, to verify that it does. The code is probably memory bandwidth bound even with the expensive integer division, though.
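If it helps, here is a minimal host-side timing sketch using CUDA events; the kernel name, arguments and launch configuration are placeholders for whatever you are actually running, and it assumes cstdio plus the CUDA runtime headers are included. The generated PTX can also be dumped with nvcc's -ptx option if you want to check whether the division really gets hoisted.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
remap<<<800, 64>>>(d_in1, d_out1);   // placeholder kernel and arguments
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);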

I am trying to see whether using shared memory will give some speedup:

unsigned int i = blockIdx.x;
unsigned int k = threadIdx.x;

__shared__ int x1[64];
__shared__ int y1[64];

int tid = blockIdx.x*64 + threadIdx.x; // assuming that we have 800 blocks with 64 threads each

x1[tid] = (i%31)*15 + k + 2048;

#pragma unroll 8
for(unsigned int j = 0; j < 64; ++j)
{
        y1[tid] = (i/31)*15 + j + 2048;
        unsigned int index = ((y1[tid]++ & 0x7ff) << 11) + (x1[tid] & 0x7ff);
        out1[i*4096+j*64+k] = in1[index];
}

I tried this but it failed with a seg fault. Any suggestions on how to use shared memory here, or pointers on how to use it correctly? Without shared memory my CUDA code takes 9 ms, which is better than the 76 ms taken by the CPU code.

Replacing registers with shared memory will only slow things down.

As your code is memory bandwidth bound, you can only expect a speedup corresponding to the memory bandwidth ratio between GPU and CPU. So a speedup of about 8.5 is already quite good.
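To put rough numbers on that (ballpark figures, not from the thread): the C2050 has a peak memory bandwidth of about 144 GB/s, while a dual-channel DDR3 host of the same era sits somewhere around 15 to 20 GB/s, so a ratio in the region of 8 to 10 is what a bandwidth-bound kernel should show, and 76 ms / 9 ms ≈ 8.4 fits that.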