Optimization Issues

Hi,

To run some tests, I wrote the following kernel:

__global__ void access(float *pfIn, unsigned long ulCount, unsigned long ulValues, unsigned long ulStride, unsigned long ulOffset)
{
	unsigned long lulI = blockIdx.x * blockDim.x + threadIdx.x;
	float lfGesamt = 0;

	if (lulI < ulCount)
	{
		unsigned long pos = (lulI*ulStride+ulOffset)%ulCount;

		// Read the same element ulValues times.
		for (unsigned long lulJ = 0; lulJ < ulValues; lulJ++)
		{
			lfGesamt += pfIn[pos];
		}
		// Note: lfGesamt is never written back to memory.
	}
}
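
A minimal sketch of the host-side timing harness such a test typically uses (the array size, launch configuration, parameter values, and cudaEvent-based timing here are assumptions; the original measurement code is not part of the post):

// Hypothetical host-side harness (sizes and parameters assumed).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	const unsigned long ulCount = 1UL << 22;          // 4M floats, assumed
	float *pfIn = NULL;
	cudaMalloc((void**)&pfIn, ulCount * sizeof(float));
	cudaMemset(pfIn, 0, ulCount * sizeof(float));

	dim3 block(256);
	dim3 grid((unsigned int)((ulCount + block.x - 1) / block.x));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	access<<<grid, block>>>(pfIn, ulCount, 10, 1, 0); // ulValues=10, ulStride=1, ulOffset=0
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float fMs = 0.0f;
	cudaEventElapsedTime(&fMs, start, stop);

	// 10 reads of 4 bytes per thread; fMs is milliseconds, so /1e6 yields GB/s.
	double dBytes = (double)ulCount * 10 * sizeof(float);
	printf("%.2f GB/s\n", dBytes / (fMs * 1e6));

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(pfIn);
	return 0;
}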

To make it do exactly what the code says, I turned off optimization by passing --opencc-options -O0, which has worked well so far.

The problem: I can’t see any effect of either the “stride” or the “offset” parameter; the bandwidth is always the same. I also don’t see any effect of the L1 and L2 caches of my GT 540M: the calculated bandwidth is the same whether I read a value once or ten times.

What’s the matter?

[font=“Courier New”]--opencc-options -O0[/font] does not turn off all optimization, as quite a bit of optimization is going on inside ptxas.
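
The optimization level can be passed through to ptxas as well, e.g.:

nvcc --opencc-options -O0 --ptxas-options -O0 access.cu

Even then there is no guarantee that loads whose results are never stored survive into the machine code.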

Optimization is taken care of when I generate GPU debug information (at least I hope so).

When I don’t generate the debug info, I measure a bandwidth of 200 GB/s on a 9800 GT, while the specification says it has about 58 GB/s…

Either I am doing something incredibly wrong, or the code is still being optimized too aggressively.

The other issue is the apparent lack of caching and the identical bandwidth regardless of stride and offset.

But the stride effect probably evens out, since every block only reads its own block of memory, and the offset should cost at most one extra coalesced access on compute capability 2.1 (e.g. with a 4-byte offset, a warp’s 128-byte read is split across two segments instead of one).

So the main remaining problem is the identical bandwidth between 1 and 10 reads, which should be served from the cache.

Your kernel is accessing the same position of global memory over and over, meaning you are possibly getting cache hits, which would give you a higher effective bandwidth. You should run this through the Visual Profiler and see what your cache hit rate is.

The 9800 GT shouldn’t have any sort of cache except the texture cache.

I have modified the kernel:

__global__ void access(float *pfOut, float *pfIn, unsigned long ulCount, unsigned long ulValues, unsigned long ulStride, unsigned long ulOffset)
{
	unsigned long lulI = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned long laulPos[16];
	float lfGesamt = 0;

	if (lulI < ulCount)
	{
		// Precompute 16 read positions, spaced 16 elements (64 bytes) apart.
		laulPos[ 0] = (lulI*ulStride+ulOffset      )%ulCount;
		laulPos[ 1] = (lulI*ulStride+ulOffset+ 1*16)%ulCount;
		laulPos[ 2] = (lulI*ulStride+ulOffset+ 2*16)%ulCount;
		laulPos[ 3] = (lulI*ulStride+ulOffset+ 3*16)%ulCount;
		laulPos[ 4] = (lulI*ulStride+ulOffset+ 4*16)%ulCount;
		laulPos[ 5] = (lulI*ulStride+ulOffset+ 5*16)%ulCount;
		laulPos[ 6] = (lulI*ulStride+ulOffset+ 6*16)%ulCount;
		laulPos[ 7] = (lulI*ulStride+ulOffset+ 7*16)%ulCount;
		laulPos[ 8] = (lulI*ulStride+ulOffset+ 8*16)%ulCount;
		laulPos[ 9] = (lulI*ulStride+ulOffset+ 9*16)%ulCount;
		laulPos[10] = (lulI*ulStride+ulOffset+10*16)%ulCount;
		laulPos[11] = (lulI*ulStride+ulOffset+11*16)%ulCount;
		laulPos[12] = (lulI*ulStride+ulOffset+12*16)%ulCount;
		laulPos[13] = (lulI*ulStride+ulOffset+13*16)%ulCount;
		laulPos[14] = (lulI*ulStride+ulOffset+14*16)%ulCount;
		laulPos[15] = (lulI*ulStride+ulOffset+15*16)%ulCount;

		// ulValues must not exceed 16, or laulPos is read out of bounds.
		for (unsigned long lulJ = 0; lulJ < ulValues; lulJ++)
		{
			lfGesamt += pfIn[laulPos[lulJ]];
		}

		// Write the sum back so the compiler cannot eliminate the reads.
		pfOut[lulI] = lfGesamt;
	}
}

I run the kernel with ulValues = 0 beforehand and subtract that time from the real kernel run.
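
A sketch of how that subtraction can be done with CUDA events (the helper name and launch configuration are assumptions):

// Hypothetical helper: effective read bandwidth in GB/s (names assumed).
double measureBandwidth(float *pfOut, float *pfIn, unsigned long ulCount,
                        unsigned long ulStride, unsigned long ulOffset,
                        dim3 grid, dim3 block)
{
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	float fMsEmpty = 0.0f, fMsReal = 0.0f;

	// Baseline run: ulValues = 0, launch overhead only.
	cudaEventRecord(start);
	access<<<grid, block>>>(pfOut, pfIn, ulCount, 0, ulStride, ulOffset);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&fMsEmpty, start, stop);

	// Real run: ulValues = 16, the measured reads.
	cudaEventRecord(start);
	access<<<grid, block>>>(pfOut, pfIn, ulCount, 16, ulStride, ulOffset);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&fMsReal, start, stop);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);

	// Attribute the time difference to the 16 reads per thread.
	double dBytes = (double)ulCount * 16 * sizeof(float);
	return dBytes / ((fMsReal - fMsEmpty) * 1e6);   // GB/s
}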

The results are fairly reasonable now: I’m at 21 GB/s on the 9800 GT, which is still far from peak bandwidth.

Using tex1Dfetch gives a speedup of ~40%… I expected more ;(
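
For reference, a sketch of the texture path using the texture reference API available for this hardware (the reference name, kernel name, and binding call are assumptions):

// Hypothetical texture-reference variant (names assumed).
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void accessTex(float *pfOut, unsigned long ulCount, unsigned long ulValues, unsigned long ulStride, unsigned long ulOffset)
{
	unsigned long lulI = blockIdx.x * blockDim.x + threadIdx.x;
	float lfGesamt = 0;

	if (lulI < ulCount)
	{
		unsigned long pos = (lulI*ulStride+ulOffset)%ulCount;
		for (unsigned long lulJ = 0; lulJ < ulValues; lulJ++)
		{
			lfGesamt += tex1Dfetch(texIn, (int)pos);   // read goes through the texture cache
		}
		pfOut[lulI] = lfGesamt;
	}
}

// Host side, before the launch:
// cudaBindTexture(NULL, texIn, pfIn, ulCount * sizeof(float));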

Just to make sure I understand:

PTX code is the final GPU machine code, yes? Does the GPU optimize its execution at runtime (e.g. by skipping useless loops)?

No, it’s intermediate code for a virtual machine with an infinite number of registers. It is optimized when compiled (“assembled”) for the real architecture. Use [font=“Courier New”]cuobjdump -sass[/font] on the .cubin file to see (disassembled) machine code.
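
For example (the file name is a placeholder; sm_11 would match the 9800 GT):

nvcc -arch=sm_11 -cubin access.cu
cuobjdump -sass access.cubin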

This code should be optimized out by the compiler, since you don’t write anything back to global memory, so there is no real need to run it…

eyal