Shared memory bank conflict optimization bug reveals kernel hot spot, makes no sense

I’m having a bit of a code dilemma. I have a shared memory array that every thread reads in its entirety. Initially, I was reading foo[tx], and this provided about 240 GFLOP/s. Since foo is a 16-byte type, that access should cause 4-way bank conflicts and kill performance. It was also a bug, since I actually needed to read foo[i]; after I made the correction, performance dropped to 135 GFLOP/s. Now, since ‘i’ is constant across the half-warp, there should be no bank conflict at all: the read should use the broadcast mechanism. Looking at the performance data, this is obviously not the case.

My question is: what could be causing such a drop in performance, and how do I fix it?

// Kernel start

int tx = threadIdx.x;
int ti = blockDim.x * blockIdx.x + tx; // Used for global mem read/write

...

__shared__ 16byteType foo[BLOCK_X];  // element type is 16 bytes
3byteType someReg;                   // per-thread operand kept in registers

// Load someReg
...

// Load foo
__syncthreads();

...

// Completely unroll the following loop
#pragma unroll
for (int i = 0; i < BLOCK_X; i++)
{
	fooFunc(
		foo[i],  // This is the hot spot: foo[tx] is way faster than foo[i]
		someReg);
}

// Write back whatever result
// Kernel end

The code inside your fooFunc() executes exactly the same way (from a performance point of view) whether the wrong parameter, foo[tx], is passed or the right one, foo[i], right?

Yes, fooFunc() does the exact same 15 floating-point operations regardless of what values it is passed.

I tested something similar. The broadcast case, [i], performs slightly worse than the thread-indexed access (174 ms against 164 ms). I can’t think of why there is such a difference given the descriptions in the programming guide (the PTX assembly codes are virtually identical).

But in the case of your huge performance gap, I guess the compiler is optimizing out a portion of your code when you use foo[tx]. See if you can verify this by comparing the generated PTX files. Also, see if foo[tx+i] makes a difference; it should still avoid conflicts, but it is less likely to be optimized out…
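For reference, here is a rough sketch of the rotated-index variant being suggested. All names (foo, BLOCK_X, fooFunc, someReg) refer to the kernel above; the modulo wrap is my own addition to keep tx+i inside the array and is only a guess at what was intended. The PTX for both versions can be dumped with nvcc’s -ptx switch and then diffed.

#pragma unroll
for (int i = 0; i < BLOCK_X; i++)
{
	// Start each thread at its own offset so the load still depends on tx
	// and is harder to hoist out of the loop; wrap so the index stays in
	// bounds (plain tx+i would read past the end of foo for tx > 0).
	fooFunc(foo[(tx + i) % BLOCK_X],
		someReg);
}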

If you’re executing 15 fp operations, then the time to fetch foo from shared memory should make little difference. It’s probably what Rostam said.

Thanks for your replies, Rostam and Alex.
Excuse me for my slow reply; I am still transferring to and setting up my new PC.

I’ve compared the [tx] and [i] PTX files and found that the [tx] version is about 76 KB, while the [i] version is about 133 KB (the [tx+i] variant yields about 60 GFLOP/s but causes some fencing errors, so I am leaving it out). I still have difficulty reading PTX files, but when I get some time I’ll look more attentively and hopefully find what is causing this huge difference.

Uh, foo[tx] is constant over the whole loop, so it will be read only once. You are comparing the speed of one shared memory access to the speed of BLOCK_X shared memory accesses. Also, I have not re-read the programming guide, but I am quite sure that “since foo is a 4 byte type, this should cause 4-way bank conflicts” is just wrong, since the “native” size for shared memory is 4 bytes.
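A minimal sketch of what the compiler is presumably allowed to do in the foo[tx] case (a hypothetical transformation for illustration, not taken from the actual PTX; names foo, BLOCK_X, fooFunc, someReg are from the kernel above): because the index does not depend on i, the load can be hoisted out of the fully unrolled loop, leaving a single shared memory read instead of BLOCK_X of them.

// Effective form of the foo[tx] version after hoisting the loop-invariant load
16byteType tmp = foo[tx];  // one shared memory read for the whole loop
#pragma unroll
for (int i = 0; i < BLOCK_X; i++)
{
	fooFunc(tmp, someReg);  // no further shared memory traffic
}

// The correct foo[i] version cannot be hoisted: each of the BLOCK_X
// iterations issues its own shared memory read.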

Sorry about that; I meant to say 16-byte type.

As far as foo[tx]/foo[i] is concerned, thanks! You answered my question. I guess there’s no way to optimize that further.