Bank Conflicts and Serialized Warps

I’ve been running a few small kernels through the CUDA Visual Profiler to test my understanding of shared memory, and I’ve run into a bit of a problem.

I’m almost certain that my kernel should be causing 16-way bank conflicts, but according to the “warp_serialize” field in the profiler, this isn’t the case.

The kernel is listed below:

[codebox]__global__ void bank_conflicts(float* array) {
    extern __shared__ float cache[];

    // Every thread reads a word index that is a multiple of 16,
    // which I expected to map every access to the same bank.
    float tmp = cache[threadIdx.x * 16];

    array[threadIdx.x] = tmp;
}[/codebox]

For each thread in a half-warp, (threadIdx.x * 16) % 16 evaluates to 0, which I thought would force all 16 threads to access the same shared memory bank. If that's the case, the accesses should be fully serialized, and I'd expect that to show up in warp_serialize.
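Just to spell out the arithmetic, here is a tiny host-side sketch (nothing to do with the kernel itself; it just prints the mapping, assuming the 16 banks and 4-byte words of a compute 1.x device) of which bank each thread in a half-warp should hit:

[codebox]// Host-side illustration only: bank index for cache[threadIdx.x * 16],
// assuming 16 banks of 4-byte words (compute capability 1.x).
#include <stdio.h>

int main(void) {
    const int NUM_BANKS = 16;
    for (int tid = 0; tid < 16; ++tid) {      // one half-warp
        int wordIndex = tid * 16;             // index into cache[]
        int bank = wordIndex % NUM_BANKS;     // bank that 4-byte word lives in
        printf("thread %2d -> word %3d -> bank %d\n", tid, wordIndex, bank);
    }
    return 0;                                 // every line prints bank 0
}[/codebox]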

However, if I run this kernel with a single block of 32 threads, the Visual Profiler reports that “warp_serialize” is equal to 120. I could almost understand a value of 128 (each of the 32 threads reads 4 bytes), but 120 has me completely baffled!

If anybody could shed light on the situation, it would be much appreciated!

EDIT: Another curious point of note is that the Visual Profiler reports that this kernel contains 5 branches, and that 1 of them is divergent. Does anybody know what’s going on?

If all threads in a half-warp access a single address, shared mem goes into broadcast mode and delivers all data in a single cycle.
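For example, an access pattern along these lines (the kernel name is made up for illustration) reads one word and is serviced as a single broadcast rather than a conflict:

[codebox]// Illustration only: every thread reads the same shared memory word,
// so the hardware broadcasts it in one cycle instead of serializing.
__global__ void broadcast_read(float* array) {
    extern __shared__ float cache[];

    float tmp = cache[0];          // one address for the whole half-warp

    array[threadIdx.x] = tmp;
}[/codebox]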

Maybe I’m being stupid, but how would that happen in this case?

The value of threadIdx.x is different for each thread, so cache[threadIdx.x * 16] will evaluate to a different memory location on each thread. My understanding is that this would result in them accessing the same bank, but with different addresses.

Hm, that was a brainfart on my part. Disregard.

Perhaps they don’t count the first thread when reporting on warp_serialize? Otherwise it would be at least 1 for any program.

I thought the traditional trick was to allocate a shared array of n*17 to force offset access; see the sketch below.
[url="http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/CUDA_Optimization_Harris.pdf"]http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/CUDA_Optimization_Harris.pdf[/url]
See p. 41.
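Roughly, the padding idea looks like this (tile size and kernel name are just for illustration, and it assumes a 16x16 thread block and 16 banks of 4-byte words):

[codebox]#define TILE 16

__global__ void padded_column_read(float* out) {
    // TILE+1 columns instead of TILE: consecutive rows of the same column
    // now fall into different banks, so a column read is conflict-free.
    __shared__ float tile[TILE][TILE + 1];

    tile[threadIdx.y][threadIdx.x] = threadIdx.x;   // fill the tile
    __syncthreads();

    // Read a column: without the +1 padding this would be a 16-way conflict.
    out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}[/codebox]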

EDIT: I read your post too fast. But if you allocated cache with a size n where n % 16 != 0, that would explain why you don't see bank conflicts.

I can see why that would happen, but the amount of shared memory I’m allocating is defined as threads * 16, so size % 16 will always be 0.

(Thanks, by the way, for the tip about padding shared memory; I’m sure that will come in useful later.)
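In case it matters, the launch looks roughly like this (the host-side names here are just placeholders), with the dynamic shared memory size passed as the third launch parameter:

[codebox]// Rough sketch of the launch: the third launch parameter supplies the
// dynamically allocated shared memory backing "extern __shared__ float cache[]".
const int threads = 32;
float* d_array;
cudaMalloc((void**)&d_array, threads * sizeof(float));

bank_conflicts<<<1, threads, threads * 16 * sizeof(float)>>>(d_array);
cudaThreadSynchronize();   // wait for the kernel before profiling / reading back[/codebox]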