Unexplicable banks conflicts. Visual profiler and warp serialize.

bouvier · May 18, 2011, 3:42pm

Hi!

I have a very simple kernel and when i run it, according to the Visual Profiler, there are some bank conflicts.

I don’t understand where these bank conflicts are coming from.

The Visual Profiler tells me that there is 412 warp serialize when i run these kernel with 32 threads per block and 1 block(so only 1 warp). How is is possible?

__global__ void Cuda_Test(biguint_t *Aarg, biguint_t *Barg)

{

        __shared__ unsigned int r;

        r[threadIdx.x]=Aarg[blockIdx.x][threadIdx.x];

        unsigned int a=r[0];

        unsigned int b=Ncst[threadIdx.x]; //In constant memory

        unsigned int l=a*b;//r[threadIdx.y][0]*Ncst[threadIdx.x];

        Aarg[blockIdx.x][threadIdx.x]=l;

}

According to what i understand they should be a broadcast in the third line, and no bank conflict in the next (each thread in a half warp wants a 32bit piece of memory that should be in different bank). So where the warp serialize come from?

Thanks!

Cyril.

Gregory_Diamos · May 18, 2011, 4:37pm

Try commenting out lines until it goes away. I would suspect your read from constant memory. I don’t use the profiler much so I don’t know if it counts them, but constant memory can have bank conflicts.

tera · May 18, 2011, 6:58pm

Yes, nonuniform accesses to constant memory get serialized, at least on 1.x devices. So using [font=“Courier New”]threadIdx.x[/font] as index in constant memory is the worst possible case. You can use a texture instead if you want it cached on 1.x devices, otherwise just use global memory.

Gregory_Diamos · May 18, 2011, 7:53pm

Constant memory should really be renamed ‘Constant and Uniform Memory’. The right way to think about it is that it only applies to values that you would normally stick into the immediate field of an instruction. It is really only there to avoid recompiling kernels that need to change the some immediate values between successive executions of the same kernel.

tera · May 18, 2011, 8:13pm

Yes. Although thankfully it is going away anyway.

Do you know whether 2.x devices still have constant memory? The Programming Guide claims it’s still there, but I’ve yet to see the compiler actually use it - whenever I disassemble a 2.x cubin, the compiler has actually just placed everything declared constant in global memory.

Gregory_Diamos · May 18, 2011, 8:27pm

It is used mainly by graphics applications and I think that all GPU vendors include something similar to it, so I would assume that it is still in Fermi. Of course the compiler can choose not to use it, and without a way to program the driver/ISA explicitly, you may not be able to access it. You can declare it in PTX, although I don’t know if the lower layers of the compiler will convert it to global memory.

tera · May 18, 2011, 9:13pm

Oops, I lied. While I just tried to find an example I seem to remember where ptxas had moved data from constant to global memory I’ve come across an old sm_20 disassembly where constant memory is indeed used.

Thanks, Gregory!

bouvier · May 19, 2011, 7:44am

Hi!

Thanks the problem looks like it comes from the access to the constant memory.
What do you mean by non uniform access?

I have a kernel that do a lot of access to Ncst, it would slow down my kernel if i switch from constant to global memory, no?

Cyril

tera · May 19, 2011, 11:38am

With non-uniform access I mean that different threads of a warp (half-warp for 1.x devices) read different addresses. Your case is an extreme example for non-uniform access because using threadIdx.x as an index each thread will read different addresses, requiring 16 sequential reads instead of just one read whose result is broadcast to 16 threads.

What compute capability is your device? On 2.x devices you can just move the data to global memory, since that is cached as well and does not require uniform access. On 1.x you can use a texture in linear memory for the same effect.

bouvier · May 19, 2011, 12:29pm

Thanks for the explanation. I also find the section in the CUDA documentation that explain this issue (it is very well hidden in section 5.3.2.5).
I try my kernel on device of compute capability 1.x, I will look into texture in linear memory.

Topic		Replies	Views
Bank Conflicts and Serialized Warps CUDA Programming and Performance	6	7806	December 4, 2009
Accounting for warp serialisation CUDA Programming and Performance	5	718	April 8, 2011
Constant Arrays CUDA Programming and Performance	13	30498	November 24, 2007
cuda profiler reports high warp serialize CUDA Programming and Performance	5	2057	May 14, 2010
Bank Conflicts at Constant Memory CUDA Programming and Performance	2	1200	May 8, 2012
Constant memory 64KB only 8KB usable? CUDA Programming and Performance	13	7019	February 15, 2011
Global memory broadcast CUDA Programming and Performance	2	9073	July 4, 2011
What should I optimize first? Divergence? Serialized Warps? CUDA Programming and Performance	4	7229	April 7, 2009
Requesting clarification for Non contiguous shared memory access by threads of a warp with no bank conflicts CUDA Programming and Performance hw , cuda	5	394	February 21, 2024
dont understand bank conflicts for shared mem CUDA Programming and Performance	7	2629	March 31, 2010

Unexplicable banks conflicts. Visual profiler and warp serialize.

Related topics