The relation between Thread Index and Shared Memory

Hi,

  1. Is there any relation between the thread index and the shared memory index? (see 2)

  2. In the programming guide it is written: “For devices of compute capability 1.x, the warp size is 32 and the number of banks
    is 16 (see Section 5.1); a shared memory request for a warp is split into one request
    for the first half of the warp and one request for the second half of the warp. As a
    consequence, there can be no bank conflict between a thread belonging to the first
    half of a warp and a thread belonging to the second half of the same warp.”
    How can that be?
    If I define __shared__ int s_mem[2], does that mean that the threads of the first half-warp will use s_mem[0] and the threads of the second half-warp will use s_mem[1]?

  3. In the programming guide it is written: “A multiprocessor can execute as many as eight thread blocks
    concurrently.” Does that mean eight active blocks per multiprocessor on my GPU?

Thanks
Miki

  1. Yes. Remember that shared memory is memory, and all threads in the same block can access it directly.

  2. With your definition, each block gets an integer array of 2 elements. These 2 elements are allocated in 2 banks of shared memory (bank 0 and bank 1). When different threads access the same bank, and those threads are in the same half-warp (the first 16 threads of the warp, or the last 16), a bank conflict occurs. For example, if thread 0 accesses bank 0 (s_mem[0]) and thread 1 also accesses bank 0 (s_mem[0]), a bank conflict occurs (ignoring the broadcast case); but if thread 0 accesses bank 0 (s_mem[0]) and thread 16 accesses bank 0 (s_mem[0]), no bank conflict occurs, because they are in different half-warps (see the sketch after this list).

  3. As far as I understand, a compute 1.x multiprocessor has 8 scalar processors, so in the first clock cycle the first eight threads of a warp execute concurrently, and in the fourth clock cycle the final eight threads execute; 32 threads / 8 processors = 4 clock cycles for all threads within the same warp.
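For illustration, here is a minimal sketch of the pattern described in point 2 (a hypothetical kernel; it assumes compute 1.x behavior, i.e. 16 banks, bank = word index mod 16, and requests issued one half-warp at a time):

__global__ void bank_demo(int *out)
{
    __shared__ int s_mem[2];   // s_mem[0] lives in bank 0, s_mem[1] in bank 1

    if (threadIdx.x < 2)
        s_mem[threadIdx.x] = threadIdx.x;
    __syncthreads();

    // Within one half-warp (threads 0-15), eight threads hit bank 0 and
    // eight hit bank 1, so each bank serializes its eight requests
    // (an 8-way conflict, ignoring the broadcast special case).
    int v = s_mem[threadIdx.x % 2];

    // By contrast, thread 0 and thread 16 may both read s_mem[0] without
    // conflicting: they are in different half-warps, so their requests
    // are never issued in the same transaction.
    out[threadIdx.x] = v;
}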

sorry for my poor English.

Is that because a shared memory request for one word (32 bits) pulls out 64 bits per half-warp?

So I wonder: is it then always better to allocate shared memory in sizes that are multiples of 64?

Thanks

Miki

I don’t understand your question clearly, but allocating shared memory in a size that is a multiple of 64 will not by itself prevent bank conflicts; whether conflicts occur depends on the access pattern. I think you had better read “5.1.2.5 Shared Memory” in the CUDA Programming Guide carefully.
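For what it’s worth, that section is about the access pattern rather than the allocation size. Here is a minimal sketch of the classic padding trick it relates to (a hypothetical 16x16 transpose tile, assuming a 16x16 thread block on compute 1.x):

__global__ void transpose_tile(const float *in, float *out)
{
    // Pad each row from 16 to 17 elements: a column walk then touches
    // address i*17 + c, which falls in bank (i + c) mod 16, so the 16
    // threads of a half-warp hit 16 distinct banks.
    __shared__ float tile[16][17];

    int x = threadIdx.x;
    int y = threadIdx.y;

    tile[y][x] = in[y * 16 + x];
    __syncthreads();

    // Without the padding, reading "down a column" like this would put
    // all 16 threads of a half-warp into the same bank.
    out[y * 16 + x] = tile[x][y];
}

The extra column simply shifts each row’s starting bank by one, which is what breaks the conflict.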

:)

I’m still learning but my understanding is:

  1. No. Each thread may index whatever data from shared memory it likes, as specified in the code. In Programming Guide 2.1, page 7, each thread uses threadIdx.x as the index, but each thread could use whatever index it wanted to; it doesn’t necessarily have to use the thread index. (It should probably be related to threadIdx, or else all the threads would do the same thing and provide no parallel benefit.) See the first sketch after this list.

  2. If two threads in the same half-warp request data from the same bank (assuming no broadcasting), they conflict: the accesses occur sequentially and take longer. I think of it sort of like making phone calls, where each bank can only accept one phone call at a time. If multiple threads place calls to the same bank, the calls are forced to occur sequentially, in arbitrary order.

The half-warp thing means that the multiprocessor does not attempt all 32 accesses simultaneously. Instead, it attempts the first 16 accesses simultaneously; some of these 16 may have to be serialized if they conflict. Then it attempts the remaining 16 accesses simultaneously. If one thread from the first group and one from the second group access the same bank, they do not conflict, because the hardware never attempts to make those accesses simultaneous in the first place. (The second sketch after this list shows both a conflict-free and a conflicting pattern.)

  3. Yes, it means each multiprocessor can have eight active thread blocks. But I don’t believe they all execute on each cycle; I think they are time-sliced and therefore share the compute power, just as they share the memory and registers. At most 32 threads in the same warp can run truly concurrently, and any more than that, within a block or across multiple blocks, have to be time-sliced.

The reason for time-slicing between multiple blocks is that if some blocks stall due to memory access, synchronization, or some other reason, the multiprocessor won’t sit idle; it will work on a different block. If not for that, there would be no advantage to multiple blocks, because a single block could keep a multiprocessor completely busy.
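To illustrate point 1, here is a minimal sketch (a hypothetical kernel; the mirrored index is just an arbitrary choice, and it assumes blockDim.x <= 256):

__global__ void index_demo(const int *in, int *out)
{
    __shared__ int s[256];
    int tid = threadIdx.x;

    // The usual pattern: each thread loads the element matching its own index.
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Nothing forces that, though: a thread may read any shared slot.
    // Here each thread reads its "mirror" element instead.
    out[blockIdx.x * blockDim.x + tid] = s[blockDim.x - 1 - tid];
}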
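And here is a sketch of the phone-call picture from point 2 (hypothetical kernel, assuming compute 1.x with 16 banks and a single warp of 32 threads):

__global__ void stride_demo(float *out)
{
    __shared__ float s[64];

    s[threadIdx.x]      = (float)threadIdx.x;
    s[threadIdx.x + 32] = (float)(threadIdx.x + 32);
    __syncthreads();

    // Conflict-free: threads 0-15 of a half-warp hit banks 0-15,
    // one "phone call" per bank.
    float a = s[threadIdx.x];

    // 2-way conflict: stride 2 maps a half-warp onto only the 8 even
    // banks, so each bank takes two calls and serializes them.
    float b = s[2 * threadIdx.x];

    out[threadIdx.x] = a + b;
}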

That’s my understanding.
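If you want to check the multiprocessor count on your own GPU, the runtime API can report it (a sketch; cudaGetDeviceProperties is the standard call):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    printf("warp size:       %d\n", prop.warpSize);

    // On compute 1.x, at most 8 blocks can be resident per multiprocessor;
    // the actual number is further limited by each block's register and
    // shared memory usage.
    return 0;
}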