The relation between Thread Index and Shared Memory

Hi,

  1. Is there any relation between the thread index and the shared memory index? (see 2)

  2. In the programming guide it is written: “For devices of compute capability 1.x, the warp size is 32 and the number of banks
    is 16 (see Section 5.1); a shared memory request for a warp is split into one request
    for the first half of the warp and one request for the second half of the warp. As a
    consequence, there can be no bank conflict between a thread belonging to the first
    half of a warp and a thread belonging to the second half of the same warp.”
    How can that be?
    If I define __shared__ int s_mem[2], does that mean that the threads of the first half-warp will use s_mem[0] and the threads of the second half-warp will use s_mem[1]?

  3. In the programming guide it is written: “A multiprocessor can execute as many as eight thread blocks
    concurrently.” Does that mean eight active blocks per multiprocessor on my GPU?

Thanks
Miki

  1. Yes. Remember that shared memory is memory, and all threads in the same block can access it directly.

  2. With your definition, each block gets an integer array of 2 elements. These 2 elements are allocated in 2 banks of shared memory (bank 0 and bank 1). When different threads access the same bank, and those threads are in the same half-warp (the first 16 threads of the warp, or the last 16), a bank conflict occurs. For example, if thread 0 accesses bank 0 (s_mem[0]) and thread 1 also accesses bank 0 (s_mem[0]), a bank conflict occurs (ignoring the broadcast case); but if thread 0 accesses bank 0 (s_mem[0]) and thread 16 accesses bank 0 (s_mem[0]), no bank conflict occurs, because they are in different half-warps (see the sketch after this list).

  3. As far as I understand, a compute 1.x multiprocessor has 8 scalar processors, so in the first clock cycle the first eight threads of a warp execute concurrently, and in the fourth clock cycle the final eight threads execute; 32 threads / 8 processors = 4 clock cycles for all threads within the same warp.
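For illustration, here is a minimal sketch of the pattern described in point 2 (a hypothetical kernel; it assumes compute 1.x behavior, i.e. 16 banks, bank = word index mod 16, and requests issued one half-warp at a time):

__global__ void bank_demo(int *out)
{
    __shared__ int s_mem[2];   // s_mem[0] lives in bank 0, s_mem[1] in bank 1

    if (threadIdx.x < 2)
        s_mem[threadIdx.x] = threadIdx.x;
    __syncthreads();

    // Within one half-warp (threads 0-15), eight threads hit bank 0 and
    // eight hit bank 1, so each bank serializes its eight requests
    // (an 8-way conflict, ignoring the broadcast special case).
    int v = s_mem[threadIdx.x % 2];

    // By contrast, thread 0 and thread 16 may both read s_mem[0] without
    // conflicting: they are in different half-warps, so their requests
    // are never issued in the same transaction.
    out[threadIdx.x] = v;
}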

sorry for my poor English.

Is that because a shared memory request for one word (32 bits) pulls out 64 bits per half-warp?

So I wonder: is it then always better to allocate shared memory in sizes that are multiples of 64?

Thanks

Miki

I don’t understand your question clearly, but allocating shared memory in a size that is a multiple of 64 will not by itself prevent bank conflicts; whether conflicts occur depends on the access pattern. I think you had better read “5.1.2.5 Shared Memory” in the CUDA Programming Guide carefully.
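For what it’s worth, that section is about the access pattern rather than the allocation size. Here is a minimal sketch of the classic padding trick it relates to (a hypothetical 16x16 transpose tile, assuming a 16x16 thread block on compute 1.x):

__global__ void transpose_tile(const float *in, float *out)
{
    // Pad each row from 16 to 17 elements: a column walk then touches
    // address i*17 + c, which falls in bank (i + c) mod 16, so the 16
    // threads of a half-warp hit 16 distinct banks.
    __shared__ float tile[16][17];

    int x = threadIdx.x;
    int y = threadIdx.y;

    tile[y][x] = in[y * 16 + x];
    __syncthreads();

    // Without the padding, reading "down a column" like this would put
    // all 16 threads of a half-warp into the same bank.
    out[y * 16 + x] = tile[x][y];
}

The extra column simply shifts each row’s starting bank by one, which is what breaks the conflict.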

:)

I’m still learning but my understanding is:

  1. No. Each thread may index whatever data from shared memory it likes, as specified in the code. In Programming Guide 2.1, page 7, each thread uses threadIdx.x as the index, but each thread could use whatever index it wanted to; it doesn’t necessarily have to use the thread index. (It should probably be related to threadIdx, or else all the threads would do the same thing and provide no parallel benefit.) See the first sketch after this list.

  2. If two threads in the same half-warp request data from the same bank (assuming no broadcasting), they conflict: the accesses occur sequentially and take longer. I think of it sort of like making phone calls, where each bank can only accept one phone call at a time. If multiple threads place calls to the same bank, the calls are forced to occur sequentially, in arbitrary order.

The half-warp thing means that the multiprocessor does not attempt all 32 accesses simultaneously. Instead, it attempts the first 16 accesses simultaneously; some of these 16 may have to be serialized if they conflict. Then it attempts the remaining 16 accesses simultaneously. If one thread from the first group and one from the second group access the same bank, they do not conflict, because the hardware never attempts to make those accesses simultaneous in the first place. (The second sketch after this list shows both a conflict-free and a conflicting pattern.)

  3. Yes, it means each multiprocessor can have eight active thread blocks. But I don’t believe they all execute on each cycle; I think they are time-sliced and therefore share the compute power, just as they share the memory and registers. At most 32 threads in the same warp can run truly concurrently, and any more than that, within a block or across multiple blocks, have to be time-sliced.

The reason for time-slicing between multiple blocks is that if some blocks stall due to memory access, synchronization, or some other reason, the multiprocessor won’t sit idle; it will work on a different block. If not for that, there would be no advantage to multiple blocks, because a single block could keep a multiprocessor completely busy.
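To illustrate point 1, here is a minimal sketch (a hypothetical kernel; the mirrored index is just an arbitrary choice, and it assumes blockDim.x <= 256):

__global__ void index_demo(const int *in, int *out)
{
    __shared__ int s[256];
    int tid = threadIdx.x;

    // The usual pattern: each thread loads the element matching its own index.
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Nothing forces that, though: a thread may read any shared slot.
    // Here each thread reads its "mirror" element instead.
    out[blockIdx.x * blockDim.x + tid] = s[blockDim.x - 1 - tid];
}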
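And here is a sketch of the phone-call picture from point 2 (hypothetical kernel, assuming compute 1.x with 16 banks and a single warp of 32 threads):

__global__ void stride_demo(float *out)
{
    __shared__ float s[64];

    s[threadIdx.x]      = (float)threadIdx.x;
    s[threadIdx.x + 32] = (float)(threadIdx.x + 32);
    __syncthreads();

    // Conflict-free: threads 0-15 of a half-warp hit banks 0-15,
    // one "phone call" per bank.
    float a = s[threadIdx.x];

    // 2-way conflict: stride 2 maps a half-warp onto only the 8 even
    // banks, so each bank takes two calls and serializes them.
    float b = s[2 * threadIdx.x];

    out[threadIdx.x] = a + b;
}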

That’s my understanding.
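If you want to check the multiprocessor count on your own GPU, the runtime API can report it (a sketch; cudaGetDeviceProperties is the standard call):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    printf("warp size:       %d\n", prop.warpSize);

    // On compute 1.x, at most 8 blocks can be resident per multiprocessor;
    // the actual number is further limited by each block's register and
    // shared memory usage.
    return 0;
}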