Do the threads of a warp run serially?

Dear All.
I have some questions.

  1. What is the real shared memory size of my GPU (Titan V)?
    I wrote a program that queries the sharedMemPerBlock device property, and it reports a shared memory capacity of 64 KB per block. However, I saw a post saying that Volta has 96 KB of shared memory (https://devtalk.nvidia.com/default/topic/1052021/cuda-programming-and-performance/shared-memory-size-per-thread-block/), while the NVIDIA manual says 128 KB, perhaps because that figure includes a unified data cache (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-6-x). So which is correct?

  2. Do the threads of a warp run serially? Is the warp the basic scheduling unit on every architecture?
    If it is, do the threads of a warp run serially on an SP at the hardware level?

  3. Does the shared memory capacity per block equal the shared memory capacity per SM?
    If so, and a block occupies all of the shared memory, can only that one block be resident on the SM?
    Also, suppose one SM has 64 SPs; does that mean it can execute 64 warps in parallel? If my block size is 128 (containing 4 warps) and the block occupies all of the shared memory, does the SM then run only those 4 warps in parallel at the hardware level?

  4. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#application-compatibility
    Does the passage “The Volta architecture introduces Independent Thread Scheduling which changes the way threads are scheduled on the GPU. For code relying on specific behavior of SIMT scheduling in previous architectures, Independent Thread Scheduling may alter the set of participating threads, leading to incorrect results. To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal’s thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.” mean that threads will no longer be scheduled as warps on Volta?

Excuse my bad English. Thank you very much.

The CUDA Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) contains a handy table in Appendix H that shows the amount of shared memory available per thread block and per multiprocessor. The amounts vary with architecture (compute capability). None of the table entries shows 128 KB.

Oh, thank you. I may have made a mistake.
By the way, do the threads of a warp run serially on the Volta architecture?
I tested allocating all of the shared memory in one block, but no performance degradation occurred.

1. what is the real shared memory size of my gpu(titan v)?.

The Volta architecture has a unified data cache that can be partitioned between shared memory and tagged RAM. This is covered in the CUDA Programming Guide, section H.6 Compute Capability 7.x, in the Shared Memory subsection.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x

The size of the unified data cache is 128 KiB. The shared memory capacity can be set to 0 KiB, 8 KiB, 16 KiB, 32 KiB, 64 KiB, or 96 KiB.

The maximum amount of shared memory per block is 48 KiB by default. The developer can call cudaFuncSetAttribute to explicitly raise this limit, allowing a single block to access the full 96 KiB. See the link above for references to all of the relevant API calls.

2. are the threads of warp run serially? is warp the base scheduling unit in all the architecture?

Each Volta SM has 4 sub-partitions. Each SM sub-partition has a warp scheduler, a register file, and multiple dedicated execution pipelines. Each cycle the warp scheduler selects an eligible warp (a warp that is not stalled) from the set of active warps and dispatches it to the execution pipeline determined by the instruction. Some execution pipelines are not 32 lanes wide. If the pipeline is not 32 lanes wide, the warp is dispatched over multiple cycles. If an instruction dispatches over multiple cycles, the warp scheduler can dispatch an eligible warp (possibly the same warp) to a different execution pipe on the next cycle.

3. does the shared memory storage capacity per block mean that the shared memory storage capacity of per sm?

No. The amount of shared memory per block and the amount of shared memory per SM are controlled through separate mechanisms. See the link above and the sections in the CUDA programming guide related to shared memory and warp occupancy.

Occupancy Calculator API https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#occupancy-calculator

Occupancy Calculator XLSX https://docs.nvidia.com/cuda/cuda-occupancy-calculator/CUDA_Occupancy_Calculator.xls

The XLSX will allow you to independently set the SM shared memory size on Volta and the shared memory per thread block.

4. Does the passage “The Volta architecture introduces Independent Thread Scheduling which changes the way threads are scheduled on the GPU…” mean that threads will no longer be scheduled as warps?

Yes, the independent thread scheduler can change the order of execution of divergent execution paths.

Thank you!

The occupancy is very low when I check it with the Occupancy Calculator XLSX.
For example:
Threads Per Block 256
Registers Per Thread 46
Shared Memory Per Block (bytes) 98304

showing that
Active Threads per Multiprocessor 256
Active Warps per Multiprocessor 8
Active Thread Blocks per Multiprocessor 1
Occupancy of each Multiprocessor 13%

Does it mean that the more shared memory a block uses, the fewer blocks will reside on an SM?
For example, suppose a block occupies 96 KB of shared memory.
At this time, only 8 block resides in one sm? Does it mean that a large computing core is not working in one sm?(does only 13% Computing Unit work?), so does it is very inefficient?

Besides, I have encountered a strange phenomenon.
I use shared memory to avoid non-coalesced access (data are written from global memory into shared memory and then read back out of shared memory to avoid non-coalesced access, and one block takes up almost all of the shared memory). According to the Occupancy Calculator XLSX, the occupancy of each multiprocessor is only 13%, yet my program’s performance improved.
So does this mean that 13% of the compute units doing coalesced accesses can outperform 100% of the compute units doing non-coalesced accesses?

Sorry, the sentence
“At this time, only 8 block resides in one sm? Does it mean that a large computing core is not working in one sm?(does only 13% Computing Unit work?), so does it is very inefficient?” above
should read
“At this time, only 1 block resides in one sm? Does it mean that a large computing core is not working in one sm?(does only 13% Computing Unit work?), so does it is very inefficient?”

If a single block uses 96 KB of shared memory, only 1 block will fit on a Volta SM. The warps of that block (assuming there are 4 or more) have access to all of the resources in the SM, including all of the execution pipes.

But you have severely limited the machine’s capacity to hide latency: the maximum complement (occupancy) is 256 threads due to your shared memory usage, when it could be 2048 threads otherwise (256/2048 = 12.5%, which the calculator shows as 13%).

Severely limiting the ability of the machine to hide latency may result in lower performance, which is generally why people consider occupancy.

Nonetheless, is it sometimes more efficient to use more shared memory than to increase occupancy?

Additional information about the kernel’s global memory access patterns, shared memory access patterns, data re-use, and computation would be needed to provide further feedback. Ideally this is provided as a description of the problem, the current solution, and a minimal reproducible example.

Increasing the number of active warps per cycle increases the warp schedulers’ ability to hide latency. A CUDA profiler such as Nsight Compute can help determine whether the kernel is compute bound, memory latency bound, or memory bandwidth bound.

There are definitely optimization edge cases where it pays to limit warps/SM in order to use a high number of registers per thread (128 registers/thread limits you to 16 warps/SM; 255 registers/thread limits you to 8 warps/SM) or a large amount of shared memory. Without the additional information it is hard to provide further feedback.

Thank you very much. I will think again about my program’s memory hierarchy.