Do the threads of a warp run serially?

Dear All.
I have some questions.

  1. What is the real shared memory size of my GPU (Titan V)?
    I wrote a program that queries the sharedMemPerBlock device property, and it reports a shared memory capacity of 64 KB per block. However, I saw a post saying that Volta has 96 KB of shared memory (https://devtalk.nvidia.com/default/topic/1052021/cuda-programming-and-performance/shared-memory-size-per-thread-block/), while the NVIDIA manual says 128 KB, perhaps because that figure includes a unified data cache (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-6-x). So which is correct?

  2. Do the threads of a warp run serially? Is the warp the basic scheduling unit on every architecture?
    If it is, do the threads of a warp run serially on an SP at the hardware level?

  3. Does the shared memory capacity per block equal the shared memory capacity per SM?
    If so, and a block occupies all of the shared memory, can only that one block be resident on the SM?
    Also, suppose one SM has 64 SPs; does that mean it can execute 64 warps in parallel? If my block size is 128 (containing 4 warps) and the block occupies all of the shared memory, does the SM then run only those 4 warps in parallel at the hardware level?

  4. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#application-compatibility
    Does the passage “The Volta architecture introduces Independent Thread Scheduling which changes the way threads are scheduled on the GPU. For code relying on specific behavior of SIMT scheduling in previous architectures, Independent Thread Scheduling may alter the set of participating threads, leading to incorrect results. To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal’s thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.” mean that threads will no longer be scheduled as warps on Volta?

Excuse my bad English. Thank you very much.

The CUDA Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) contains a handy table in Appendix H that shows the amount of shared memory available per thread block and per multiprocessor. The amounts vary with architecture (compute capability). None of the table entries shows 128 KB.

Oh, thank you. I may have made a mistake.
By the way, do the threads of a warp run serially on the Volta architecture?
I tested allocating all of the shared memory in one block, but no performance degradation occurred.

1. what is the real shared memory size of my gpu(titan v)?.

The Volta architecture has a unified data cache that can be partitioned between shared memory and tagged RAM. This is covered in the CUDA Programming Guide, section H.6 Compute Capability 7.x, in the Shared Memory subsection.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x

The size of the unified data cache is 128 KiB. The shared memory capacity can be set to 0 KiB, 8 KiB, 16 KiB, 32 KiB, 64 KiB, or 96 KiB.

The maximum amount of shared memory per block is 48 KiB by default. The developer can call cudaFuncSetAttribute to explicitly raise this limit, allowing a single block to access the full 96 KiB. See the link above for references to all of the relevant API calls.

2. are the threads of warp run serially? is warp the base scheduling unit in all the architecture?

Each Volta SM has 4 sub-partitions. Each SM sub-partition has a warp scheduler, a register file, and multiple dedicated execution pipelines. Each cycle the warp scheduler selects an eligible warp (a warp that is not stalled) from the set of active warps and dispatches it to the execution pipeline determined by the instruction. Some execution pipelines are not 32 lanes wide. If the pipeline is not 32 lanes wide, the warp is dispatched over multiple cycles. If an instruction dispatches over multiple cycles, the warp scheduler can dispatch an eligible warp (possibly the same warp) to a different execution pipe on the next cycle.

3. does the shared memory storage capacity per block mean that the shared memory storage capacity of per sm?

No. The amount of shared memory per block and the amount of shared memory per SM are controlled through separate mechanisms. See the link above and the sections in the CUDA programming guide related to shared memory and warp occupancy.

Occupancy Calculator API https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#occupancy-calculator

Occupancy Calculator XLSX https://docs.nvidia.com/cuda/cuda-occupancy-calculator/CUDA_Occupancy_Calculator.xls

The XLSX will allow you to independently set the SM shared memory size on Volta and the shared memory per thread block.

4. Does the passage “The Volta architecture introduces Independent Thread Scheduling which changes the way threads are scheduled on the GPU…” mean that threads will no longer be scheduled as warps?

Yes, the independent thread scheduler can change the order of execution of divergent execution paths.

Thank you!

The occupancy is very low when I check it with the Occupancy Calculator XLSX.
For example:
Threads Per Block 256
Registers Per Thread 46
Shared Memory Per Block (bytes) 98304

showing that
Active Threads per Multiprocessor 256
Active Warps per Multiprocessor 8
Active Thread Blocks per Multiprocessor 1
Occupancy of each Multiprocessor 13%

Does it mean that the more shared memory a block uses, the fewer blocks will reside on an SM?
For example, suppose a block occupies 96 KB of shared memory.
At this time, only 8 block resides in one sm? Does it mean that a large computing core is not working in one sm?(does only 13% Computing Unit work?), so does it is very inefficient?

Besides, I have encountered a strange phenomenon.
I use shared memory to avoid non-coalesced access (data are written from global memory into shared memory and then read back out of shared memory to avoid non-coalesced access, and one block takes up almost all of the shared memory). According to the Occupancy Calculator XLSX, the occupancy of each multiprocessor is only 13%, yet my program’s performance improved.
So does this mean that 13% of the compute units doing coalesced accesses can outperform 100% of the compute units doing non-coalesced accesses?

Sorry, the sentence
“At this time, only 8 block resides in one sm? Does it mean that a large computing core is not working in one sm?(does only 13% Computing Unit work?), so does it is very inefficient?” above
should read
“At this time, only 1 block resides in one sm? Does it mean that a large computing core is not working in one sm?(does only 13% Computing Unit work?), so does it is very inefficient?”

If a single block uses 96 KB of shared memory, only 1 block will fit on a Volta SM. The warps of that block (assuming there are 4 or more) have access to all of the resources in the SM, including all of the execution pipes.

But you have severely limited the machine’s capacity to hide latency: the maximum complement (occupancy) is 256 threads due to your shared memory usage, when it could be 2048 threads otherwise (256/2048 = 12.5%, which the calculator shows as 13%).

Severely limiting the ability of the machine to hide latency may result in lower performance, which is generally why people consider occupancy.

Nonetheless, is it sometimes more efficient to use more shared memory than to increase occupancy?

Additional information about the kernel’s global memory access patterns, shared memory access patterns, data re-use, and computation would be needed to provide further feedback. Ideally this is provided as a description of the problem, the current solution, and a minimal reproducible example.

Increasing the number of active warps per cycle increases the warp schedulers’ ability to hide latency. A CUDA profiler such as Nsight Compute can help determine whether the kernel is compute bound, memory latency bound, or memory bandwidth bound.

There are definitely optimization edge cases where it pays to limit warps/SM in order to use a high number of registers per thread (128 registers/thread limits you to 16 warps/SM; 255 registers/thread limits you to 8 warps/SM) or a large amount of shared memory. Without the additional information it is hard to provide further feedback.

Thank you very much. I will think again about my program’s memory hierarchy.