So the total number of threads should be 16384 (32 × 512). By launching this kernel, would all of the kernel's threads be resident or active on the SMs? I think resident and active are different things; is that correct? So maybe all are resident but only some are active? I would appreciate it if you could make this clear to me.
Do I need 1 SM or more? Can I put all of them on 1 SM and save the other resources?
What happens if I launch 32 instances of this kernel simultaneously?
In a part of the CUDA documents I am reading:
the GV100 SM provides 64 FP32 cores and 32 FP64 cores. The GV100 SM additionally includes 64 INT32 cores and 8 mixed-precision Tensor Cores. GV100 provides up to 84 SMs.
So here, for performing float operations, how many SMs do I have? Are SM and core the same thing? (84 SMs vs. 64 FP32 cores)
| | GV100 | GA100 |
| --- | --- | --- |
| Streaming Multiprocessors (SMs) | 80/84 | 108/128 |
| Max Thread Blocks Per SM | 32 | 32 |
| Max Warps Per SM | 64 | 64 |
| Max Threads per Thread Block | 1024 | 1024 |
The language provides no guarantees that work will be co-resident. Resident and active are interchangeable as they relate to thread blocks (CTAs), warps, and threads. The term active means that the scheduled entity has all of its resources allocated and has been assigned to a warp scheduler on a Streaming Multiprocessor (SM).
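If it helps, here is a minimal sketch (not from the original post) that queries the per-SM residency limits from the table above through the CUDA runtime API; the maxBlocksPerMultiProcessor field assumes CUDA 11 or newer:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max blocks per SM:     %d\n", prop.maxBlocksPerMultiProcessor);  // CUDA 11+
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}
```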
The language does not specify. In practice the compute work distributor will distribute thread blocks (CTAs) round-robin to the streaming multiprocessors (SMs). Given a {32,1,1} grid, only 32 of the 80 (GV100) or 108 (GA100) SMs will be assigned a thread block. For the SMs that are assigned a thread block, the {512,1,1} threads will be rasterized into 16 warps and assigned round-robin to the SM sub-partitions (warp schedulers).
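As a concrete sketch of that launch shape (the kernel body is a placeholder, since the original post does not show one):

```
#include <cuda_runtime.h>

// Placeholder kernel; the real kernel body was not posted.
__global__ void dummyKernel() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id, 0..16383
    (void)idx;
}

int main() {
    // {32,1,1} grid of {512,1,1} blocks: 32 * 512 = 16384 threads.
    // Each block is rasterized into 512 / 32 = 16 warps, distributed
    // round-robin across the SM's warp schedulers.
    dummyKernel<<<32, 512>>>();
    cudaDeviceSynchronize();
    return 0;
}
```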
Launching 32 grids of the same kernel in the same stream, the grids will execute in order; there is no concurrency between grids. Only 32 of the 80 or 108 SMs will be active at any time. The other SMs will be idle.
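For example, the same-stream case looks like this (reusing the placeholder kernel from the sketch above):

```
#include <cuda_runtime.h>

__global__ void dummyKernel() {}  // placeholder, as above

int main() {
    // All 32 launches go into the default stream, so they execute back
    // to back: grid i+1 cannot start until grid i has drained.
    for (int i = 0; i < 32; ++i) {
        dummyKernel<<<32, 512>>>();
    }
    cudaDeviceSynchronize();
    return 0;
}
```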
Launching 32 grids of the same kernel in different streams will result in concurrent execution. You have not provided enough details to determine the warp occupancy. For a simple kernel (no shared memory, a small number of registers per thread), thread blocks will be distributed from the first grid, then the second grid, … until the GPU is saturated. As thread blocks complete, the compute work distributor will distribute new work to the SMs. There is no guarantee on the order in which streams will be scheduled on the GPU.
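A sketch of the concurrent variant, again with a placeholder kernel, using one non-default stream per grid:

```
#include <cuda_runtime.h>

__global__ void dummyKernel() {}  // placeholder, as above

int main() {
    cudaStream_t streams[32];
    for (int i = 0; i < 32; ++i)
        cudaStreamCreate(&streams[i]);

    // One grid per stream; the compute work distributor may overlap them.
    for (int i = 0; i < 32; ++i)
        dummyKernel<<<32, 512, 0, streams[i]>>>();  // 0 bytes dynamic shared memory

    cudaDeviceSynchronize();
    for (int i = 0; i < 32; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```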
The term “core” in the CUDA manual refers to an instruction execution pipeline, not to a “CPU core”. At a first pass you are interested in making sure each SM has sufficient warps (threads) to hide latency. This often requires more than 1024 threads per SM if the kernel is latency bound. If the kernel is compute bound, only 256–512 threads per SM may be required.
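To check where a given kernel lands, you can ask the runtime occupancy API how many of its blocks can be resident per SM. A sketch, again using a placeholder kernel:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() {}  // placeholder, as in the sketches above

int main() {
    // Ask the runtime how many blocks of this kernel fit per SM at
    // 512 threads per block with no dynamic shared memory.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, dummyKernel, /*blockSize=*/512, /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int warpsPerSM    = blocksPerSM * (512 / prop.warpSize);      // resident warps
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Resident blocks/SM: %d (%d of %d warps)\n",
           blocksPerSM, warpsPerSM, maxWarpsPerSM);
    return 0;
}
```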