Hi, I’m new to CUDA, and I’ve run into a problem where the number of concurrently running blocks per SM does not match the theoretical value. My device is a Tesla V100 (compute capability 7.0), with CUDA 10.2 and Nsight Systems 2023.3.1.
Here is a trial kernel:
const int warp = 32, w = 16;
__global__ void test_2U() {
    __shared__ double tile[2 * w][warp + 1];
    __shared__ double J[2 * w][2 * w + 1];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    for (int i = 0; i < 1000; i++) {
        if (tx < 2 * w) {
            tile[ty][tx] = 2 * ty + tx;
            J[ty][tx] = 3 * ty + tx;
        }
        __syncthreads();
    }
}
The launch statistics and the occupancy I calculated are as follows:
However, when I call the kernel three times serially with different grid sizes, the result is not what I expected.
dim3 block_dim(warp, w);
test_2U<<<80, block_dim, 0, stream[0]>>>();
test_2U<<<160, block_dim, 0, stream[0]>>>();
test_2U<<<320, block_dim, 0, stream[0]>>>();
From my understanding, the number of active blocks per SM is restricted by three properties of the kernel: registers, threads, and shared memory used. In theory, the running times of the three launches should be close, but observing in Nsight Systems, the time roughly doubles from one launch to the next. It seems that only one block is resident on each SM.
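For what it’s worth, I believe the theoretical limit can also be queried at runtime; a minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor (assuming test_2U is defined in the same translation unit and launched with a 32 x 16 block and no dynamic shared memory):
int max_blocks = 0;
// queries how many blocks of test_2U can be resident on one SM at once
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, test_2U, 32 * 16, 0);
printf("theoretical active blocks per SM: %d\n", max_blocks);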
Is my understanding correct? Or is there some other restriction that causes this result?
If you are compiling for debug device code (-G), it’s generally more difficult to predict performance. If you are compiling without that switch, most of your code should be optimized away by the compiler, since it does not affect global state; that again makes performance predictions difficult.
Thanks for the reply. I’m not compiling for debug device code. I also tried measuring the time from CUDA host code, and the running times still differ a lot. I’m still confused.
My environment is Linux. My compile command is /usr/local/cuda-10.2/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o test test.o, which follows the pattern used in the CUDA samples.
When I compile your code with CUDA 12.2, I get some warnings. Even if you don’t see these warnings with CUDA 10.2, they are significant: the compiler will remove much of your code.
# cat t38.cu
const int warp = 32, w = 16;
__global__ void test_2U() {
    __shared__ double tile[2 * w][warp + 1];
    __shared__ double J[2 * w][2 * w + 1];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    for (int i = 0; i < 1000; i++) {
        if (tx < 2 * w) {
            tile[ty][tx] = 2 * ty + tx;
            J[ty][tx] = 3 * ty + tx;
        }
        __syncthreads();
    }
}
int main() {
    dim3 block_dim(warp, w);
    test_2U<<<80, block_dim>>>();
    test_2U<<<160, block_dim>>>();
    test_2U<<<320, block_dim>>>();
    cudaDeviceSynchronize();
}
root@hpe-dl385-gen10-005:~/bobc# nvcc -o t38 t38.cu
t38.cu(3): warning #550-D: variable "tile" was set but never used
__attribute__((shared)) double tile[2 * w][warp + 1];
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
t38.cu(4): warning #550-D: variable "J" was set but never used
__attribute__((shared)) double J[2 * w][2 * w + 1];
^
#
nvcc is an optimizing compiler. It sees that the code setting tile and J does not affect any user-visible outcome, so it will remove at least the body of the loop (i.e. the entire if-statement in your kernel code). That may or may not line up with what you are trying to do.
If we look at compiler explorer for your case, we see that there isn’t any SASS code to inspect - the compiler has optimized your kernel to something empty. In that case, you are probably not timing or measuring anything consistent with what you believe you are measuring. Your kernel code (i.e. each block) may exit before the block scheduler has a chance to deposit an additional block on an SM, so you may be measuring something like block scheduler throughput here.
(the code generation behavior of nvcc can change over time, from one release to another. So if you use a newer CUDA version, you may see something different - you can play with this in compiler explorer).
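One common way to keep the work from being removed is to make its result observable outside the kernel. Here is a rough sketch of that idea (the out parameter and the final store are illustrative additions, not part of your original code; the compiler may still simplify the loop since every iteration writes the same values, but the stores themselves can no longer be discarded):
const int warp = 32, w = 16;
__global__ void test_2U(double *out) {
    __shared__ double tile[2 * w][warp + 1];
    __shared__ double J[2 * w][2 * w + 1];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    for (int i = 0; i < 1000; i++) {
        if (tx < 2 * w) {
            tile[ty][tx] = 2 * ty + tx;
            J[ty][tx] = 3 * ty + tx;
        }
        __syncthreads();
    }
    // writing the shared-memory results to global memory makes the work
    // above visible to the caller, so the compiler cannot discard it
    if (tx < 2 * w)
        out[blockIdx.x * blockDim.y * 2 * w + ty * 2 * w + tx] = tile[ty][tx] + J[ty][tx];
}
In that version, out would need gridDim.x * blockDim.y * 2 * w doubles allocated with cudaMalloc.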
So, after the compiler’s optimization, the running time of one block of this kernel is so short that by the time the block scheduler tries to deposit an additional block on the SM, the previously scheduled block has already finished, and therefore there are no concurrent blocks on the same SM. Am I right?
If I’m right, I have a further question. If one block of the kernel lasts 1 ms, will the block scheduler have a chance to deposit additional blocks? More generally, how long does a block have to run before the scheduler has a chance to make multiple blocks concurrent?
Another question: I use the following code to measure the running time:
cudaDeviceSynchronize();
start = clock();
// ... kernel launch being timed ...
cudaDeviceSynchronize();   // wait for the kernel to finish before reading the end time
end = clock();
double lib_time = (double)(end - start) / CLOCKS_PER_SEC;
Is it suitable?
Thanks for your patient explanation!
Yes, I was suggesting that may be happening. I haven’t run your case, and I don’t have a machine conveniently set up with a V100 and CUDA 10.2 at the moment. I personally would not want to spend my time deducing the exact low-level behavior of an empty kernel. It’s not an important case for me.
Yes, probably. 1 ms is probably more than 3 orders of magnitude longer than the run time of an “empty” kernel.
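If you want each block to reliably last on the order of 1 ms for this kind of experiment, one simple approach is to spin on clock64(); a rough sketch (the cycle count assumes roughly a 1.4 GHz SM clock, so adjust it for your device):
__global__ void spin_about_1ms() {
    // busy-wait for ~1.4 million SM cycles, roughly 1 ms at ~1.4 GHz
    long long start = clock64();
    while (clock64() - start < 1400000LL) { }
}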
It’s not a method I typically use. I’m unfamiliar with clock(), but I have heard that it may vary in its behavior from one setting to another.
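A common alternative for timing device work is CUDA events; a minimal sketch (grid and block_dim stand in for whatever launch configuration you are timing):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
test_2U<<<grid, block_dim>>>();   // the work being timed
cudaEventRecord(stop);
cudaEventSynchronize(stop);       // wait until the stop event has actually occurred
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);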
Thanks. Sorry, I suddenly thought of an additional question about active blocks. If two kernels are running in different concurrent streams and the resources of one SM are sufficient for both, can blocks from the two kernels run on the same SM simultaneously?
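For context, the situation I have in mind looks roughly like this (kernelA and kernelB are hypothetical kernels whose combined resource usage fits on one SM):
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
kernelA<<<40, block_dim, 0, s0>>>();   // launched into separate streams so the
kernelB<<<40, block_dim, 0, s1>>>();   // two kernels are allowed to overlap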
Hi Robert. I tested a kernel whose blocks last about 1 ms and it works as expected. However, I found another issue: copying data from global memory to registers seems to limit the number of blocks that run concurrently per SM.
Here is my test kernel:
// use a prefetch technique to loop multiple times from GMEM to SMEM
__shared__ double tile[32][32];
int n = 4096;
double tile_pre[2];
tile[ty][tx] = dev_A[XX];        // independent address for each thread and each block, same below
tile[ty + 16][tx] = dev_A[XX];
__syncthreads();
#pragma unroll
for (int i = 0; i + 32 <= n; i += 32) {
    // prefetch the next data into registers
    if (i + 32 + tx < n) {
        tile_pre[0] = dev_A[XXX];
        tile_pre[1] = dev_A[XXX];
    }
    // do some calculation on the shared memory data
    ...
    __syncthreads();
    // store the prefetched data into shared memory
    if (i + 32 + tx < n) {
        tile[ty][tx] = tile_pre[0];
        tile[ty + 16][tx] = tile_pre[1];
    }
    __syncthreads();
}
The block dimensions are (32, 16), and all blocks access global memory without conflicts. The maximum number of active blocks per SM for this kernel is 4. With n = 4096 it works well: the running time does not increase as long as the number of blocks stays within SMs * active blocks per SM. However, with n = 8192 it breaks down, and it looks as if only one block runs on each SM at a time.
Is this due to the frequency of global-memory accesses? Or is there anything I can improve in my code?
Thanks!
I’m sorry, but after digging into the details, my previous conclusion that only one block runs at a time turns out to be wrong.
My device has 80 SMs, and I recompiled and ran this kernel with 80, 160, 240, and 320 blocks in turn. The observed running times were roughly x, 2x, 3x, and 4x. My assumption was that a larger n should only lengthen the time each block takes, not break the concurrency of multiple blocks within the active-blocks-per-SM limit, which is how I jumped to the previous conclusion.
For more detail, I recorded the running time per block with clock64(), varying the number of blocks from 80 to 160 with n fixed at 8192. The results show that the duration of each block is consistent across blocks, but it roughly doubles as the number of blocks doubles. So the real effect seems to be that allocating more blocks leads to a longer duration for each individual block, not that only one block runs on each SM at a time.
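For reference, the per-block timing was done along these lines; a simplified sketch (block_cycles is a hypothetical output array of gridDim.x entries, and the real kernel body is omitted):
__global__ void timed_kernel(long long *block_cycles /*, ... */) {
    long long t0 = clock64();
    // ... original kernel body ...
    __syncthreads();
    // one thread per block records how many SM cycles the block took
    if (threadIdx.x == 0 && threadIdx.y == 0)
        block_cycles[blockIdx.x] = clock64() - t0;
}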