Hi, I’m new to CUDA, and I’ve run into a problem where the number of concurrently running blocks per SM does not match the theoretical value. My device is a Tesla V100 (compute capability 7.0), with CUDA 10.2 and Nsight Systems 2023.3.1.
Here is a trial kernel:
const int warp = 32, w = 16;
__global__ void test_2U() {
    __shared__ double tile[2 * w][warp + 1];
    __shared__ double J[2 * w][2 * w + 1];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    for (int i = 0; i < 1000; i++) {
        if (tx < 2 * w) {
            tile[ty][tx] = 2 * ty + tx;
            J[ty][tx] = 3 * ty + tx;
        }
        __syncthreads();
    }
}
The launch statistics and the occupancy I calculated are as follows:
However, when I call the kernel three times serially with different grid sizes, the result is not what I expected.
dim3 block_dim(warp, w);
test_2U<<<80, block_dim, 0, stream[0]>>>();
test_2U<<<160, block_dim, 0, stream[0]>>>();
test_2U<<<320, block_dim, 0, stream[0]>>>();
From my understanding, the number of active blocks per SM is restricted by three properties of the kernel: registers, threads, and shared memory used. In theory, the running times of the three launches should be close, but observing in Nsight Systems, the time roughly doubles from one launch to the next. It seems that only one block is resident on each SM.
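For what it’s worth, I believe the theoretical limit can also be queried at runtime; a minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor (assuming test_2U is defined in the same translation unit and launched with a 32 x 16 block and no dynamic shared memory):
int max_blocks = 0;
// queries how many blocks of test_2U can be resident on one SM at once
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, test_2U, 32 * 16, 0);
printf("theoretical active blocks per SM: %d\n", max_blocks);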
Is my understanding correct? Or is there some other restriction that causes this result?
If you are compiling for debug device code (-G), it’s generally more difficult to predict performance. If you are compiling without that switch, most of your code should be optimized away by the compiler, since it does not affect global state; that again makes performance predictions difficult.
Thanks for the reply. I’m not compiling for debug device code. I also tried measuring the time from CUDA host code, and the running times still differ a lot. I’m still confused.
My environment is Linux. My compile command is /usr/local/cuda-10.2/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o test test.o, which follows the pattern used in the CUDA samples.
When I compile your code with CUDA 12.2, I get some warnings. Even if you don’t see these warnings with CUDA 10.2, they are significant: the compiler will remove much of your code.
# cat t38.cu
const int warp = 32, w = 16;
__global__ void test_2U() {
    __shared__ double tile[2 * w][warp + 1];
    __shared__ double J[2 * w][2 * w + 1];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    for (int i = 0; i < 1000; i++) {
        if (tx < 2 * w) {
            tile[ty][tx] = 2 * ty + tx;
            J[ty][tx] = 3 * ty + tx;
        }
        __syncthreads();
    }
}
int main() {
    dim3 block_dim(warp, w);
    test_2U<<<80, block_dim>>>();
    test_2U<<<160, block_dim>>>();
    test_2U<<<320, block_dim>>>();
    cudaDeviceSynchronize();
}
root@hpe-dl385-gen10-005:~/bobc# nvcc -o t38 t38.cu
t38.cu(3): warning #550-D: variable "tile" was set but never used
__attribute__((shared)) double tile[2 * w][warp + 1];
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
t38.cu(4): warning #550-D: variable "J" was set but never used
__attribute__((shared)) double J[2 * w][2 * w + 1];
^
#
nvcc is an optimizing compiler. It sees that the code setting tile and J does not affect any user-visible outcome, so it will remove at least the body of the loop (i.e. the entire if-statement in your kernel code). That may or may not line up with what you are trying to do.
If we look at compiler explorer for your case, we see that there isn’t any SASS code to inspect - the compiler has optimized your kernel to something empty. In that case, you are probably not timing or measuring anything consistent with what you believe you are measuring. Your kernel code (i.e. each block) may exit before the block scheduler has a chance to deposit an additional block on an SM, so you may be measuring something like block scheduler throughput here.
(the code generation behavior of nvcc can change over time, from one release to another. So if you use a newer CUDA version, you may see something different - you can play with this in compiler explorer).
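One common way to keep the work from being removed is to make its result observable outside the kernel. Here is a rough sketch of that idea (the out parameter and the final store are illustrative additions, not part of your original code; the compiler may still simplify the loop since every iteration writes the same values, but the stores themselves can no longer be discarded):
const int warp = 32, w = 16;
__global__ void test_2U(double *out) {
    __shared__ double tile[2 * w][warp + 1];
    __shared__ double J[2 * w][2 * w + 1];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    for (int i = 0; i < 1000; i++) {
        if (tx < 2 * w) {
            tile[ty][tx] = 2 * ty + tx;
            J[ty][tx] = 3 * ty + tx;
        }
        __syncthreads();
    }
    // writing the shared-memory results to global memory makes the work
    // above visible to the caller, so the compiler cannot discard it
    if (tx < 2 * w)
        out[blockIdx.x * blockDim.y * 2 * w + ty * 2 * w + tx] = tile[ty][tx] + J[ty][tx];
}
In that version, out would need gridDim.x * blockDim.y * 2 * w doubles allocated with cudaMalloc.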
So, after the compiler’s optimization, the running time of one block of this kernel is so short that by the time the block scheduler tries to deposit an additional block on the SM, the previously scheduled block has already finished, and therefore there are no concurrent blocks on the same SM. Am I right?
If I’m right, I have a further question. If one block of the kernel lasts 1 ms, will the block scheduler have a chance to deposit additional blocks? More generally, how long does a block have to run before the scheduler has a chance to make multiple blocks concurrent?
Another question: I use the following code to measure the running time:
cudaDeviceSynchronize();
start = clock();
// ... kernel launch being timed ...
cudaDeviceSynchronize();   // wait for the kernel to finish before reading the end time
end = clock();
double lib_time = (double)(end - start) / CLOCKS_PER_SEC;
Is it suitable?
Thanks for your patient explanation!
Yes, I was suggesting that may be happening. I haven’t run your case, and I don’t have a machine conveniently set up with a V100 and CUDA 10.2 at the moment. I personally would not want to spend my time deducing the exact low-level behavior of an empty kernel. It’s not an important case for me.
Yes, probably. 1 ms is probably more than 3 orders of magnitude longer than the run time of an “empty” kernel.
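If you want each block to reliably last on the order of 1 ms for this kind of experiment, one simple approach is to spin on clock64(); a rough sketch (the cycle count assumes roughly a 1.4 GHz SM clock, so adjust it for your device):
__global__ void spin_about_1ms() {
    // busy-wait for ~1.4 million SM cycles, roughly 1 ms at ~1.4 GHz
    long long start = clock64();
    while (clock64() - start < 1400000LL) { }
}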
It’s not a method I typically use. I’m unfamiliar with clock(), but I have heard that it may vary in its behavior from one setting to another.
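A common alternative for timing device work is CUDA events; a minimal sketch (grid and block_dim stand in for whatever launch configuration you are timing):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
test_2U<<<grid, block_dim>>>();   // the work being timed
cudaEventRecord(stop);
cudaEventSynchronize(stop);       // wait until the stop event has actually occurred
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);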
Thanks. Sorry, I suddenly thought of an additional question about active blocks. If two kernels are running in different concurrent streams and the resources of one SM are sufficient for both, can blocks from the two kernels run on the same SM simultaneously?
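For context, the situation I have in mind looks roughly like this (kernelA and kernelB are hypothetical kernels whose combined resource usage fits on one SM):
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
kernelA<<<40, block_dim, 0, s0>>>();   // launched into separate streams so the
kernelB<<<40, block_dim, 0, s1>>>();   // two kernels are allowed to overlap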
Hi Robert. I tested a kernel whose blocks last about 1 ms and it works as expected. However, I found another issue: copying data from global memory to registers seems to limit the number of blocks that run concurrently per SM.
Here is my test kernel:
// use a prefetch technique to loop multiple times from GMEM to SMEM
__shared__ double tile[32][32];
int n = 4096;
double tile_pre[2];
tile[ty][tx] = dev_A[XX];        // independent address for each thread and each block, same below
tile[ty + 16][tx] = dev_A[XX];
__syncthreads();
#pragma unroll
for (int i = 0; i + 32 <= n; i += 32) {
    // prefetch the next data into registers
    if (i + 32 + tx < n) {
        tile_pre[0] = dev_A[XXX];
        tile_pre[1] = dev_A[XXX];
    }
    // do some calculation on the shared memory data
    ...
    __syncthreads();
    // store the prefetched data into shared memory
    if (i + 32 + tx < n) {
        tile[ty][tx] = tile_pre[0];
        tile[ty + 16][tx] = tile_pre[1];
    }
    __syncthreads();
}
The block dimensions are (32, 16), and all blocks access global memory without conflicts. The maximum number of active blocks per SM for this kernel is 4. With n = 4096 it works well: the running time does not increase as long as the number of blocks stays within SMs * active blocks per SM. However, with n = 8192 it breaks down, and it looks as if only one block runs on each SM at a time.
Is this due to the frequency of global-memory accesses? Or is there anything I can improve in my code?
Thanks!
I’m sorry, but after digging into the details, my previous conclusion that only one block runs at a time turns out to be wrong.
My device has 80 SMs, and I recompiled and ran this kernel with 80, 160, 240, and 320 blocks in turn. The observed running times were roughly x, 2x, 3x, and 4x. My assumption was that a larger n should only lengthen the time each block takes, not break the concurrency of multiple blocks within the active-blocks-per-SM limit, which is how I jumped to the previous conclusion.
For more detail, I recorded the running time per block with clock64(), varying the number of blocks from 80 to 160 with n fixed at 8192. The results show that the duration of each block is consistent across blocks, but it roughly doubles as the number of blocks doubles. So the real effect seems to be that allocating more blocks leads to a longer duration for each individual block, not that only one block runs on each SM at a time.
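For reference, the per-block timing was done along these lines; a simplified sketch (block_cycles is a hypothetical output array of gridDim.x entries, and the real kernel body is omitted):
__global__ void timed_kernel(long long *block_cycles /*, ... */) {
    long long t0 = clock64();
    // ... original kernel body ...
    __syncthreads();
    // one thread per block records how many SM cycles the block took
    if (threadIdx.x == 0 && threadIdx.y == 0)
        block_cycles[blockIdx.x] = clock64() - t0;
}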