Does launching a kernel with just 1 thread and 1 block ensure sequential execution on the GPU?

Hello everyone,

I’m currently learning CUDA programming from the book “Professional CUDA C Programming” by John Cheng. While I was practising with the examples, I came across the concept of execution and verification of CUDA kernels.

I quote what the book says:

“you can set the execution configuration to <<<1,1>>>, so you force the kernel to run with only one block and one thread. This emulates a sequential implementation. This is useful for debugging and verifying correct results. Also, this helps you verify that numeric results are bitwise exact from run-to-run if you encounter order of operations issues.”(Page: 39)

If I understand the literal meaning, the paragraph is saying that if we launch a kernel using just 1 block with just 1 thread (i.e. <<<1, 1>>>), then the CUDA code will run sequentially, like on a CPU. That way, we can check whether the kernel is functionally correct.

I used the following kernel, allocating 32 elements for each array:

__global__ void sumArraysOnGPU(float *A, float *B, float *C) {

    int tid = threadIdx.x;
    
    C[tid] = A[tid] + B[tid]; // Perform element-wise addition
}

// Launch kernel from Host
sumArraysOnGPU<<<1, 1>>>(d_A, d_B, d_C); 

Now, looking at the result and the number of threads launched in the terminal:

[Screenshot: terminal output showing the launched thread count and the first 7 elements of the GPU and CPU result vectors]

The output confirms that only one thread is launched, despite the vector size being 32, and that only the sum of the 0th elements of d_A and d_B is performed and stored in the 0th element of d_C (the screenshot prints the first 7 indices of the GPU and CPU vectors). But according to the book, it is supposed to add all 32 elements using just 1 thread, in a sequential manner.

If I unrolled the “C[tid] = A[tid] + B[tid]” expression, it would probably make more sense. But in that chapter, the author says nothing about unrolling.

So did I understand the paragraph wrong? Or, is the paragraph itself wrong?

Thanks in advance!

Yes, a single-thread kernel launch (<<<1,1>>>) will guarantee C++-style sequential execution of your kernel code. The support for this claim comes from CUDA’s claimed adherence to C++.

However, that does not mean that the results will always be identical compared to any other kernel launch configuration.

When I am teaching CUDA, I usually advise people to hang onto their C++ knowledge. Don’t flush that and assume that everything is different, and there is weirdness lurking everywhere.

When you only launch one thread, that thread runs your kernel code once, and then exits. It does not magically run it multiple times. (Pretty much what my C++ knowledge suggests: just like the behavior of a C++ function call.)

If you want to write or design a CUDA kernel that will do the same amount of work regardless of grid size or kernel launch configuration, then the grid-stride loop methodology may be of interest.
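
For illustration, here is a minimal grid-stride sketch of the vector add above (the kernel name and the explicit n parameter are my additions, not from the book or the original post):

__global__ void sumArraysGridStride(float *A, float *B, float *C, int n) {

    // Start at this thread's global index and stride by the total
    // number of threads in the grid, so all n elements are covered
    // regardless of the launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        C[i] = A[i] + B[i];
    }
}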

But without a loop of some sort in your kernel code, the code will be executed only once per thread. And if the code that is executed once updates only one location, then only one result will be computed if you launch with <<<1,1>>>.

In C++, I have never heard of unrolling applied to anything other than a loop. There is no loop there.


Thank you for your response. I have some questions about your answer to make my understanding clear.

Q1: Regarding the grid-stride loop methodology, am I correct in understanding that this technique allows a kernel to process the entire dataset efficiently, even when the kernel configuration is <<<1, 1>>>?

Q2: Regarding kernel verification using a single-thread configuration <<<1, 1>>>(from the book), am I correct to assume that this only checks the correctness of the kernel’s computation for a single element of the vector, rather than the entire vector?

Q3: Understood; since ‘C[tid] = A[tid] + B[tid]’ is a single operation and not a loop, unrolling isn’t applicable. However, if I were to introduce a loop inside the kernel to handle multiple elements with a single thread, could loop unrolling be used in that scenario to optimize performance?

This article on grid-stride loops has helped me understand the fundamentals and benefits of the concept. However, I’ve asked the above questions to completely clear up any remaining confusion.

Thanks

No. It’s not necessarily efficient. <<<1,1>>> is never efficient. Ever. But it will process the entire dataset, or whatever you have set the for-loop limit to.
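
Concretely, with a grid-stride kernel like the sketch earlier in the thread (names are from that sketch, not from the book), both of these launches compute all 32 results; only the degree of parallelism differs:

sumArraysGridStride<<<1, 1>>>(d_A, d_B, d_C, 32);   // one thread does all 32 iterations, sequentially
sumArraysGridStride<<<1, 32>>>(d_A, d_B, d_C, 32);  // one iteration per thread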

Yes, that seems like a reasonable summary to me. You have already proven to yourself that it does not, in the general case, process the entire dataset.

Loop unrolling can help with performance in various cases. The compiler will aggressively unroll loops for this purpose. It doesn’t generally require explicit help or instruction from you to do so.
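
If you do want to hint the compiler explicitly, CUDA supports #pragma unroll on loops. A minimal sketch, assuming a trip count known at compile time (the kernel name and the fixed count of 32 are illustrative):

__global__ void sumArraysUnrolled(float *A, float *B, float *C) {

    // The trip count is a compile-time constant, so the compiler can
    // fully unroll this loop; the pragma just makes the request explicit.
    #pragma unroll
    for (int i = 0; i < 32; ++i) {
        C[i] = A[i] + B[i];
    }
}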

Thank you for the explanation. I now have a clear understanding of the <<<1,1>>> configuration.

As for loop unrolling, I need to study further to understand its application better. For the time being, I’ll proceed with exploring other aspects of CUDA programming.
