Hello everyone,
I’m currently learning CUDA programming from the book “Professional CUDA C Programming” by John Cheng. While practising with the examples, I came across the concept of executing and verifying CUDA kernels.
I quote what the book says:
“you can set the execution configuration to <<<1,1>>>, so you force the kernel to run with only one block and one thread. This emulates a sequential implementation. This is useful for debugging and verifying correct results. Also, this helps you verify that numeric results are bitwise exact from run-to-run if you encounter order of operations issues.”(Page: 39)
If I understood the literal meaning correctly, the paragraph is saying that if we launch a kernel with just 1 block and 1 thread (i.e. <<<1, 1>>>), the CUDA code will run sequentially, like on a CPU. In that way, we can check whether the kernel is functionally correct.
I have used the following kernel, allocating 32 elements for each array:
__global__ void sumArraysOnGPU(float *A, float *B, float *C) {
int tid = threadIdx.x;
C[tid] = A[tid] + B[tid]; // Perform element-wise addition
}
// Launch kernel from Host
sumArraysOnGPU<<<1, 1>>>(d_A, d_B, d_C);
Now if we look at the result and the number of threads launched in the terminal: the output confirms that only one thread is launched, despite the vector size being 32, and the sum is performed only between the 0th index of d_A and d_B and stored in the 0th index of d_C. (In the screenshot, I printed the first 7 indices of the GPU and CPU vectors.) But according to the book, it is supposed to add all 32 elements using just 1 thread in a sequential manner.
If I wrapped the “C[tid] = A[tid] + B[tid]” expression in a loop over all 32 elements, it would probably make more sense. But in that chapter, the author said nothing about adding such a loop.
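For reference, here is a sketch of what I imagine such a looped variant would look like (sumArraysOnGPUSeq and the N parameter are my own additions, not from the book): with <<<1, 1>>>, the single thread would iterate over every element itself, which really would emulate a sequential CPU implementation.

```cuda
// Hypothetical variant: one thread loops over all N elements sequentially.
__global__ void sumArraysOnGPUSeq(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N; ++i) {
        C[i] = A[i] + B[i]; // element-wise addition, one element per iteration
    }
}

// Launch from host: a single block with a single thread processes all 32 elements.
// sumArraysOnGPUSeq<<<1, 1>>>(d_A, d_B, d_C, 32);
```

As written, the book's kernel (indexing only by threadIdx.x) touches exactly one element per thread, so with one thread only C[0] gets computed.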
So did I understand the paragraph wrong? Or, is the paragraph itself wrong?
Thanks in advance!