I wanted to print the following number pyramid in CUDA:

1
2 2
3 3 3
4 4 4 4
5 5 5 5 5

The follwing listing is the CPU version which works properly.

Host-only listing:

#include <stdio.h>
#define N 5 // Number of rows in the pyramid
void printPyramid()
{
for (int tid = 0; tid < N; tid++)
{
// Calculate the number of spaces before the current row
int numSpaces = N - tid - 1;
// Calculate the starting number for the current row
int startNum = tid + 1;
// Print the spaces before the numbers
for (int i = 0; i < numSpaces; i++)
{
printf(" ");
}
// Print the numbers for the current row
for (int i = 0; i < startNum; i++)
{
printf("%d ", startNum);
}
printf("\n");
}
}
int main()
{
printPyramid();
return 0;
}

However, my CUDA listing doesn’t work as expected and doesn’t produce the proper output:

#include <stdio.h>
#define N 5 // Number of rows in the pyramid
__global__ void printPyramid()
{
int tid = threadIdx.x;
// Calculate the number of spaces before the current row
int numSpaces = N - tid - 1;
// Calculate the starting number for the current row
int startNum = tid + 1;
// Print the spaces before the numbers
for (int i = 0; i < numSpaces; i++)
{
printf(" ");
}
// Print the numbers for the current row
for (int i = 0; i < startNum; i++)
{
printf("%d ", startNum);
}
printf("\n");
}
int main()
{
// Launch the CUDA kernel with 1 block and N threads per block
printPyramid<<<1, N>>>();
// Wait for the kernel to finish
cudaDeviceSynchronize();
// Check for any errors during the kernel launch
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess)
{
printf("CUDA error: %s\n", cudaGetErrorString(error));
return 1;
}
return 0;
}

CUDA doesn’t impose any particular thread execution ordering (unless you impose it). Therefore we would have to approach such a problem carefully to write “proper” code. We’d like to avoid using extensive serialization of threads in a warp; if we had extensive serialization there (which is certainly one way to solve a number of issues) we might just as well write a single-threaded realization.

Therefore, I suggest it will be important to break the print-out into phases. We will need to identify chunks of work that are independent that is don’t depend on ordering, and then impose just enough ordering to handle the rest.

Looking at your desired diagram, I note the following:

we must have ordering from line to line. We want the line of 2 to print after the line of 1, for example.

we have the issue of the variability of the number of leading spaces on the line

for the actual non-whitespace print-out, we can see some opportunity for parallel/independent work there. For example each thread that is responsible for printing could print its number plus a single space. That level of print-out is independent of the ordering among threads.

Based on those ideas I come up with this:

# cat t9.cu
#include <cstdio>
const int N = 5; // must be 1024 or less
__global__ void k(){
for (int i = 1; i < N+1; i++) {
if ((threadIdx.x) > (i-1)) printf(" "); // print leading spaces "in parallel"
__syncthreads(); // before printing any digits
if (threadIdx.x < i) printf("%d ", i); // print the digits for each line "in parallel"
__syncthreads(); // before terminating the line
if (!threadIdx.x) printf("\n"); // terminate each line (using only 1 thread)
__syncthreads(); // before printing the next line
}
}
int main(){
k<<<1,N>>>();
cudaDeviceSynchronize();
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));
}
# nvcc -o t9 t9.cu
# ./t9
1
2 2
3 3 3
4 4 4 4
5 5 5 5 5
#

The formatting choices might need to be adjusted for N > 9.

If CUDA doesn’t guarantee thread ordering, why does the following listing print fibonacci numbers in a perfect order?

#include <stdio.h>
__global__ void fibonacci(int *fib, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid <= 1) {
fib[tid] = tid;
return;
}
int previous = 0;
int current = 1;
int next;
for (int i = 2; i <= tid; i++) {
next = previous + current;
previous = current;
current = next;
}
fib[tid] = current;
}
int main() {
int n;
printf("Enter the number of Fibonacci numbers to generate: ");
scanf("%d", &n);
int *d_fib;
int *h_fib = (int *)malloc(n * sizeof(int));
cudaMalloc(&d_fib, n * sizeof(int));
fibonacci<<<1, n>>>(d_fib, n);
cudaMemcpy(h_fib, d_fib, n * sizeof(int), cudaMemcpyDeviceToHost);
printf("Fibonacci Series: ");
for (int i = 0; i < n; i++) {
printf("%d ", h_fib[i]);
}
free(h_fib);
cudaFree(d_fib);
return 0;
}

Because the kernel you have now shown has entirely independent work between threads. The threads can execute in any order, and the results will be calculated properly.

The same statement is not true for the mass of printout in your pyramid example. The various pieces of printout must be done in a particular order.

The “thread ordering” I referred to previously might be clearer if I say “thread execution ordering” CUDA does not guarantee the order in which threads will execute.