How can I print a number pyramid in CUDA?

I wanted to print the following number pyramid in CUDA:

    1 
   2 2 
  3 3 3 
 4 4 4 4 
5 5 5 5 5 

The following listing is the CPU version, which works properly.

Host-only listing:

#include <stdio.h>

#define N 5 // Number of rows in the pyramid

void printPyramid()
{
    for (int tid = 0; tid < N; tid++)
    {
        // Calculate the number of spaces before the current row
        int numSpaces = N - tid - 1;

        // Calculate the starting number for the current row
        int startNum = tid + 1;

        // Print the spaces before the numbers
        for (int i = 0; i < numSpaces; i++)
        {
            printf(" ");
        }

        // Print the numbers for the current row
        for (int i = 0; i < startNum; i++)
        {
            printf("%d ", startNum);
        }

        printf("\n");
    }
}

int main()
{
    printPyramid();
    
    return 0;
}

However, my CUDA listing doesn’t produce the expected output:

user_name@192:~/CUDA_$ ./exe
          3 2 3 1 2 3 5 4 5 4 5 4 5 4 5




user_name@192:~/CUDA_$ 

CUDA listing:

#include <stdio.h>

#define N 5 // Number of rows in the pyramid

__global__ void printPyramid()
{
    int tid = threadIdx.x;
    
    // Calculate the number of spaces before the current row
    int numSpaces = N - tid - 1;
    
    // Calculate the starting number for the current row
    int startNum = tid + 1;
    
    // Print the spaces before the numbers
    for (int i = 0; i < numSpaces; i++)
    {
        printf(" ");
    }
    
    // Print the numbers for the current row
    for (int i = 0; i < startNum; i++)
    {
        printf("%d ", startNum);
    }
    
    printf("\n");
}

int main()
{
    // Launch the CUDA kernel with 1 block and N threads per block
    printPyramid<<<1, N>>>();
    
    // Wait for the kernel to finish
    cudaDeviceSynchronize();
    
    // Check for any errors during the kernel launch
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(error));
        return 1;
    }
    
    return 0;
}

How can I fix this listing?

CUDA doesn’t impose any particular thread execution ordering (unless you impose it). Therefore we would have to approach such a problem carefully to write “proper” code. We’d like to avoid using extensive serialization of threads in a warp; if we had extensive serialization there (which is certainly one way to solve a number of issues) we might just as well write a single-threaded realization.
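To make the serialization mentioned above concrete, here is a minimal sketch of what imposing a full ordering could look like. It is not the approach recommended in this reply; the names `printPyramidSerialized` and `runSerialized` are hypothetical, and it assumes a single block of at most 1024 threads and that device `printf` output appears in the order the calls are issued across the barrier.

```cuda
#include <cstdio>

// Hypothetical kernel (an illustration only): rows are printed one at a
// time, and only the thread whose index matches the row does any work.
__global__ void printPyramidSerialized(int n)
{
    for (int row = 0; row < n; row++) {
        if (threadIdx.x == row) {
            for (int i = 0; i < n - row - 1; i++) printf(" ");
            for (int i = 0; i <= row; i++) printf("%d ", row + 1);
            printf("\n");
        }
        __syncthreads();   // every thread waits here before the next row starts
    }
}

// Hypothetical host wrapper: launches one block of n threads (n <= 1024)
// and returns the first error reported by the launch or the synchronize.
cudaError_t runSerialized(int n)
{
    printPyramidSerialized<<<1, n>>>(n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}

int main()
{
    cudaError_t err = runSerialized(5);
    if (err != cudaSuccess) printf("CUDA error: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Only one thread is active for each row while all the others wait at the barrier, so the block does no more useful work in parallel than a single-threaded loop would.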

Therefore, I suggest it will be important to break the print-out into phases. We will need to identify chunks of work that are independent, that is, that don’t depend on ordering, and then impose just enough ordering to handle the rest.

Looking at your desired diagram, I note the following:

  • we must have ordering from line to line. We want the line of 2s to print after the line of 1s, for example.
  • the number of leading spaces varies from line to line
  • for the actual non-whitespace print-out, we can see some opportunity for parallel/independent work. For example, each thread that is responsible for printing could print its number plus a single space. That level of print-out is independent of the ordering among threads.

Based on those ideas I came up with this:

# cat t9.cu
#include <cstdio>
const int N = 5; // must be 1024 or less

__global__ void k(){

  for (int i = 1; i < N+1; i++) {
    if (threadIdx.x > (i-1)) printf(" ");   // print leading spaces "in parallel"
    __syncthreads();                        // before printing any digits
    if (threadIdx.x < i) printf("%d ", i);  // print the digits for each line "in parallel"
    __syncthreads();                        // before terminating the line
    if (!threadIdx.x) printf("\n");         // terminate each line (using only 1 thread)
    __syncthreads();                        // before printing the next line
  }
}

int main(){

  k<<<1,N>>>();
  cudaDeviceSynchronize();
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));
}
# nvcc -o t9 t9.cu
# ./t9
    1
   2 2
  3 3 3
 4 4 4 4
5 5 5 5 5
#

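The formatting choices might need to be adjusted for N > 9: a two-digit number makes its printed cell one character wider, while each row is still indented by single spaces, so the rows would no longer line up.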
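One possible adjustment, as a sketch only: give every cell a fixed width and indent by half a cell per row. The names `kWide` and `runWide` and the 4-character cell are assumptions made for illustration. It assumes a single block, n of at most 999 so that three digits suffice, and an n small enough that the output fits in the device printf buffer.

```cuda
#include <cstdio>

// Hypothetical variant of the kernel k() above: every cell is 4 characters
// wide ("%3d "), and each missing cell on the left is replaced by 2 spaces
// (half a cell), so the rows stay centered for multi-digit numbers.
__global__ void kWide(int n)
{
    for (int i = 1; i <= n; i++) {
        if (threadIdx.x >= i) printf("  ");      // n - i threads pad the line
        __syncthreads();                         // before printing any numbers
        if (threadIdx.x < i) printf("%3d ", i);  // i threads print the number
        __syncthreads();                         // before terminating the line
        if (threadIdx.x == 0) printf("\n");      // one thread ends the line
        __syncthreads();                         // before printing the next line
    }
}

// Hypothetical host wrapper: one block of n threads, returning the first
// error reported by the launch or the synchronize.
cudaError_t runWide(int n)
{
    kWide<<<1, n>>>(n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}

int main()
{
    cudaError_t err = runWide(12);
    if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

The phase structure is unchanged; only the two format strings differ, so the same ordering argument applies.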

If CUDA doesn’t guarantee thread ordering, why does the following listing print Fibonacci numbers in perfect order?

#include <stdio.h>
#include <stdlib.h> // malloc, free

__global__ void fibonacci(int *fib, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (tid <= 1) {
        fib[tid] = tid;
        return;
    }
    
    int previous = 0;
    int current = 1;
    int next;
    
    for (int i = 2; i <= tid; i++) {
        next = previous + current;
        previous = current;
        current = next;
    }
    
    fib[tid] = current;
}

int main() {
    int n;
    
    printf("Enter the number of Fibonacci numbers to generate: ");
    scanf("%d", &n);
    
    int *d_fib;
    int *h_fib = (int *)malloc(n * sizeof(int));
    
    cudaMalloc(&d_fib, n * sizeof(int));
    
    fibonacci<<<1, n>>>(d_fib, n);
    
    cudaMemcpy(h_fib, d_fib, n * sizeof(int), cudaMemcpyDeviceToHost);
    
    printf("Fibonacci Series: ");
    for (int i = 0; i < n; i++) {
        printf("%d ", h_fib[i]);
    }
    
    free(h_fib);
    cudaFree(d_fib);
    
    return 0;
}

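Because the kernel you have now shown has entirely independent work between threads: each thread computes one value and stores it in its own element, fib[tid]. The threads can execute in any order, and the results will be calculated properly. The order of the printed series comes from the host loop that walks h_fib by index after the kernel has finished, not from the order in which the threads ran.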

The same statement is not true for the mass of printout in your pyramid example. The various pieces of printout must be done in a particular order.

The “thread ordering” I referred to previously might be clearer if I say “thread execution ordering”: CUDA does not guarantee the order in which threads will execute.
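To tie the two examples together, here is a minimal sketch of the pyramid restructured the way the Fibonacci program is: each thread writes its row into its own slot of a buffer, and the host prints the slots in index order. The names `buildRows`, `buildPyramid`, and `stride` are hypothetical, and the sketch assumes 1 <= n <= 9 so that every number is a single digit.

```cuda
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical kernel: thread `row` fills only its own slot of `out`,
// so no thread depends on the execution order of any other thread.
__global__ void buildRows(char *out, int n, int stride)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    char *line = out + row * stride;
    int pos = 0;
    for (int i = 0; i < n - row - 1; i++) line[pos++] = ' ';   // leading spaces
    for (int i = 0; i <= row; i++) {                           // "d " repeated
        line[pos++] = static_cast<char>('0' + (row + 1));
        line[pos++] = ' ';
    }
    line[pos] = '\0';
}

// Hypothetical host wrapper: returns the whole pyramid as one string,
// or an empty string if n is out of range or a CUDA call fails.
std::string buildPyramid(int n)
{
    if (n < 1 || n > 9) return "";
    int stride = 2 * n + 1;   // longest row is 2*n characters, plus '\0'
    std::vector<char> h(n * stride);
    char *d = nullptr;
    if (cudaMalloc(&d, h.size()) != cudaSuccess) return "";

    buildRows<<<1, n>>>(d, n, stride);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h.data(), d, h.size(), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return "";

    std::string result;
    for (int row = 0; row < n; row++) {   // the host imposes the row order
        result += &h[row * stride];
        result += '\n';
    }
    return result;
}

int main()
{
    printf("%s", buildPyramid(5).c_str());
    return 0;
}
```

No barrier is needed in the kernel, because the only ordering requirement is handled by the host loop after the copy back.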
