Implicit Memory Transfers with Kernels

Hello CUDA experts,

I was about to start a new MPI + CUDA C++17 project (with the new CUDA 12.3) when, by accident, I noticed an interesting behavior when passing arguments to my kernel function directly from the host: simple scalar variables (int, float, and double) appear to be implicitly copied from host to device at compile time.

I was aware that structures and class objects were implicitly copied/initialized on the device at compile time, and that I only had to manage transfers of arrays of data explicitly.

Thus my questions:

  • is this the expected behavior or is this related to unified memory behavior?
  • if this is the expected behavior, what other types of implicit transfers are also available?
  • lastly, can I rely on this behavior when developing MPI applications?

I have written a simple hello-world example to document my question:

#include <mpi.h>
#include <iostream>
#include <cuda_runtime.h>

// CUDA kernel function
__global__ void printFromDevice(int deviceId, double somevalue) {
    printf("Hello World from GPU! I'm device %d with value %g\n", deviceId, somevalue);
}

int main(int argc, char** argv) {

    // Initialize MPI
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Check if there are enough devices
    int deviceCount, deviceId;
    cudaGetDeviceCount(&deviceCount);

    if (size <= deviceCount) {
      // Assign each process its own device
      deviceId = rank;
    }
    else {
      // Not enough devices: fall back to device 0 for all processes
      deviceId = 0;
    }
    cudaSetDevice(deviceId);
    
    // Print message from every process
    printf("CPU process %d of %d is using device %d\n", rank, size, deviceId);

    // Launch CUDA kernel
    double somevalue = 3.1416;
    printFromDevice<<<1, 1>>>(deviceId, somevalue);
    cudaDeviceSynchronize();

    // Finalize MPI
    MPI_Finalize();

    return 0;
}

I compile the above example using CMake v3.26.3 with:

cmake_minimum_required(VERSION 3.18)
project(hellow_MPI_CUDA LANGUAGES CXX CUDA)

# Find MPI package
find_package(MPI REQUIRED)

# Add executable (main.cu is compiled by the CUDA toolchain,
# enabled via the CUDA language above instead of the deprecated FindCUDA module)
add_executable(hellow main.cu)

# Link libraries (the imported target carries MPI include paths and flags)
target_link_libraries(hellow PRIVATE MPI::MPI_CXX)

Running this example as:

$ mpirun -np 2 ./hellow

produces the following output:

CPU process 0 of 2 is using device 0
CPU process 1 of 2 is using device 1
Hello World from GPU! I'm device 0 with value 3.1416
Hello World from GPU! I'm device 1 with value 3.1416

Cheers.

How can you tell from the output whether the value was copied at compile time or at runtime?

To be quite frank, I assumed it was copied at compile time because I didn’t get any error. I had only seen this behavior when using class-template private member values, so I thought it must be related to some implicit mechanism during compilation.

Given that the machine a CUDA program is compiled on does not even need to have a GPU, a compile-time copy is not possible.

Generally speaking, for function calls where some arguments are compile-time constants, the compiler may create function clones in which one or several of those constants have been propagated into the called function. I am not aware that the CUDA compiler applies such an optimization to __global__ functions, but off-hand I do not see a reason it could not do so.
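As an aside, the one way to be certain a value is baked into the kernel at compile time is to make it a template parameter; an ordinary by-value parameter always travels with the launch at runtime. A minimal sketch of the contrast (names are illustrative):

#include <cstdio>

// The template argument is genuinely fixed when nvcc runs:
// it becomes part of the generated kernel code.
template <int N>
__global__ void compileTimeKernel() {
    printf("N baked into this kernel: %d\n", N);
}

// A by-value parameter is transferred with each launch,
// so it can differ from launch to launch.
__global__ void runtimeKernel(int n) {
    printf("n received at launch time: %d\n", n);
}

int main() {
    compileTimeKernel<42><<<1, 1>>>();   // 42 fixed at compile time
    int n = 42;                          // could come from user input
    runtimeKernel<<<1, 1>>>(n);
    cudaDeviceSynchronize();
    return 0;
}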

Let me test that rather quickly … I’ll re-use the above snippet and change the somevalue double to be randomly generated at runtime.

#include <mpi.h>
#include <iostream>
#include <random>
#include <cuda_runtime.h>

// Run with: mpirun -np 2 ./hellow

// CUDA kernel function
__global__ void printFromDevice(int deviceId, double somevalue) {
    printf("Hello World from GPU! I'm device %d with value %g\n", deviceId, somevalue);
}

int main(int argc, char** argv) {

    // Initialize MPI
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Check if there are enough devices
    int deviceCount, deviceId;
    cudaGetDeviceCount(&deviceCount);

    if (size <= deviceCount) {
      // Assign each process its own device
      deviceId = rank;
    }
    else {
      // Not enough devices: fall back to device 0 for all processes
      deviceId = 0;
    }
    cudaSetDevice(deviceId);
    
    // Print message from every process
    printf("CPU process %d of %d is using device %d\n", rank, size, deviceId);

    // Generate random number
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(0, 1);
    double somevalue = dis(gen);

    // Launch CUDA kernel
    printFromDevice<<<1, 1>>>(deviceId, somevalue);
    cudaDeviceSynchronize();

    // Finalize MPI
    MPI_Finalize();

    return 0;
}

I compiled and ran the snippet … on my laptop with a single GPU I get:

$ mpirun -np 4 ./hellow
CPU process 3 of 4 is using device 0
CPU process 1 of 4 is using device 0
CPU process 2 of 4 is using device 0
CPU process 0 of 4 is using device 0
Hello World from GPU! I'm device 0 with value 0.469853
Hello World from GPU! I'm device 0 with value 0.887794
Hello World from GPU! I'm device 0 with value 0.996262
Hello World from GPU! I'm device 0 with value 0.473506

This still works … how come?

I think I get it now: what I’m doing is passing a single argument by value.

This reply hinted at what is going on.

(My apologies for my ignorance. All this time I thought that one had to pass these parameters via pointers; I was not aware that I could simply use this route.)
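For comparison, here is a sketch of the two routes side by side: the explicit cudaMalloc/cudaMemcpy route that is required for arrays, and the by-value route that suffices for scalars (function names here are made up for illustration):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void useScalar(double v) {               // copied with the launch
    printf("scalar: %g\n", v);
}

__global__ void useArray(const double* a, int n) {  // needs a device pointer
    for (int i = 0; i < n; ++i) printf("a[%d] = %g\n", i, a[i]);
}

int main() {
    // Route 1: scalars go by value -- no explicit transfer needed.
    useScalar<<<1, 1>>>(3.1416);

    // Route 2: arrays must be copied explicitly into device memory.
    double h[3] = {1.0, 2.0, 3.0};
    double* d = nullptr;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    useArray<<<1, 1>>>(d, 3);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}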

Can someone point me to where I can find more details in the documentation?

That is a nice thing about the runtime API: how well it is integrated with C++.

Value parameters are useful for data that is the same for all threads.
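Small trivially-copyable structs work the same way. As a sketch (the struct and field names are invented), a bundle of launch-wide constants can also go by value, and every thread sees the same copy:

#include <cstdio>

// Hypothetical parameter bundle -- plain old data, so it is a valid
// __global__ parameter and is copied with the launch.
struct SimParams {
    double dt;
    double gravity;
    int    steps;
};

__global__ void simulate(SimParams p) {
    // Every thread receives the same copy of p.
    printf("thread %d: dt=%g g=%g steps=%d\n",
           threadIdx.x, p.dt, p.gravity, p.steps);
}

int main() {
    SimParams p{1e-3, 9.81, 1000};
    simulate<<<1, 4>>>(p);   // no cudaMemcpy needed for p
    cudaDeviceSynchronize();
    return 0;
}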

See the link here to the __global__ Function Parameters section:

__global__ function parameters are passed to the device via constant memory and are limited to 32,764 bytes starting with Volta […].

The documentation then describes which parameter types are valid and how they are copied differently from normal C++ objects.
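One practical consequence of that limit: the combined size of all kernel parameters must fit in the 32,764-byte budget, so large blobs of data still have to travel through device memory. A sketch of guarding against this at compile time (the struct names are illustrative):

#include <cstddef>

struct BigBlob {
    double data[5000];   // 40,000 bytes -- too large to pass by value
};

// Passing BigBlob by value would exceed the parameter-space limit,
// so it goes through a device pointer instead.
__global__ void processBlob(const BigBlob* blob) { /* ... */ }

// A compile-time guard for structs you do intend to pass by value:
struct SmallParams { double a, b; int n; };
static_assert(sizeof(SmallParams) <= 32764,
              "kernel parameter struct exceeds the parameter-space limit");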

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.