I was about to start a new MPI-CUDA-C++17 project using the new CUDA 12.3 when, by accident, I noticed an interesting behavior when passing arguments to my kernel function directly from the host: simple scalar variables (int, float, and double) seem to be implicitly copied from host to device at compile time.
I was aware that structures and class objects were implicitly copied/initialized on the device at compile time, and that I only had to manage transfers of arrays of data explicitly.
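To make the distinction concrete, here is a minimal sketch of what I mean (the kernel and values are made up for illustration): the scalar goes straight into the launch, while the array has to be transferred explicitly with cudaMalloc/cudaMemcpy.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleArray(double factor, double* data, int n) {
    // 'factor' and 'n' arrive by value; 'data' must point to device memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 4;
    double h_data[n] = {1.0, 2.0, 3.0, 4.0};
    double* d_data = nullptr;

    // The array is transferred explicitly...
    cudaMalloc(&d_data, n * sizeof(double));
    cudaMemcpy(d_data, h_data, n * sizeof(double), cudaMemcpyHostToDevice);

    // ...while the scalar 2.0 is simply passed in the launch
    scaleArray<<<1, n>>>(2.0, d_data, n);

    cudaMemcpy(h_data, d_data, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    printf("%g %g %g %g\n", h_data[0], h_data[1], h_data[2], h_data[3]);
    return 0;
}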
Thus my questions:
Is this the expected behavior, or is it related to unified memory?
If this is the expected behavior, what other types of implicit transfers are available?
Lastly, can I rely on this behavior when developing MPI applications?
I have written a simple hello-world example to document my question:
#include <mpi.h>
#include <iostream>
#include <cuda_runtime.h>
// CUDA kernel function
__global__ void printFromDevice(int deviceId, double somevalue) {
    printf("Hello World from GPU! I'm device %d with value %g\n", deviceId, somevalue);
}

int main(int argc, char** argv) {
    // Initialize MPI
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Check if there are enough devices for one device per process
    int deviceCount, deviceId;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount >= size) {
        // Give each process its own device
        deviceId = rank;
    }
    else {
        // Otherwise all processes share device 0
        deviceId = 0;
    }
    cudaSetDevice(deviceId);

    // Print message from every process
    printf("CPU process %d of %d is using device %d\n", rank, size, deviceId);

    // Launch CUDA kernel
    double somevalue = 3.1416;
    printFromDevice<<<1, 1>>>(deviceId, somevalue);
    cudaDeviceSynchronize();

    // Finalize MPI
    MPI_Finalize();
    return 0;
}
I compile the above example using CMake v3.26.3 with the following CMakeLists.txt:
cmake_minimum_required(VERSION 3.10)
project(hellow_MPI_CUDA)
# Find MPI package
find_package(MPI REQUIRED)
include_directories(${MPI_INCLUDE_PATH})
# Find CUDA package
find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})
# Set compiler flags for MPI
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${MPI_CXX_COMPILE_FLAGS}")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${MPI_CXX_LINK_FLAGS}")
# Add executable
cuda_add_executable(hellow main.cu)
# Link libraries
target_link_libraries(hellow ${MPI_CXX_LIBRARIES} ${CUDA_LIBRARIES})
This example is run as:
$ mpirun -np 2 ./hellow
and produces the following output:
CPU process 0 of 2 is using device 0
CPU process 1 of 2 is using device 1
Hello World from GPU! I'm device 0 with value 3.1416
Hello World from GPU! I'm device 1 with value 3.1416
To be quite frank, I assumed the value was copied because I did not get any error. I had only seen this behavior when using private member values of class templates, so I thought it must be related to some implicit mechanism during compilation.
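For example, something along these lines is what I mean by template private member values showing up on the device (a made-up class, just for illustration):
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative example: the object, including its private member,
// is copied by value when passed as a kernel argument.
template <typename T>
class Holder {
    T value_;  // private member
public:
    __host__ __device__ explicit Holder(T v) : value_(v) {}
    __device__ T get() const { return value_; }
};

__global__ void printHolder(Holder<double> h) {
    printf("value inside kernel: %g\n", h.get());
}

int main() {
    Holder<double> h(2.718);
    printHolder<<<1, 1>>>(h);  // copied by value, just like a plain double
    cudaDeviceSynchronize();
    return 0;
}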
Given that the machine on which a CUDA program is compiled does not even need to have a GPU, a copy at compile time is not possible.
Generally speaking, for function calls where some arguments are compile-time constants, the compiler may create function clones into which one or several of those constants have been propagated. I am not aware that the CUDA compiler applies such an optimization to __global__ functions, but off-hand I do not see a reason it could not do so.
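As far as the mechanism goes, kernel arguments are gathered by the host-side runtime at launch time and copied to the device as part of the launch; the triple-chevron syntax is essentially lowered to a call to cudaLaunchKernel, which receives host addresses of the argument values. A rough single-process sketch of that equivalence (no MPI; an illustrative rewrite, not the compiler's exact output):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void printFromDevice(int deviceId, double somevalue) {
    printf("Hello World from GPU! I'm device %d with value %g\n", deviceId, somevalue);
}

int main() {
    int deviceId = 0;
    double somevalue = 3.1416;

    // Roughly what printFromDevice<<<1, 1>>>(deviceId, somevalue) turns into:
    // the runtime reads the argument values through these host pointers at
    // launch time and copies them to the device along with the launch.
    void* args[] = { &deviceId, &somevalue };
    cudaLaunchKernel((void*)printFromDevice, dim3(1), dim3(1), args, 0, nullptr);
    cudaDeviceSynchronize();
    return 0;
}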
#include <mpi.h>
#include <iostream>
#include <random>
#include <cuda_runtime.h>
// Run with: mpirun -np 2 ./hellow
// CUDA kernel function
__global__ void printFromDevice(int deviceId, double somevalue) {
    printf("Hello World from GPU! I'm device %d with value %g\n", deviceId, somevalue);
}

int main(int argc, char** argv) {
    // Initialize MPI
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Check if there are enough devices for one device per process
    int deviceCount, deviceId;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount >= size) {
        // Give each process its own device
        deviceId = rank;
    }
    else {
        // Otherwise all processes share device 0
        deviceId = 0;
    }
    cudaSetDevice(deviceId);

    // Print message from every process
    printf("CPU process %d of %d is using device %d\n", rank, size, deviceId);

    // Generate a random number at run time
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(0, 1);
    double somevalue = dis(gen);

    // Launch CUDA kernel
    printFromDevice<<<1, 1>>>(deviceId, somevalue);
    cudaDeviceSynchronize();

    // Finalize MPI
    MPI_Finalize();
    return 0;
}
I compiled and ran the snippet … (on my laptop with a single GPU) and I get:
$ mpirun -np 4 ./hellow
CPU process 3 of 4 is using device 0
CPU process 1 of 4 is using device 0
CPU process 2 of 4 is using device 0
CPU process 0 of 4 is using device 0
Hello World from GPU! I'm device 0 with value 0.469853
Hello World from GPU! I'm device 0 with value 0.887794
Hello World from GPU! I'm device 0 with value 0.996262
Hello World from GPU! I'm device 0 with value 0.473506
(My apologies for my ignorance. All this time I thought that one must pass these parameters by pointers; I was not aware that I could simply use this route.)
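In other words, I assumed a scalar had to take the same detour as an array, something like this made-up version (which works, but is apparently unnecessary for a plain double):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void printValue(const double* somevalue) {
    // dereference a device pointer instead of receiving the value directly
    printf("value: %g\n", *somevalue);
}

int main() {
    double h_value = 3.1416;
    double* d_value = nullptr;

    // the detour I assumed was mandatory for every parameter
    cudaMalloc(&d_value, sizeof(double));
    cudaMemcpy(d_value, &h_value, sizeof(double), cudaMemcpyHostToDevice);

    printValue<<<1, 1>>>(d_value);
    cudaDeviceSynchronize();
    cudaFree(d_value);
    return 0;
}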
Can someone point me to where I can find more details in the documentation?