Just wondering if anyone has done any multi-GPU applications involving heterogeneous GPUs of different compute capabilities? Specifically, I am trying to split my algorithm/data across a compute capability 1.3 device and a 2.0 device, compiling for each device's respective architecture (-arch=sm_13, -arch=sm_20) using FindCUDA.cmake. Anyone done something like this?
I haven’t looked into this too extensively, but hopefully this’ll help you get started:
In terms of compilation, I believe you have to make use of “CUDA_NVCC_FLAGS” in your CMakeLists.txt (from FindCUDA) in conjunction with the “-gencode” nvcc flag.
In terms of code, you can use the “__CUDA_ARCH__” macro in your device code.
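Something like this in your CMakeLists.txt, for example (a minimal sketch, assuming find_package(CUDA) has already been called):

# Build a fat binary containing code for each target architecture.
set(CUDA_NVCC_FLAGS
    -gencode arch=compute_11,code=sm_11
    -gencode arch=compute_20,code=sm_20
)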
Where you include another -gencode line for every architecture you want to compile for (I currently just use 11 and 20 because I don’t use any sm_13 features).
Then, in the code, you can use __CUDA_ARCH__ in device code to prevent compile errors:
__global__ void some_kernel(...)
{
    // do something that compiles on the lowest compute capability
}

__global__ void some_kernel_sm20(...)
{
#if __CUDA_ARCH__ >= 200
    // do something that uses compute 2.0 features
#endif
}
And then make a runtime decision based on info from cudaGetDeviceProperties to determine which kernel to call.
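Something along these lines on the host side (a rough sketch; error checking is omitted, and the launch configuration and kernel arguments are placeholders):

void launch_for_device(int device_id)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device_id);

    dim3 grid(1), block(256);  // placeholder launch configuration
    if (prop.major >= 2)
        some_kernel_sm20<<<grid, block>>>(/* args */);  // compute 2.0 path
    else
        some_kernel<<<grid, block>>>(/* args */);       // lowest-common-denominator path
}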
Another option is to break your sm_11 and sm_20 kernels into separate files and pass the appropriate arch flags to CUDA_COMPILE for each file (see the FindCUDA.cmake docs for details). This way you don’t need the #ifs, but it does require a bit more cmake code and well-thought-out file organization to manage. The other advantage is that you don’t end up with a compiled sm_20 version of some_kernel wasting disk space that will never be called.
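In cmake that might look something like this (a rough sketch; the file and variable names are hypothetical and I haven’t tested this exact snippet):

# Compile each kernel file only for the architecture that actually needs it.
cuda_compile(KERNELS_SM11_O kernels_sm11.cu OPTIONS -gencode arch=compute_11,code=sm_11)
cuda_compile(KERNELS_SM20_O kernels_sm20.cu OPTIONS -gencode arch=compute_20,code=sm_20)

# Link the per-architecture object files into the executable.
add_executable(myapp main.cpp ${KERNELS_SM11_O} ${KERNELS_SM20_O})
target_link_libraries(myapp ${CUDA_LIBRARIES})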