Just wondering if anyone has done any multi-GPU applications involving heterogeneous GPUs of different compute capabilities? Specifically, I am trying to split my algorithm/data across a compute capability 1.3 device and a 2.0 device, compiling for each device's respective architecture (-arch=sm_13, -arch=sm_20) using FindCUDA.cmake. Anyone done something like this?
I haven’t looked into this too extensively, but hopefully this’ll help you get started:
In terms of compilation, I believe you have to make use of “CUDA_NVCC_FLAGS” in your CMakeLists.txt (from FindCUDA) in conjunction with the “-gencode” nvcc flag.
In terms of code, you can use the “__CUDA_ARCH__” macro in your device code.
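Something like this in your CMakeLists.txt, for example (a minimal sketch, assuming find_package(CUDA) has already been called):

# Build a fat binary containing code for each target architecture.
set(CUDA_NVCC_FLAGS
    -gencode arch=compute_11,code=sm_11
    -gencode arch=compute_20,code=sm_20
)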
Where you include another -gencode line for every architecture you want to compile for (I currently just use 11 and 20 because I don’t use any sm_13 features).
Then, in the code, you can use __CUDA_ARCH__ in device code to prevent compile errors:
__global__ void some_kernel(...)
{
    // do something that compiles on the lowest compute capability
}

__global__ void some_kernel_sm20(...)
{
#if __CUDA_ARCH__ >= 200
    // do something that uses compute 2.0 features
#endif
}
And then make a runtime decision based on info from cudaGetDeviceProperties to determine which kernel to call.
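Something along these lines on the host side (a rough sketch; error checking is omitted, and the launch configuration and kernel arguments are placeholders):

void launch_for_device(int device_id)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device_id);

    dim3 grid(1), block(256);  // placeholder launch configuration
    if (prop.major >= 2)
        some_kernel_sm20<<<grid, block>>>(/* args */);  // compute 2.0 path
    else
        some_kernel<<<grid, block>>>(/* args */);       // lowest-common-denominator path
}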
Another option is to break your sm_11 and sm_20 kernels into separate files and pass the appropriate arch flags to CUDA_COMPILE for each file (see the FindCUDA.cmake docs for details). This way you don’t need the #ifs, but it does require a bit more cmake code and well-thought-out file organization to manage. The other advantage is that you don’t end up with a compiled sm_20 version of some_kernel wasting disk space that will never be called.
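In cmake that might look something like this (a rough sketch; the file and variable names are hypothetical and I haven’t tested this exact snippet):

# Compile each kernel file only for the architecture that actually needs it.
cuda_compile(KERNELS_SM11_O kernels_sm11.cu OPTIONS -gencode arch=compute_11,code=sm_11)
cuda_compile(KERNELS_SM20_O kernels_sm20.cu OPTIONS -gencode arch=compute_20,code=sm_20)

# Link the per-architecture object files into the executable.
add_executable(myapp main.cpp ${KERNELS_SM11_O} ${KERNELS_SM20_O})
target_link_libraries(myapp ${CUDA_LIBRARIES})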