Compile-time architecture checking?

I would like code that is compiled with -arch=sm_13 (or better) if the GPU on the machine is capable of it. The code will run in single precision on an older GPU and in double precision on a GPU that supports it. The problem is that I need to pass the sm_13 flag only when a double-precision GPU is present, and no flag otherwise. The code is already prepared to switch precision based on a #define.

Has anyone found a way to detect which GPU is available at compile time (think: Makefile), and perhaps has a script for it?

I'm also open to other options. Could I always compile with the sm_13 flag and then check the device properties in the program? I'm not sure what happens if you compile with sm_13 and run on an older architecture.

Here’s my understanding of what’s going on in your case:

When building any CUDA app, you essentially compile two different codes with two different compilers (merged into the nvcc executable) into two different machine-level instruction sets: the CPU code and the GPU code. The exception is, of course, the degenerate case where your program contains no GPU code at all. Barring that, your binary may (and perhaps should) contain one CPU code and several GPU codes, compiled for all the essentially different GPU platforms you wish to support. Which of the GPU codes to use is decided at run time, and that decision is transparent to you.
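For example, to embed code for both a pre-double-precision GPU and an sm_13-class GPU in a single fat binary, the nvcc invocation might look roughly like the following (a sketch; the exact flags depend on your toolkit version, and myapp.cu is just a placeholder name):

nvcc -gencode arch=compute_10,code=sm_10 \
     -gencode arch=compute_13,code=sm_13 \
     -o myapp myapp.cu

The runtime then picks whichever embedded GPU code matches the card it finds.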

Regarding the GPU-based codes: the define that distinguishes between different generations of the target platform exists, and it is __CUDA_ARCH__. For example, to check for a Fermi-based GPU, my GPU-based code contains something like the following:

#if defined(__CUDA_ARCH__) && 200 <= __CUDA_ARCH__
#define RUNNING_ON_FERMI
#endif
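Applied to your original question, the same macro could drive the precision #define directly in the device code. A minimal sketch, assuming your code uses a typedef (here called real, a hypothetical name) for its floating-point type:

// 130 corresponds to compute capability 1.3, the first to support double precision.
#if defined(__CUDA_ARCH__) && 130 <= __CUDA_ARCH__
typedef double real; // double-precision-capable GPU
#else
typedef float real;  // older GPU: fall back to single precision
#endif

Note that __CUDA_ARCH__ is only defined during the device-code compilation passes, so host code including the same header would always see float; if host and device share buffers of type real, you have to keep the two sides in sync some other way.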

Regarding the CPU-based code: following the one-CPU-code, multiple-GPU-codes paradigm explained above, you should not have a static dependency on the GPU architecture in the CPU-based code; instead, you should make the decision at run time. My approach is to have a 1-block, 1-thread kernel that sets up variables depending on the features of the architecture. So my code contains something like the following:

struct DevicePreferences {
    int runInDoublePrecision;
    int preferCacheToShared;
    int numThreads; // Optimal number of threads to fire.
    // Other variables dependent on the GPU architecture.
};

__global__ void queryDevicePreferences(DevicePreferences* const pPreferences) {
    // Aliases. We assume that we're running in a 1D grid and block.
    const unsigned block = blockIdx.x,
                   thread = threadIdx.x;

    if (0 == block && 0 == thread) {
        // Use this thread to initialize the preferences.
        pPreferences->runInDoublePrecision =
#ifdef RUNNING_ON_FERMI
            1
#else
            0
#endif
            ;

        pPreferences->preferCacheToShared =
#ifdef RUNNING_ON_FERMI
            1
#else
            0
#endif
            ;

        // Set other preferences.
    }

    __syncthreads();
} // of queryDevicePreferences(..)

Then my CPU-based code invokes this kernel in the usual way when the executable starts up. This implies allocating a DevicePreferences on the device, running the kernel, retrieving the contents of the structure to the host, and freeing the structure on the device. Obviously, if your preferences are unrelated to each other, you may want to split this kernel into several and invoke them independently.
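For completeness, the host-side sequence might look roughly like this (a sketch using the names from above; error checking omitted):

DevicePreferences prefs;
DevicePreferences* dPrefs = NULL;

cudaMalloc((void**)&dPrefs, sizeof(DevicePreferences)); // allocate on the device
queryDevicePreferences<<<1, 1>>>(dPrefs);               // 1 block, 1 thread
cudaMemcpy(&prefs, dPrefs, sizeof(DevicePreferences),
           cudaMemcpyDeviceToHost);                     // retrieve the results
cudaFree(dPrefs);                                       // free the device copy

if (prefs.runInDoublePrecision) {
    // launch the double-precision kernels
} else {
    // launch the single-precision kernels
}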

I chose to set up the preferences within the device code in order to localize all the decision making about how the device code should run within the device code itself. This is not the only possibility: you can use cudaGetDeviceProperties(...) instead (see deviceQuery.cpp in the SDK), so that the decisions are made within the host code.
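For reference, the host-side alternative could look something like this (a sketch; compute capability 1.3 is the first to support double precision):

cudaDeviceProp prop;
int device = 0;

cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);

// Double precision requires compute capability 1.3 or higher.
int runInDoublePrecision = (prop.major > 1) ||
                           (prop.major == 1 && prop.minor >= 3);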

Hope this helps.