CUDA runtime multi-architecture cubin loading

Hello everybody,

We are trying to test dynamic cubin loading by the CUDA runtime, i.e. having it select the cubin that matches the current GPU architecture.

We have one CUDA accelerated algorithm which is implemented in two versions under the same function signature: one with dynamic parallelism, one without.

We use conditional compilation (__CUDA_ARCH__) to generate two CUDA binaries that realize the same function with different bodies: one without dynamic parallelism targeting the K2000 (compute_20), and one with dynamic parallelism targeting the K2200 (compute_35).

The code executes correctly on the K2000, loading the non-dynamic-parallelism binary. However, when executing on the K2200, the CUDA runtime does not load the dynamic-parallelism binary as expected; it always loads the same non-dynamic-parallelism binary.

Below is the test environment we used:

Environment: VS2013
Code Generation: compute_20,sm_20;compute_35,sm_35
GPUs under test: Quadro K2200 (compute capability 5.0), Quadro K2000 (compute capability 3.0)
NVCC Compilation type: Generate hybrid object file (--compile)

The generated files appear consistent:

fast_segment.compute_20.cpp1.ii, fast_segment.compute_35.cpp1.ii
fast_segment.compute_20.sm_20.cubin, fast_segment.compute_35.sm_35.cubin
fast_segment.fatbin

Could you give us some hints on how to make the dynamic loading work? Or are we misunderstanding how it is supposed to work?

Thank you very much,

Guangye

It would likely help to post a minimal but complete example (e.g. an app that uses a single trivial kernel) that others can build and run. It is not at all clear to me what the current code is actually doing. Presumably you are using the CUDA driver API rather than the CUDA runtime?

Question: Given that the Quadro K2000 is an sm_30 device, and the Quadro K2200 is an sm_50 device, why is the code being built for compute_20 and compute_35, respectively, instead of the matching virtual architectures?

Dear njuffa,

Thank you for the quick reply!

Yes, I will prepare a trivial kernel that represents my situation.

As for the virtual architecture setting, we were a bit lazy. compute_20 is the default CUDA 7.5 setting under VS2013, while compute_35 is the minimum requirement for using dynamic parallelism.

One question: how can we make sure that we use the CUDA runtime instead of the CUDA driver API? We did include cudart.lib in the link stage.

Guangye

The two APIs use different prefixes for symbol names: cu* is a symbol from the driver API, cuda* is a symbol from the runtime API. Since you link cudart.lib, it would appear that you are using the CUDA runtime API. I am not sure what you mean by “dynamic cubin loading” in that context. That is why I expect a runnable example app to provide clarification.
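For illustration, this is roughly what the difference looks like in code. The driver API loads cubins/modules explicitly, while the runtime API loads the fatbinary embedded in the executable implicitly. This is only a sketch: context creation, error checking, and name mangling are glossed over, and the file and kernel names are taken from the post for illustration.

/* Driver API (cu* prefix): the application loads cubins explicitly. */
CUmodule module;
CUfunction kernel;
cuInit(0);
/* ... create and set a context here ... */
cuModuleLoad(&module, "fast_segment.compute_35.sm_35.cubin");
cuModuleGetFunction(&kernel, module, "dym_kernel_300");

/* Runtime API (cuda* prefix): the embedded fatbinary is loaded implicitly,
   and the runtime picks the best matching SASS/PTX by itself. */
int *arch_d;
cudaMalloc(&arch_d, sizeof(int));
dym_kernel_300<<<1, 1>>>(arch_d);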

I would suggest not relying on JIT compilation for the time being, and instead building a classic fat binary which contains the machine code (SASS) for the two GPU architectures (sm_30, sm_50) you are targeting. Reducing the complexity of the setup in this fashion will likely make diagnosing your observations easier.
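For example, an nvcc invocation along these lines (the source file name is taken from the post; your project settings may differ) embeds SASS for both targets in the fat binary:

nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_50,code=sm_50 --compile fast_segment.cu

At load time the runtime then simply selects the SASS image matching the installed GPU, and no JIT step is involved.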

Hello njuffa,

I found a mistake we made while preparing the trivial kernel.

We mistakenly put the __CUDA_ARCH__ condition around a host function. The host function resides inside a .cu file, which made me believe that the __CUDA_ARCH__ condition was valid there.

I list the wrong implementation and the correction below.

Thank you for your help!

Guangye


main.cpp

#include <cstdio>

int test_cuda1(void); /* implemented in test_cuda.cu */

int main(void)
{
    int a = test_cuda1();

    printf("GPU architecture: %d\n", a);

    return 0;
}


test_cuda.cu (wrong)

__global__ void dym_kernel_300(int *architecture)
{
    *architecture = 300;
}

__global__ void dym_kernel_500(int *architecture)
{
    *architecture = 500;
}

#if (__CUDA_ARCH__ <= 300) /* always true in host code: __CUDA_ARCH__ is undefined there and evaluates to 0 */
int test_cuda1(void)
{
    int *arch_d;
    int arch_h;

    cudaMalloc(&arch_d, sizeof(int));

    dym_kernel_300<<<1, 1>>>(arch_d);

    cudaMemcpy(&arch_h, arch_d, sizeof(int), cudaMemcpyDeviceToHost);

    return arch_h;
}
#else
int test_cuda1(void)
{
    int *arch_d;
    int arch_h;

    cudaMalloc(&arch_d, sizeof(int));

    dym_kernel_500<<<1, 1>>>(arch_d);

    cudaMemcpy(&arch_h, arch_d, sizeof(int), cudaMemcpyDeviceToHost);

    return arch_h;
}
#endif


test_cuda.cu (corrected)

__global__ void dym_kernel_300_500(int *architecture)
{
/* __CUDA_ARCH__ is defined only during the device compilation passes,
   so the branch is now resolved per target architecture as intended */
#if (__CUDA_ARCH__ == 300)
    *architecture = 300;
#elif (__CUDA_ARCH__ == 500)
    *architecture = 500;
#endif
}

int test_cuda1(void)
{
    int *arch_d;
    int arch_h;

    cudaMalloc(&arch_d, sizeof(int));

    dym_kernel_300_500<<<1, 1>>>(arch_d);

    cudaMemcpy(&arch_h, arch_d, sizeof(int), cudaMemcpyDeviceToHost);

    return arch_h;
}
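One caveat about the snippets above: none of the CUDA calls or the kernel launch are checked for errors, so a launch that fails (e.g. because the fat binary contains no image for the installed device) would go unnoticed and arch_h would contain garbage. A minimal sketch of adding such checks, assuming the same variables as above:

cudaError_t err;

dym_kernel_300_500<<<1, 1>>>(arch_d);
err = cudaGetLastError(); /* catches launch errors such as a missing image for this architecture */
if (err != cudaSuccess) {
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
}

err = cudaMemcpy(&arch_h, arch_d, sizeof(int), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
    printf("memcpy failed: %s\n", cudaGetErrorString(err));
}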

Glad to read you figured it out. I always prepend comparisons with __CUDA_ARCH__ with a check that it is actually defined, e.g.

#if defined( __CUDA_ARCH__ ) && ( __CUDA_ARCH__ < 300 )
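To make the effect of that guard concrete, a minimal sketch (the kernel name and values are illustrative):

__global__ void guarded_kernel(int *out)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 300)
    *out = 200; /* device code path for pre-sm_30 targets */
#else
    *out = 300; /* path for sm_30 and newer; this branch is also the one seen
                   during the host compilation pass, where __CUDA_ARCH__ is
                   undefined */
#endif
}

Without the defined() check, the host pass would silently satisfy (__CUDA_ARCH__ < 300), since an undefined macro evaluates to 0 in a preprocessor condition, which is exactly the mistake discussed above.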