CUDA runtime multi-architecture cubin loading

Hello everybody,

We are trying to test dynamic cubin loading by the CUDA runtime, i.e. having it select the cubin that matches the current GPU architecture.

We have one CUDA accelerated algorithm which is implemented in two versions under the same function signature: one with dynamic parallelism, one without.

We use conditional compilation (__CUDA_ARCH__) to generate two CUDA binaries that realize the same function with different bodies: one without dynamic parallelism targeting the K2000 (compute_20), and one with dynamic parallelism targeting the K2200 (compute_35).

The code executes correctly on the K2000, loading the non-dynamic-parallelism binary. However, when executing on the K2200, the CUDA runtime does not load the dynamic-parallelism binary as expected; it always loads the same non-dynamic-parallelism binary.

Below is the test environment we used:

Environment: VS2013
Code Generation: compute_20,sm_20;compute_35,sm_35
GPUs under test: Quadro K2200 (compute capability 5.0), Quadro K2000 (compute capability 3.0)
NVCC Compilation type: Generate hybrid object file (--compile)

The generated files appear consistent:

fast_segment.compute_20.cpp1.ii, fast_segment.compute_35.cpp1.ii
fast_segment.compute_20.sm_20.cubin, fast_segment.compute_35.sm_35.cubin
fast_segment.fatbin

Could you give us some hints on how to make the dynamic loading work? Or are we misunderstanding how it is supposed to work?

Thank you very much,

Guangye

It would likely help to post a minimal but complete example (e.g. an app that uses a single trivial kernel) that others can build and run. It is not at all clear to me what the current code is actually doing. Presumably you are using the CUDA driver API rather than the CUDA runtime?

Question: Given that the Quadro K2000 is an sm_30 device, and the Quadro K2200 is an sm_50 device, why is the code being built for compute_20 and compute_35, respectively, instead of the matching virtual architectures?

Dear njuffa,

Thank you for the quick reply!

Yes, I will prepare a trivial kernel that represents my situation.

As for the virtual architecture setting, we were a bit lazy. compute_20 is the default CUDA 7.5 setting under VS2013, while compute_35 is the minimum requirement for using dynamic parallelism.

One question: how can we make sure that we use the CUDA runtime instead of the CUDA driver API? We did include cudart.lib in the link stage.

Guangye

The two APIs use different prefixes for symbol names: cu* is a symbol from the driver API, cuda* is a symbol from the runtime API. Since you link cudart.lib, it would appear that you are using the CUDA runtime API. I am not sure what you mean by “dynamic cubin loading” in that context. That is why I expect a runnable example app to provide clarification.
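For illustration, this is roughly what the difference looks like in code. The driver API loads cubins/modules explicitly, while the runtime API loads the fatbinary embedded in the executable implicitly. This is only a sketch: context creation, error checking, and name mangling are glossed over, and the file and kernel names are taken from the post for illustration.

/* Driver API (cu* prefix): the application loads cubins explicitly. */
CUmodule module;
CUfunction kernel;
cuInit(0);
/* ... create and set a context here ... */
cuModuleLoad(&module, "fast_segment.compute_35.sm_35.cubin");
cuModuleGetFunction(&kernel, module, "dym_kernel_300");

/* Runtime API (cuda* prefix): the embedded fatbinary is loaded implicitly,
   and the runtime picks the best matching SASS/PTX by itself. */
int *arch_d;
cudaMalloc(&arch_d, sizeof(int));
dym_kernel_300<<<1, 1>>>(arch_d);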

I would suggest not relying on JIT compilation for the time being, and instead building a classic fat binary which contains the machine code (SASS) for the two GPU architectures (sm_30, sm_50) you are targeting. Reducing the complexity of the setup in this fashion will likely make diagnosing your observations easier.
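For example, an nvcc invocation along these lines (the source file name is taken from the post; your project settings may differ) embeds SASS for both targets in the fat binary:

nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_50,code=sm_50 --compile fast_segment.cu

At load time the runtime then simply selects the SASS image matching the installed GPU, and no JIT step is involved.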

Hello njuffa,

I found a mistake we made while preparing the trivial kernel.

We mistakenly put the __CUDA_ARCH__ condition around a host function. The host function resides inside a .cu file, which made me believe that the __CUDA_ARCH__ condition was valid there.

I list the wrong implementation and the correction below.

Thank you for your help!

Guangye


main.cpp

#include <cstdio>

int test_cuda1(void); /* implemented in test_cuda.cu */

int main(void)
{
    int a = test_cuda1();

    printf("GPU architecture: %d\n", a);

    return 0;
}


test_cuda.cu (wrong)

__global__ void dym_kernel_300(int *architecture)
{
    *architecture = 300;
}

__global__ void dym_kernel_500(int *architecture)
{
    *architecture = 500;
}

#if (__CUDA_ARCH__ <= 300) /* always true in host code: __CUDA_ARCH__ is undefined there and evaluates to 0 */
int test_cuda1(void)
{
    int *arch_d;
    int arch_h;

    cudaMalloc(&arch_d, sizeof(int));

    dym_kernel_300<<<1, 1>>>(arch_d);

    cudaMemcpy(&arch_h, arch_d, sizeof(int), cudaMemcpyDeviceToHost);

    return arch_h;
}
#else
int test_cuda1(void)
{
    int *arch_d;
    int arch_h;

    cudaMalloc(&arch_d, sizeof(int));

    dym_kernel_500<<<1, 1>>>(arch_d);

    cudaMemcpy(&arch_h, arch_d, sizeof(int), cudaMemcpyDeviceToHost);

    return arch_h;
}
#endif


test_cuda.cu (corrected)

__global__ void dym_kernel_300_500(int *architecture)
{
/* __CUDA_ARCH__ is defined only during the device compilation passes,
   so the branch is now resolved per target architecture as intended */
#if (__CUDA_ARCH__ == 300)
    *architecture = 300;
#elif (__CUDA_ARCH__ == 500)
    *architecture = 500;
#endif
}

int test_cuda1(void)
{
    int *arch_d;
    int arch_h;

    cudaMalloc(&arch_d, sizeof(int));

    dym_kernel_300_500<<<1, 1>>>(arch_d);

    cudaMemcpy(&arch_h, arch_d, sizeof(int), cudaMemcpyDeviceToHost);

    return arch_h;
}
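One caveat about the snippets above: none of the CUDA calls or the kernel launch are checked for errors, so a launch that fails (e.g. because the fat binary contains no image for the installed device) would go unnoticed and arch_h would contain garbage. A minimal sketch of adding such checks, assuming the same variables as above:

cudaError_t err;

dym_kernel_300_500<<<1, 1>>>(arch_d);
err = cudaGetLastError(); /* catches launch errors such as a missing image for this architecture */
if (err != cudaSuccess) {
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
}

err = cudaMemcpy(&arch_h, arch_d, sizeof(int), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
    printf("memcpy failed: %s\n", cudaGetErrorString(err));
}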

Glad to read you figured it out. I always prepend comparisons with __CUDA_ARCH__ with a check that it is actually defined, e.g.

#if defined( __CUDA_ARCH__ ) && ( __CUDA_ARCH__ < 300 )
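To make the effect of that guard concrete, a minimal sketch (the kernel name and values are illustrative):

__global__ void guarded_kernel(int *out)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 300)
    *out = 200; /* device code path for pre-sm_30 targets */
#else
    *out = 300; /* path for sm_30 and newer; this branch is also the one seen
                   during the host compilation pass, where __CUDA_ARCH__ is
                   undefined */
#endif
}

Without the defined() check, the host pass would silently satisfy (__CUDA_ARCH__ < 300), since an undefined macro evaluates to 0 in a preprocessor condition, which is exactly the mistake discussed above.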