New CUDA 4.0 Driver API kernel launch call: how does it work with templated kernels?

I’ve been looking at the new CUDA 4.0 Driver API Kernel launch call in the matrixMulDrv example in the NVIDIA GPU Computing SDK 4.0 /C/src folder. The call is in matrixMulDrv.cpp on line 168.

void *args[5] = { &d_C, &d_A, &d_B, &Matrix_Width_A, &Matrix_Width_B };

cutilDrvSafeCallNoSync(cuLaunchKernel(matrixMul, grid.x, grid.y, grid.z,
                                      block.x, block.y, block.z,
                                      2*block_size*block_size*sizeof(float),
                                      NULL, args, NULL));

One thing I’ve noticed is that the kernel matrixMul is templated. It’s defined as

template <int block_size, typename size_type> __device__ void
matrixMul(float* C, float* A, float* B, size_type wA, size_type wB)

in matrixMul_kernel.cu on line 34.

I can’t figure out how the template parameters int block_size and typename size_type get their values from the main code.

The code works fine in the example, but when I try to integrate the call into my own code with a templated kernel, it doesn’t work.

Does anyone know how the new kernel call works, and how to use templated kernels with it?

There are multiple instantiations of that template at the end of the file matrixMul_kernel.cu:

// C wrappers around our template kernel
extern "C" __global__ void matrixMul_bs16_32bit(float* C, float* A, float* B, int wA, int wB)
{
    matrixMul<16, int>(C, A, B, wA, wB);
}

[...]
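
On the host side, the driver API then looks each wrapper up by its plain, unmangled extern "C" name and launches that. Roughly, the SDK does something like the following (paraphrased from matrixMulDrv.cpp, so the exact variable names may differ):

// The extern "C" wrapper has an unmangled name, so the driver API can find
// it in the loaded module by a plain string lookup.
CUfunction matrixMul;
cutilDrvSafeCallNoSync(cuModuleGetFunction(&matrixMul, cuModule, "matrixMul_bs16_32bit"));

// block_size and size_type are fixed at compile time inside the wrapper
// (16 and int here); cuLaunchKernel only ever sees the ordinary runtime
// arguments passed through args.
void *args[5] = { &d_C, &d_A, &d_B, &Matrix_Width_A, &Matrix_Width_B };
cutilDrvSafeCallNoSync(cuLaunchKernel(matrixMul, grid.x, grid.y, grid.z,
                                      block.x, block.y, block.z,
                                      2*block_size*block_size*sizeof(float),
                                      NULL, args, NULL));

In other words, the template parameters are never passed at launch time at all: each extern "C" wrapper bakes in one concrete combination, and the host simply picks the wrapper it wants by name.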

Does it mean the advantage of writing templates is completely lost in the Driver API, since one cannot launch a “templated” kernel as in the Runtime API?

No, it just means you need to use the C++ mangled name of the function corresponding to whichever template instance you want to run.
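
If the kernel template is itself __global__ (rather than a __device__ helper wrapped as in the SDK), one way to obtain that mangled name is to force an explicit instantiation and read the entry name out of the generated PTX. A minimal sketch with a made-up kernel (nothing below comes from the SDK or CUDPP):

// kernel.cu -- hypothetical example
template <typename T>
__global__ void scaleKernel(T *data, T factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Explicit instantiation so the <float> version is actually emitted.
template __global__ void scaleKernel<float>(float *, float, int);

// Compiling with "nvcc -ptx kernel.cu" produces a .ptx file whose .entry
// directive carries the mangled name of this instantiation. Copy that string
// verbatim into the driver API lookup on the host:
//
//     CUfunction fn;
//     cuModuleGetFunction(&fn, cuModule, "<mangled name from the .ptx>");

The extern "C" wrappers used by the SDK avoid mangled names entirely, at the cost of writing one wrapper per instantiation you need.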

But where can I find this mangled name? To give some more detail on my current problem: I wrote a multi-threaded DLL using OpenMP and the CUDA Driver API, and I now need to sort large arrays of numbers. I thought I would use Thrust or CUDPP, then discovered that they are built on the CUDA Runtime API, which makes it impossible to use them in my code (maybe I am wrong, but that is what I read in many places). So I decided to port CUDPP to the CUDA Driver API, and in that code I found calls like this one:

switch (traitsCode)
{
case 0: // single block, single row, non-full last block
    segmentedScan4<T, SegmentedScanTraits<T, op, isBackward, isExclusive,
                                          doShiftFlagsLeft, false, false, false> >
        <<< grid, threads, sharedMemSize >>>
        (d_out, d_idata, d_iflags, numElements, 0, 0, 0);
    break;

which is a doubly templated kernel launch, making it difficult to run with cuLaunchKernel. Do I need to instantiate all the possible cases in order to make it work, or is there some mangled-name trick I can use to get the correct kernel running?
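
A rough sketch of one possible route, following the SDK's pattern (every name below is an invented placeholder, not the actual CUDPP code): since segmentedScan4 is itself __global__ and therefore cannot be called from another kernel, its body would first have to move into a __device__ template, after which extern "C" wrappers with fixed template arguments can be looked up by name:

// Hypothetical refactor: kernel body moved into a __device__ template.
template <class T, class Traits>
__device__ void segmentedScan4Body(T *d_out, const T *d_idata,
                                   const unsigned int *d_iflags,
                                   int numElements, int a, int b, int c)
{
    /* ... original kernel body ... */
}

// One extern "C" wrapper per template combination the host dispatches to,
// each with a predictable unmangled name for cuModuleGetFunction.
extern "C" __global__ void
segmentedScan4_float_case0(float *d_out, const float *d_idata,
                           const unsigned int *d_iflags, int numElements,
                           int a, int b, int c)
{
    segmentedScan4Body<float, SegmentedScanTraits<float, OperatorAdd<float>,
                       false, true, false, false, false, false> >
        (d_out, d_idata, d_iflags, numElements, a, b, c);
}

The host-side switch on traitsCode would then pick the matching wrapper name, fetch it with cuModuleGetFunction, and launch it with cuLaunchKernel as in the matrixMul example.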

Thx