New CUDA 4.0 Driver API kernel launch call: how does it work with templated kernels?

I’ve been looking at the new CUDA 4.0 Driver API Kernel launch call in the matrixMulDrv example in the NVIDIA GPU Computing SDK 4.0 /C/src folder. The call is in matrixMulDrv.cpp on line 168.

void *args[5] = { &d_C, &d_A, &d_B, &Matrix_Width_A, &Matrix_Width_B };

cutilDrvSafeCallNoSync(cuLaunchKernel(matrixMul, grid.x, grid.y, grid.z,
                                      block.x, block.y, block.z,
                                      2*block_size*block_size*sizeof(float),
                                      NULL, args, NULL));

One thing I’ve noticed is that the kernel matrixMul is templated. It’s defined as

template <int block_size, typename size_type> __device__ void
matrixMul(float* C, float* A, float* B, size_type wA, size_type wB)

in matrixMul_kernel.cu on line 34.

I can’t figure out how the template parameters int block_size and typename size_type get their values from the main code.

The code works fine in the example, but when I try to integrate the call into my own code with a templated kernel, it doesn’t work.

Does anyone know how the new kernel call works, and how to use templated kernels with it?

There are multiple instantiations of that template at the end of the file matrixMul_kernel.cu:

// C wrappers around our template kernel
extern "C" __global__ void matrixMul_bs16_32bit(float* C, float* A, float* B, int wA, int wB)
{
    matrixMul<16, int>(C, A, B, wA, wB);
}

[...]
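
On the host side, the driver API then looks each wrapper up by its plain, unmangled extern "C" name and launches that. Roughly, the SDK does something like the following (paraphrased from matrixMulDrv.cpp, so the exact variable names may differ):

// The extern "C" wrapper has an unmangled name, so the driver API can find
// it in the loaded module by a plain string lookup.
CUfunction matrixMul;
cutilDrvSafeCallNoSync(cuModuleGetFunction(&matrixMul, cuModule, "matrixMul_bs16_32bit"));

// block_size and size_type are fixed at compile time inside the wrapper
// (16 and int here); cuLaunchKernel only ever sees the ordinary runtime
// arguments passed through args.
void *args[5] = { &d_C, &d_A, &d_B, &Matrix_Width_A, &Matrix_Width_B };
cutilDrvSafeCallNoSync(cuLaunchKernel(matrixMul, grid.x, grid.y, grid.z,
                                      block.x, block.y, block.z,
                                      2*block_size*block_size*sizeof(float),
                                      NULL, args, NULL));

In other words, the template parameters are never passed at launch time at all: each extern "C" wrapper bakes in one concrete combination, and the host simply picks the wrapper it wants by name.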

Does it mean the advantage of writing templates is completely lost in the Driver API, since one cannot launch a “templated” kernel as in the Runtime API?

No, it just means you need to use the C++ mangled name of the function corresponding to whichever template instance you want to run.
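
If the kernel template is itself __global__ (rather than a __device__ helper wrapped as in the SDK), one way to obtain that mangled name is to force an explicit instantiation and read the entry name out of the generated PTX. A minimal sketch with a made-up kernel (nothing below comes from the SDK or CUDPP):

// kernel.cu -- hypothetical example
template <typename T>
__global__ void scaleKernel(T *data, T factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Explicit instantiation so the <float> version is actually emitted.
template __global__ void scaleKernel<float>(float *, float, int);

// Compiling with "nvcc -ptx kernel.cu" produces a .ptx file whose .entry
// directive carries the mangled name of this instantiation. Copy that string
// verbatim into the driver API lookup on the host:
//
//     CUfunction fn;
//     cuModuleGetFunction(&fn, cuModule, "<mangled name from the .ptx>");

The extern "C" wrappers used by the SDK avoid mangled names entirely, at the cost of writing one wrapper per instantiation you need.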

But where can I find this mangled name? To give some more detail on my current problem: I wrote a multi-threaded DLL using OpenMP and the CUDA Driver API, and I now need to sort large arrays of numbers. I thought I would use Thrust or CUDPP, then discovered that they are built on the CUDA Runtime API, which makes it impossible to use them in my code (maybe I am wrong, but that is what I read in many places). So I decided to port CUDPP to the CUDA Driver API, and in that code I found calls like this one:

switch (traitsCode)
{
case 0: // single block, single row, non-full last block
    segmentedScan4<T, SegmentedScanTraits<T, op, isBackward, isExclusive,
                                          doShiftFlagsLeft, false, false, false> >
        <<< grid, threads, sharedMemSize >>>
        (d_out, d_idata, d_iflags, numElements, 0, 0, 0);
    break;

which is a doubly templated kernel launch, making it difficult to run with cuLaunchKernel. Do I need to instantiate all the possible cases in order to make it work, or is there some mangled-name trick I can use to get the correct kernel running?
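
A rough sketch of one possible route, following the SDK's pattern (every name below is an invented placeholder, not the actual CUDPP code): since segmentedScan4 is itself __global__ and therefore cannot be called from another kernel, its body would first have to move into a __device__ template, after which extern "C" wrappers with fixed template arguments can be looked up by name:

// Hypothetical refactor: kernel body moved into a __device__ template.
template <class T, class Traits>
__device__ void segmentedScan4Body(T *d_out, const T *d_idata,
                                   const unsigned int *d_iflags,
                                   int numElements, int a, int b, int c)
{
    /* ... original kernel body ... */
}

// One extern "C" wrapper per template combination the host dispatches to,
// each with a predictable unmangled name for cuModuleGetFunction.
extern "C" __global__ void
segmentedScan4_float_case0(float *d_out, const float *d_idata,
                           const unsigned int *d_iflags, int numElements,
                           int a, int b, int c)
{
    segmentedScan4Body<float, SegmentedScanTraits<float, OperatorAdd<float>,
                       false, true, false, false, false, false> >
        (d_out, d_idata, d_iflags, numElements, a, b, c);
}

The host-side switch on traitsCode would then pick the matching wrapper name, fetch it with cuModuleGetFunction, and launch it with cuLaunchKernel as in the matrixMul example.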

Thx