How to use cudaLaunchKernel in CUDA 7.0

I have a multi-GPU code. One class handles the partition (domain) and hides all the multi-GPU logic. Another file contains the computation algorithm.

So I need to pass the kernel (the algorithm) as an argument to a function of the domain class. It compiles without any warning or error, but at execution time it crashes in the function cudaLaunchKernel. All the allocations are fine, and so are the argument values at call time. I've checked this with the cuda-gdb debugger, with these options enabled: launch blocking, memset and api_failures=stop_all.

The command to compile is: nvcc --compiler-options -Wall,-Wextra -arch=sm_35 -std=c++11 -O3 --fmad=false --prec-div=false --ptxas-options=-v -g -G

What am I doing wrong? Does anyone have an example?

Note: If I execute it without the debugger it still crashes, but the messages from the kernel are printed :S

File A [code with the operations to do]:

template< class Model > __global__ void myKernel( void ) {
    //    doNothing();
    if (threadIdx.x == 0 && threadIdx.y == 0) printf ("hello inside the kernel\n");
}

class myClass {

   void execute(myClassMultiGPU &domain){
     // Pass the kernel to the domain as an untyped function pointer
     domain.launch((const void*) myKernel<Model>) ;
   }
};

File B [code with the deployment information for the multiple GPUs]:

class myClassMultiGPU {

   private:
     struct ctx_t {...};
     ctx_t ctx;

   public:
     __HOST__ inline void launch(const void* func) {
        for (int i = 0; i < ctx.num_dev; i++) {
          cudaSetDevice( i );
          cudaLaunchKernel ( func,                             // Ptr to the global kernel
                             ctx.dimGrid[i], ctx.dimBlock[i],  // Sizes of the kernel
                             (void **) NULL,                   // Ptr to the arguments
                             0,                                // Shared memory
                             ctx.streams[i]);                  // Stream
        }
     }
};
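
For comparison, here is a minimal standalone sketch of the same pattern: a templated kernel passed as a const void* and launched on every device with cudaLaunchKernel. It is not a fix for the crash above. It only makes the return code of each runtime call visible, so the failing call can be identified. DummyModel, launchOnAllDevices, the int tag parameter and the 1x32 launch configuration are hypothetical names and values chosen for the illustration. It uses the default stream instead of ctx.streams[i].

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the Model template parameter.
struct DummyModel {};

// Kernel with one parameter, to show how the args array is built.
template <class Model>
__global__ void myKernel(int tag) {
    if (threadIdx.x == 0 && threadIdx.y == 0)
        printf("hello inside the kernel, tag=%d\n", tag);
}

// Launch func once on every visible device and return the first error seen.
cudaError_t launchOnAllDevices(const void* func, void** args) {
    int numDev = 0;
    cudaError_t err = cudaGetDeviceCount(&numDev);
    if (err != cudaSuccess) return err;

    for (int i = 0; i < numDev; ++i) {
        err = cudaSetDevice(i);
        if (err != cudaSuccess) return err;

        dim3 grid(1), block(32);  // assumed sizes, for illustration only
        // No dynamic shared memory, default stream of the current device.
        err = cudaLaunchKernel(func, grid, block, args, 0, 0);
        if (err != cudaSuccess) return err;

        // Wait for the kernel so that execution errors are reported here.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}

int main() {
    int tag = 7;
    void* args[] = { &tag };  // one pointer per kernel parameter, in order
    cudaError_t err =
        launchOnAllDevices((const void*) myKernel<DummyModel>, args);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

The args array holds one pointer per kernel parameter, in declaration order. If the multi-GPU version uses one stream per device, each stream has to be created while its own device is current.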

Thanks

I have experienced the same error.