Can't launch kernel

After I found the cause of the warning

"nvlink warning : Stack size for entry function...cannot be statically determined

it was the virtual declaration of a destructor. When I removed the "virtual", the warning disappeared,
but now when I try to run/debug the kernel I receive the following error:

warning: Cuda API error detected: cudaLaunch returned (0x2)

I tried to debug the launch and added the --keep flag,
but I only got as far as this wrapper in cuda_runtime.h:
template<class T>
__inline__ __host__ cudaError_t cudaLaunch(
  T *func
)
{
  return ::cudaLaunch((const void*)func);
}

How can I go deeper to find the reason for the error?
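Before stepping into the wrapper, it is usually more productive to check the return code of every runtime call and of the launch itself; a minimal sketch (the kernel name and sizes here are hypothetical, not your actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the one that fails to launch.
__global__ void myKernel(int *data) { data[threadIdx.x] = threadIdx.x; }

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s (%d)\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err), (int)err);                 \
        }                                                               \
    } while (0)

int main() {
    int *d_data = NULL;
    CUDA_CHECK(cudaMalloc(&d_data, 32 * sizeof(int)));
    myKernel<<<1, 32>>>(d_data);
    CUDA_CHECK(cudaGetLastError());      // catches launch-time errors (e.g. 0x2)
    CUDA_CHECK(cudaDeviceSynchronize()); // catches errors during execution
    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

`cudaGetErrorString` turns the bare 0x2 into a readable message at the exact call site that produced it.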

  • if I add back the virtual that I removed, both release and debug work
  • the profiler does not work, with or without the warning
  • Updates:

After some optimizations we have a working profiler with the 6.5 toolkit (the same code does not profile in 7.5, it just freezes at kernel launch, though in both 6.5 and 7.5 the code runs well in release and debug).
I found that the kernel uses many more registers per thread:

  • the ptxas info indicates that X registers are required for the kernel
  • the properties in the profiler output show approx. 3x-10x more registers, depending on the kernel.

I found this post https://devtalk.nvidia.com/default/topic/524075/inconsistency-issue-between-visual-profiler-and-nvcc-compiler-ptxas-info/
that describes similar problems.
The kernels are independent (they do not call other kernels),
but I have some operator new calls to dynamically allocate classes. Can operator new for multiple classes cause such high register usage, much higher than and different from the ptxas output?
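One thing worth knowing about device-side `new`: it allocates from a separate device heap (8 MB by default), not from the memory nvidia-smi shows as free, and exhausting that heap is a classic way to hit allocation failures. A hedged sketch (the class and sizes are made up for illustration):

```cuda
#include <cuda_runtime.h>

// Hypothetical class allocated with device-side new inside a kernel.
struct Particle { float x, y, z; };

__global__ void allocInKernel() {
    Particle *p = new Particle;   // draws from the device malloc heap, not cudaMalloc memory
    if (p) {
        p->x = 1.0f;
        delete p;                 // per-thread allocations must be freed per thread
    }
}

int main() {
    // The device heap defaults to 8 MB; raise it *before* the first kernel
    // launch if kernels allocate many objects with new/malloc.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);
    allocInKernel<<<64, 256>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

If many threads each allocate objects, the default 8 MB heap can run out even though gigabytes of device memory remain free.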

Error 2 is identified here:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038

“cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.”

I don’t think you’re going to be able to further understand that with a debugger. Since the kernel is not launching, you won’t be able to learn anything by trying to debug device code anyway. You should take a look at the memory allocations required by your kernel launch. You may be able to begin this process by compiling your code with:

-Xptxas -v

and study the output associated with the kernel that is failing to launch.

I tried, and made a lot of optimizations based on the output of -v;
I reduced stack usage and currently have 39 registers in debug mode and approx. 80 registers in release.
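If register pressure turns out to be the limiter, one knob worth trying (a sketch with a hypothetical kernel, not your actual code) is `__launch_bounds__`, which tells ptxas the maximum block size so it can budget registers per thread:

```cuda
#include <cuda_runtime.h>

// __launch_bounds__(256) promises ptxas this kernel will never be launched
// with more than 256 threads per block, letting it cap register usage.
// (The nvcc flag -maxrregcount=N is a coarser, whole-file alternative.)
__global__ void __launch_bounds__(256) boundedKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 2.0f;
}
```

The -Xptxas -v output will reflect the new register count, so the effect is easy to verify per kernel.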

What bothers me is that if the virtual destructor is there, the code runs well, but with the warning.

If I run the kernel with the virtual destructor, I have over 2 GB of memory free out of 3 GB (according to nvidia-smi).

What can the connection between the warning and the error be?
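Besides nvidia-smi, you can query free/total device memory from inside the program right before the launch, which rules out timing differences between the tool and the failing call; a small sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeBytes = 0, totalBytes = 0;
    // Reports memory as seen by this CUDA context, at this exact moment.
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("free: %.1f MB / total: %.1f MB\n",
           freeBytes / (1024.0 * 1024.0),
           totalBytes / (1024.0 * 1024.0));
    return 0;
}
```

Note that this measures the global memory pool (cudaMalloc), which is separate from the device-side new/malloc heap.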

How about using the C++ API, cudaLaunch(func), directly (without your wrapper function)?

What will change? It will only save a function call; nothing else changes, the kernel will be launched the same way.

Anyway, I tried it and failed:
instead of

MyKernel<<<...>>>(...)

i used

::cudaLaunch("UH\211åH\203ì@èz¢ûÿH\211}øH\211uðH\211Uè\211MäD\211EàL\211MØL\213MØD\213Eà\213MäH\213UèH\213uðH\213EøH\213} H\211|$\020H\213}\030H\211|$\bH\213}\020H\211<$H\211Çè3þÿÿÉÃUH\211åH\203ì0è\e¢ûÿH\211}øH\211uðH\211Uè\211MäD\211EàL\211MØH\215")

and received an error on kernel launch:

warning: Cuda API error detected: cudaLaunch returned (0x9)
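For what it's worth, `cudaLaunch` expects the address of the `__global__` function (the runtime looks the kernel up by that pointer), and on its own it does not replace the `<<<>>>` syntax: under the old runtime API the compiler expands a triple-chevron launch into a configure call, argument setup, and then the launch. A hedged sketch of that expansion (hypothetical kernel; this API is deprecated in newer toolkits):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(int value) { /* ... */ }

int main() {
    // Roughly what myKernel<<<1, 32>>>(42) expands to under the old runtime API:
    int value = 42;
    cudaConfigureCall(dim3(1), dim3(32));            // grid/block (shared mem and stream default to 0)
    cudaSetupArgument(&value, sizeof(value), 0);     // push each argument at its offset
    ::cudaLaunch((const void *)myKernel);            // pass the function's address, not a string
    cudaDeviceSynchronize();
    return 0;
}
```

Calling `cudaLaunch` without the configure/setup steps, or with something other than the kernel's address, is expected to fail.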

[b]there are updates in the first post[/b]

hmmm… I usually launch the kernel from a .cpp file (not a .cu) as follows:

// color_to_grayscale<<<grid, block>>>(g, c, width, height, stride)
    void* args[] = { &g, &c, &width, &height, &stride };
    cudaLaunchKernel((const void*)&color_to_grayscale, grid, block, args, 0, 0);
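A self-contained version of that pattern might look like this (the kernel body and its RGBA input layout are assumptions, not the poster's actual code):

```cuda
#include <cuda_runtime.h>

// Simplified stand-in for the real grayscale kernel; assumes 4-byte RGBA input.
__global__ void color_to_grayscale(unsigned char *g, const unsigned char *c,
                                   int width, int height, int stride) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        const unsigned char *px = c + y * stride + 4 * x;
        g[y * width + x] =
            (unsigned char)(0.299f * px[0] + 0.587f * px[1] + 0.114f * px[2]);
    }
}

cudaError_t launch(unsigned char *g, const unsigned char *c,
                   int width, int height, int stride) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    // cudaLaunchKernel takes the kernel's address plus an array of pointers
    // to the arguments, in declaration order; shared mem and stream are 0 here.
    void *args[] = { &g, &c, &width, &height, &stride };
    return cudaLaunchKernel((const void *)color_to_grayscale,
                            grid, block, args, 0, 0);
}
```

One advantage of this form is that the launch can live in host-only .cpp code compiled without nvcc, as long as the kernel symbol is visible at link time.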