Dynamic Parallelism


I just moved from a GeForce 580 to a Tesla K20c and wanted to take advantage of dynamic parallelism. However, when I tried to compile my test.cu file to test.ptx under Windows 7, I got the following error.

nvcc : warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.
test.cu(4): warning: variable "index" was declared but never referenced

test.cu(14): error: calling a __global__ function("mykernel_child") from a __global__ function("mykernel") is only allowed on the compute_35 architecture or above

1 error detected in the compilation of "C:/Users/.../Local/Temp/tmpxft_00000d0c_00000000-8_test.cpp1.ii".

I have CUDA Toolkit 6.0 installed and converted test.cu using

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\bin\nvcc" -ptx -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -o test.ptx test.cu

Is there anything wrong with my setting? Thanks in advance!

I carefully checked the machine and found that there are two GPUs installed: a Tesla K20c and an ION, with the ION responsible for video output. So I guess the problem is that the compiler doesn’t know the Tesla K20c exists.

The compiler is not aware of which GPUs are in the system, and does not need to be. However, you need to tell the compiler which architecture(s) it should build for. K20 is sm_35, but your build log seems to indicate you are building for sm_10. I suggest adding an -arch or -gencode switch (for example, -arch=sm_35) so that the compiler targets the sm_35 or compute_35 platform (= compute capability 3.5).
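If you are unsure which compute capability each installed GPU reports, a small query program can print it. The following is only a sketch: the file name query_cc.cu and the helper supportsDynamicParallelism are made up for illustration, while the runtime calls (cudaGetDeviceCount, cudaGetDeviceProperties) are the standard CUDA runtime API.

```cuda
// Assumed file name: query_cc.cu
// Build: nvcc query_cc.cu -o query_cc
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: dynamic parallelism needs compute capability 3.5 or higher.
static bool supportsDynamicParallelism(int major, int minor) {
    return major > 3 || (major == 3 && minor >= 5);
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        // Print the name, the compute capability, and the matching -arch value.
        std::printf("device %d: %s, compute capability %d.%d -> -arch=sm_%d%d, "
                    "dynamic parallelism: %s\n",
                    dev, prop.name, prop.major, prop.minor, prop.major, prop.minor,
                    supportsDynamicParallelism(prop.major, prop.minor) ? "yes" : "no");
    }
    return 0;
}
```

The output is only a hint for choosing the -arch value; the compiler itself still builds for whatever architecture you pass on the command line.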

Thanks, njuffa! ^_^


I have run into another problem and need help. After compiling the following code with “nvcc -ptx dynamic_parallelism.cu -gencode arch=compute_35,code=sm_35” on Linux, I found that the generated .ptx file cannot be used on a Windows machine.

__global__ void mykernel_child(){
        int index;
}

__global__ void mykernel(const int N, double *a, double *b, double *c){
        int index = threadIdx.x;
        if(index < N)
                c[index] = a[index] + b[index];

        mykernel_child <<< 1,1 >>> ();
}

I have a K20c installed in that machine and used MATLAB to load the PTX file. However, MATLAB keeps reporting

Error using parallel.gpu.CUDAKernel
An error occurred during PTX compilation of <image>.
The information log was:

The error log was:
ptxas : fatal error : Unresolved extern function 'cudaLaunchDevice

Error in Test_CUDA_Matlab_Interface (line 4)
k = parallel.gpu.CUDAKernel('dynamic_parallelism.ptx',
'dynamic_parallelism.cu', 'mykernel')

My guess is that the problem is that the PTX is not relocatable. So I tried compiling the same toy code on the Windows machine instead, and that failed too. The reported error message is below:

dynamic_parallelism.cu(14): error: kernel launch from __device__ or __global__ functions requires separate compilation mode

Many thanks in advance. Any suggestions would be greatly appreciated!


I use dynamic parallelism with CUDA 5 and 6 on compute capability 3.5 and 5.0 architectures.

To compile my code, I use:

nvcc -arch=sm_35 -rdc=true -lcudadevrt name.cu (with cuda 5 and 3.5 arch)
nvcc -arch compute_50 -rdc=true name.cu (with cuda 6 and 5 arch)
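For reference, here is a minimal sketch of a complete file that the first command line above could build. The file name dp_demo.cu, the kernel names and the reference helper are made up for illustration; the build line in the comment uses the same -arch=sm_35, -rdc=true and -lcudadevrt switches.

```cuda
// Assumed file name: dp_demo.cu
// Build: nvcc -arch=sm_35 -rdc=true dp_demo.cu -lcudadevrt -o dp_demo
#include <cstdio>
#include <cuda_runtime.h>

// Child grid: doubles each element written by the parent.
__global__ void childKernel(double *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] *= 2.0;
}

// Parent grid: element-wise add, then one thread launches the child grid.
__global__ void parentKernel(int n, const double *a, const double *b, double *c) {
    int i = threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
    __syncthreads();  // all writes of this block are done before the launch
    if (threadIdx.x == 0) childKernel<<<1, n>>>(c, n);
}

// Host-side reference for the combined parent + child result.
static double reference(double a, double b) { return 2.0 * (a + b); }

int main() {
    const int n = 32;
    double ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 10.0 * i; }

    double *da, *db, *dc;
    size_t bytes = n * sizeof(double);
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    parentKernel<<<1, n>>>(n, da, db, dc);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // parent completes only after its child
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hc[i] != reference(ha[i], hb[i])) ++bad;
    std::printf(bad ? "MISMATCH\n" : "OK\n");

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return bad != 0;
}
```

Without -rdc=true, nvcc rejects the device-side launch with the "requires separate compilation mode" error quoted earlier in the thread.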

When I call a kernel from another kernel, performance is very low.
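One way to put a number on this is to time a parent kernel with and without device-side launches. This is only a rough sketch, not a rigorous benchmark: the file name dp_timing.cu, the kernel names and the helpers timeParent and perLaunchMs are made up for illustration, and the grid sizes are arbitrary.

```cuda
// Assumed file name: dp_timing.cu
// Build: nvcc -arch=sm_35 -rdc=true dp_timing.cu -lcudadevrt -o dp_timing
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyChild() {}

// Each parent thread optionally launches one empty child grid.
__global__ void parent(bool launchChild) {
    if (launchChild) emptyChild<<<1, 1>>>();
}

// Time one parent launch (including any child grids) with CUDA events.
static float timeParent(bool launchChild, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    parent<<<blocks, threads>>>(launchChild);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Extra time attributed to each child launch, in milliseconds.
static float perLaunchMs(float withChild, float withoutChild, int launches) {
    return (withChild - withoutChild) / launches;
}

int main() {
    const int blocks = 1, threads = 32;    // 32 child launches in total
    timeParent(true, blocks, threads);     // warm-up run, result discarded
    float without = timeParent(false, blocks, threads);
    float with = timeParent(true, blocks, threads);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "run failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("parent only: %f ms, with child launches: %f ms, per launch: %f ms\n",
                without, with, perLaunchMs(with, without, blocks * threads));
    return 0;
}
```

Because the child here does no work, the difference mostly reflects launch overhead; a real child kernel would need enough work per launch to amortize it.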