in-kernel malloc: no kernel launch although code="sm_21,compute_20"

hello!

I’m new to CUDA and I’ve written a program that works quite well if I use a (big enough) array instead of allocating memory dynamically in the kernel.

I think my graphics card (a GT 540M) supports compute capability 2.1, so I would expect the in-kernel allocation described in the CUDA C Programming Guide on p. 138 (I use the CUDA toolkit 4.0) to work. I can compile it, but at runtime it takes about 6 times longer than the array version of this kernel. And when I profile the program, this kernel doesn’t show up as launched, while all the other kernels are listed there. The array version of this kernel does show up. Both versions, malloc and array, run correctly with equal results in my tests.
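For reference, here is a minimal sketch of the kind of in-kernel allocation I mean, along the lines of the programming guide example (simplified; the names and sizes are placeholders, not my actual code):

__global__ void mallocTest()
{
    // Each thread allocates a small buffer from the device heap.
    char *ptr = (char *)malloc(16);
    if (ptr == NULL)
        return;                 // allocation can fail if the device heap is too small
    ptr[0] = (char)threadIdx.x; // use the memory
    free(ptr);                  // and release it again
}

int main()
{
    // The device heap has a default size (8 MB); it can be enlarged before the launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    mallocTest<<<1, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}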

So I’m wondering: does this kernel with malloc and free get emulated by the CPU or something like that?

I set the Code Generation flag to compute_20,sm_21 and I’m using Visual Studio 2010.

Does somebody have an idea what I could have done wrong? Or how I can continue to figure out the problem? Or is this a “profiler misinterpretation” and malloc really does take 6 times more time?

I hope you can help me out here; I need the dynamic allocation because of string operations.

best regards
gunther

Here is the compile log for this kernel:

1> Compiling CUDA source file kernel.cu…
1>
1> D:\Eigene Dateien\Visual Studio 2010\Projects\mdldg\mdldg>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\nvcc.exe" -gencode=arch=compute_20,code="sm_21,compute_20" --use-local-env --cl-version 2010 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include" -G0 --keep-dir "Debug" -maxrregcount=0 --machine 32 --compile -arch=sm_21 -D_NEXUS_DEBUG -g -Xcompiler "/EHsc /nologo /Od /Zi /MDd " -o "Debug\kernel.cu.obj" "D:\Eigene Dateien\Visual Studio 2010\Projects\mdldg\mdldg\kernel.cu"
1> tmpxft_00001fcc_00000000-11_kernel.ii
1> mdldg.vcxproj -> D:\Eigene Dateien\Visual Studio 2010\Projects\mdldg\Debug\mdldg.exe
1> copy “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\cudart*.dll” “D:\Eigene Dateien\Visual Studio 2010\Projects\mdldg\Debug”
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\cudart32_40_17.dll
1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin\cudart64_40_17.dll
1> 2 file(s) copied.

I’d hazard a guess that the kernel is in fact failing (possibly because there is not enough memory available for the dynamic allocation) and the apparently correct results are just leftovers from the previous kernel invocation with static allocation. Do you check all return codes for errors?
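Something along these lines after every kernel launch (a generic pattern with a hypothetical kernel name, not your actual code):

myKernel<<<grid, block>>>(args);            // hypothetical launch
cudaError_t err = cudaGetLastError();       // catches launch/configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();              // catches errors during kernel execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));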

thank you for your reply!

Yes. Because I had never written a CUDA program before, I used very strict error logging to make this kind of error easy to find, and I got rid of them quite early on. I compiled the program with the array version and with the malloc version; I haven’t tested them together in one program yet.

I’ve spent some more time with the profiler and figured out that when I use “System Trace” instead of “Profiler”, I also get the result for the kernel with the dynamic allocation. It shows the statistics in exactly the same way as for the other kernels; it just takes more time than the array version. So maybe the malloc simply takes more time, or there are other reasons I’m not yet able to understand. I need more experience with these powerful tools.

The application now runs about 30% faster than omp parallel with 8 threads… so there is still room to improve that ;)

Interesting facts:
Using a constant array passed as a pointer in the argument list of the kernel (__global__ void kernel(arg1, …, const int *var, …)) is faster than using a constant device variable (__device__ __constant__ int *var); for my application about 20% (see the first sketch below).
My application needs about 70 ms (the array version); stream compaction via atomicAdd was really good enough in this case (see the second sketch below). (I first tried to find a more elegant method, but this took only 1 ms more…)
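First sketch, illustrating the two variants I compared (the kernel names and the lookup are made up for illustration):

// Variant A: table passed as a const pointer argument
__global__ void kernelA(float *out, const int *table)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)table[i % 16];
}

// Variant B: pointer held in a __constant__ variable; the host sets it once via
// cudaMemcpyToSymbol(d_table, &devPtr, sizeof(devPtr));
__device__ __constant__ int *d_table;

__global__ void kernelB(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)d_table[i % 16];
}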
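Second sketch, showing the kind of atomicAdd-based stream compaction I mean (the keep-predicate is just an example; *count must be zeroed before the launch):

__global__ void compact(const float *in, float *out, int *count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {        // keep-predicate: example only
        int pos = atomicAdd(count, 1);  // grab a unique output slot
        out[pos] = in[i];               // note: output order is not preserved
    }
}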

best,
gunther