but the compiler (CUDA 5.0) shows this error:
“…error: calling a host function(“malloc”) from a global function(“kernelPersistenceAlgorithm”) is not allowed…”
As far as I know, it is allowed to use malloc() inside a kernel function. Does anyone know why this happens, or what I should set in the project/compiler settings?
This could be the problem. I thought compute capability 3.0 was new enough for this. I didn't find any document mentioning this. Can you tell me how to find the SM number of my GPU, a GTX 660 Ti? Is it in the card specification, or can I run an NVIDIA CUDA sample project to find it?
Run the deviceQuery sample, it’ll tell you your compute capability.
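If you'd rather check it programmatically than run the sample, a minimal sketch using cudaGetDeviceProperties (device index 0 assumed):

```cuda
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // prop.major and prop.minor together form the compute capability,
    // e.g. 3.0 for a GTX 660 Ti.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```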
With that said, the problem most likely isn't your card (since you're using a Kepler card), but rather that you're compiling your code to target a lower compute capability. You should check your makefile or project configuration to make sure you're not targeting capability 1.x (i.e. make sure you have the compiler flags "-gencode arch=compute_20,code=sm_20" or higher).
Alrikai's answer is almost certainly correct. Remember that at compile time, nvcc doesn't care what device is in your machine, or even whether you have a device at all. So you need to explicitly specify the compute capability you are compiling for.
But, beyond that, your example will still not work. Dynamic memory allocation on the device works only for global device memory, not for shared memory. If you want dynamic shared memory, you’d have to implement your own allocator by grabbing a large static block of shared memory then suballocating that yourself inside your kernel. There are some, but not many, reasons you’d want to do that.
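A minimal sketch of that suballocation idea (illustrative only; the pool size and the per-thread chunk layout are assumptions, not from the thread):

```cuda
__global__ void kernelWithSuballocation()
{
    // One large static block of shared memory, carved up manually.
    __shared__ unsigned char pool[4096];

    // Hand-rolled "allocator": give each thread a fixed-size slice.
    const int chunk = 4096 / blockDim.x;
    unsigned char *my_slice = pool + threadIdx.x * chunk;

    // Use my_slice as per-thread scratch space...
    my_slice[0] = static_cast<unsigned char>(threadIdx.x);
    __syncthreads();
}
```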
I ran the deviceQuery sample again and it reports that the compute capability of this GPU (GTX 660 Ti) is 3.0.
I also checked the compiler settings and found it is including all the sm_xx flags:
-gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20"
-gencode=arch=compute_30,code="sm_30,compute_30" --machine 64 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"
-DWIN32 -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0/include" -I"…/…/…/common/inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -maxrregcount=0
--compile -o "x64\Debug/CXCudaKernel.cu.obj" CXCudaKernel.cu
So from what I understand, the SM number is not the same as the compute capability number, right? Then what is my SM number?
By the way, as you said, I don't have to use malloc() to allocate shared memory as long as there is another way. In my case, I need a shared-memory array whose size matches the thread block size in the kernel (the size can change right before the kernel is called…). How can I pass this size into the kernel so it can be treated as a constant, letting me declare the shared memory as: __shared__ float s_data[thread_size];
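One common way to do this (a sketch, not code from this thread) is to declare the array as extern __shared__ with no size, and pass the byte count as the third kernel launch-configuration parameter; the names thread_size and s_data are just illustrative:

```cuda
#include <cstdio>

// Unsized extern declaration: the actual size is supplied by the
// third launch-configuration parameter at kernel call time.
__global__ void kernelPersistenceAlgorithm(float *out)
{
    extern __shared__ float s_data[];

    int tid = threadIdx.x;
    s_data[tid] = static_cast<float>(tid);  // fill shared memory
    __syncthreads();

    out[tid] = s_data[tid];
}

int main()
{
    int thread_size = 256;  // changeable right before calling the kernel
    float *d_out;
    cudaMalloc(&d_out, thread_size * sizeof(float));

    // Third <<< >>> argument: dynamic shared memory size in BYTES.
    kernelPersistenceAlgorithm<<<1, thread_size,
                                 thread_size * sizeof(float)>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```

Note that the size is per block, and all extern __shared__ arrays in a kernel alias the same starting address, so if you need more than one you have to offset into a single buffer yourself.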