Error message on allocating __shared__ memory in kernel, CUDA 5.0

According to the sample code, I should be able to dynamically allocate shared memory inside the kernel like this:

__shared__ int* s_i_data;
if (threadIdx.x == 0)
{
    s_i_data = (int*)malloc(blockDim.x * 4);
}
__syncthreads();

but the compiler (CUDA 5.0) reports this error:
"…error: calling a __host__ function("malloc") from a __global__ function("kernelPersistenceAlgorithm") is not allowed…"

Using malloc() inside a kernel function is supposed to be allowed. Does anyone know why this fails, or what I should set in the project/compiler?

You need to target SM 2.0 or newer to use malloc and printf.
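
As a rough illustration of what becomes available once the target is SM 2.0 or newer, here is a minimal sketch of device-side malloc and free. The kernel name heapScratchKernel and the host helper runHeapScratch are made up for this example. Note that device-side malloc returns memory from the device heap, which lives in global memory; only the pointer variable is declared __shared__ here.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each block allocates one scratch buffer from the
// device heap (global memory) and shares the pointer with all its threads.
__global__ void heapScratchKernel(int* out)
{
    __shared__ int* s_i_data;   // the pointer is shared; the buffer is not

    if (threadIdx.x == 0) {
        s_i_data = (int*)malloc(blockDim.x * sizeof(int));
    }
    __syncthreads();

    // Device-side malloc can fail; the check is uniform across the block
    // because every thread reads the same shared pointer.
    if (s_i_data == NULL) {
        return;
    }

    s_i_data[threadIdx.x] = (int)threadIdx.x;
    __syncthreads();

    // Read the value written by the mirrored thread of the same block.
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        s_i_data[blockDim.x - 1 - threadIdx.x];
    __syncthreads();            // nobody may still be reading when we free

    if (threadIdx.x == 0) {
        free(s_i_data);
    }
}

// Hypothetical host helper: returns 0 on success, 1 on any CUDA error.
int runHeapScratch(int blocks, int threads, int* h_out)
{
    size_t bytes = (size_t)blocks * threads * sizeof(int);
    int* d_out = NULL;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) return 1;
    cudaMemset(d_out, 0xFF, bytes);          // -1 marks "never written"

    heapScratchKernel<<<blocks, threads>>>(d_out);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

This only compiles if every target passed to nvcc is compute capability 2.0 or higher; a compute_10 target in the same build triggers the error quoted in the question.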

This could be the problem. I thought compute capability 3.0 was new enough for this, and I didn’t find any document that mentions it. How can I find the SM number of my GPU, a GTX 660 Ti? Is it in the card specification, or can I run one of the NVIDIA CUDA sample projects to find it?

Run the deviceQuery sample; it’ll tell you your compute capability.
That said, the problem most likely isn’t your card (since you’re using a Kepler card), but rather that you’re compiling your code for a lower target. Check your makefile or project configuration to make sure you’re not targeting capability 1.x (i.e. make sure you have the compiler flags “-gencode arch=compute_20,code=sm_20” or higher).
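
If you would rather check from your own code than run the sample, the same information is available through the runtime API. A minimal sketch (the function name computeCapability is just for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns major * 10 + minor for the given device (for example 30 for
// compute capability 3.0), or -1 if the device cannot be queried.
int computeCapability(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return prop.major * 10 + prop.minor;
}
```

Calling computeCapability(0) reports the first GPU in the system. This describes the hardware at run time only; it says nothing about which targets the code was compiled for.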

Alrikai’s answer is almost certainly correct. Remember that during the compile phase, nvcc doesn’t care what device is in your machine, or even whether you have a device at all. So you need to explicitly provide the compute level you are compiling for.

But beyond that, your example still won’t do what you want: only the pointer would live in shared memory. Dynamic memory allocation on the device works only for global device memory, not for shared memory. If you want dynamic shared memory, you’d have to implement your own allocator by grabbing a large static block of shared memory and then suballocating it yourself inside your kernel. There are some, but not many, reasons you’d want to do that.
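
To make the suballocation idea concrete, here is a minimal sketch of a bump allocator over a static shared block. Everything named here (POOL_BYTES, sharedAlloc, poolKernel, runPool) is hypothetical, the 4096-byte pool size is an arbitrary assumption, and there is no way to free individual slices. It relies on every thread of the block making the same calls in the same order, so that all threads compute identical offsets.

```cuda
#include <cuda_runtime.h>

#define POOL_BYTES 4096   // assumed pool size, fixed at compile time

// Hypothetical bump allocator: returns the next 16-byte-aligned slice of
// the pool, or NULL when the pool is exhausted.
__device__ void* sharedAlloc(unsigned char* pool, size_t* offset, size_t bytes)
{
    size_t aligned = (*offset + 15) & ~(size_t)15;
    if (aligned + bytes > POOL_BYTES) return NULL;
    *offset = aligned + bytes;
    return pool + aligned;
}

__global__ void poolKernel(float* out)
{
    __shared__ __align__(16) unsigned char pool[POOL_BYTES];
    size_t offset = 0;   // per-thread copy, identical in every thread

    // Two arrays carved out of the same shared block.
    float* a = (float*)sharedAlloc(pool, &offset, blockDim.x * sizeof(float));
    int*   b = (int*)sharedAlloc(pool, &offset, blockDim.x * sizeof(int));
    if (a == NULL || b == NULL) return;   // uniform across the block

    a[threadIdx.x] = 0.5f * threadIdx.x;
    b[threadIdx.x] = blockDim.x - 1 - threadIdx.x;
    __syncthreads();

    // Each thread reads the float written by its mirrored thread.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a[b[threadIdx.x]];
}

// Hypothetical host helper: returns 0 on success, 1 on any CUDA error.
int runPool(int blocks, int threads, float* h_out)
{
    size_t bytes = (size_t)blocks * threads * sizeof(float);
    float* d_out = NULL;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) return 1;

    poolKernel<<<blocks, threads>>>(d_out);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

The pool size still has to be known at compile time, which is the main limitation of this approach.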

I ran the deviceQuery sample again and it reports that the compute capability of this GPU (GTX 660 Ti) is 3.0.
I also checked the compiler settings and found that they include all the sm_xx flags. The full command line is:

-gencode=arch=compute_10,code="sm_10,compute_10" -gencode=arch=compute_20,code="sm_20,compute_20"
-gencode=arch=compute_30,code="sm_30,compute_30" --machine 64 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"
-DWIN32 -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0/include"
-I"…/…/…/common/inc" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -maxrregcount=0
--compile -o "x64\Debug/CXCudaKernel.cu.obj" CXCudaKernel.cu
So, as I understand it, the SM number is not the same as the compute capability number, right? What is my SM number, then?

By the way, as you said, I don’t have to use malloc() to allocate shared memory as long as there is another way. In my case, the shared memory array in the kernel needs to be the same size as the thread count, and that size can change right before the kernel is called. How can I pass this size into the kernel so that it can be treated as a constant, and the shared memory can be declared as:
    __shared__ float s_data[thread_size];

One thing you can do is declare your shared memory as

extern __shared__ float s_mem[];

Then when you launch your kernel you can specify the amount of shared memory to use as the 3rd launch parameter, i.e.

kernel (((grid_size, block_size, smem_size))) (parameter list);

where smem_size is the size in bytes to be (dynamically) allocated per block.
See the Programming Guide in the CUDA Toolkit Documentation for more details.

NOTE: I used (((…))) as the launch configuration to denote the triple-chevron syntax, as the forums don’t do well with less-than symbols
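
For reference, here is a minimal sketch of that pattern with the launch written in actual triple-chevron form. The kernel reverseInBlock and the host helper runReverse are invented for illustration; the size is chosen on the host at run time and passed as the third launch parameter.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: reverses the elements handled by each block,
// staging them in dynamically sized shared memory.
__global__ void reverseInBlock(const float* in, float* out)
{
    extern __shared__ float s_mem[];   // sized by the third launch parameter

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    s_mem[threadIdx.x] = in[gid];
    __syncthreads();

    out[gid] = s_mem[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical host helper: returns 0 on success, 1 on any CUDA error.
// h_in and h_out must each hold blocks * threads floats.
int runReverse(const float* h_in, float* h_out, int blocks, int threads)
{
    size_t bytes = (size_t)blocks * threads * sizeof(float);
    size_t smem_size = threads * sizeof(float);   // bytes per block

    float* d_in = NULL;
    float* d_out = NULL;
    if (cudaMalloc((void**)&d_in, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }

    cudaError_t err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverseInBlock<<<blocks, threads, smem_size>>>(d_in, d_out);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Because smem_size is an ordinary run-time value, the block size can change between launches without recompiling. A kernel has only one such extern array, so several logical arrays would have to be carved out of it by offset.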

I saw this in one of the CUDA samples; I’ll give it a try.

Thanks.