How to use cuMemAlloc() in a __device__ function?

I am trying to use cuMemAlloc in a device function and I am having no success. Here is my code:

struct Combination
{
    long n;
    long k;
    long* data;
};


__device__ void Comb_Construct(long n, long k, Combination* c)
{
    c->n = n;
    c->k = k;
    cuMemAlloc(&c->data, 5*sizeof(long));
}


The error I get is:

1> error: identifier "cuMemAlloc" is undefined

I have this code in a header file that I include from my source file. How can cuMemAlloc be undefined? What am I doing wrong? The CUDA programming manual doesn't provide any clues to my problem.


cuMemAlloc is a host function. It cannot be called on the device.
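
The usual pattern is to allocate the device buffer from host code and hand the pointer to the kernel. A minimal sketch (the host-side constructor name is illustrative, not from the original code):

```cuda
#include <cuda_runtime.h>

struct Combination
{
    long  n;
    long  k;
    long* data;   // device pointer, allocated on the host
};

// Sketch: fill in the struct on the host, allocating its device
// buffer with the runtime API, then pass it to a kernel by value
// or copy it to device memory with cudaMemcpy.
void Comb_Construct_Host(long n, long k, Combination* c)
{
    c->n = n;
    c->k = k;
    cudaMalloc((void**)&c->data, 5 * sizeof(long));  // global memory
}
```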

Thanks. I also have a few questions regarding memory allocation.

  1. Is cuMemAlloc() the same as cudaMalloc()?

  2. I am using cudaMalloc on the host code to allocate memory on the device. Does this allocate memory on the global or shared memory of the device?

The reason I ask is because there are many different things I am allocating on the device. In particular, one 96 byte piece of data and another piece of data that is over 34MB. Since the 34MB will not fit into shared memory, I want that to be in global memory and the other 96 byte to be in shared memory for every block (essentially copied).

  3. How do I know how many registers my code is using? Right now either I am exceeding shared memory or my register count, because I get the following error at kernel launch:

Cuda error: Kernel execution failed in file ‘’ in line 221 : too many
resources requested for launch.

Thanks. I appreciate any help.

As far as I know, they are the same: cuMemAlloc is the driver API call and cudaMalloc is the runtime API call. Both allocate global memory on the device.


Shared memory is per block and limited to 16 KiB. You can allocate shared memory statically by declaring an array of a compile-time-known size (i.e. __shared__ int sdata[1000]) or dynamically at run time by specifying the third parameter to the kernel call (see the programming guide for details).
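
Both forms might look something like this (a sketch; the kernels and sizes are made up for illustration):

```cuda
// Static shared memory: size fixed at compile time.
__global__ void staticShared(int* out)
{
    __shared__ int sdata[1000];      // 4000 bytes per block
    sdata[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = sdata[threadIdx.x];
}

// Dynamic shared memory: size chosen at launch time.
__global__ void dynamicShared(int* out)
{
    extern __shared__ int sdata[];   // size comes from the launch
    sdata[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = sdata[threadIdx.x];
}

// The third execution-configuration parameter is the dynamic
// shared-memory size in bytes:
// dynamicShared<<<blocks, threads, threads * sizeof(int)>>>(d_out);
```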

You obviously cannot copy all 34 MiB to the shared memory of a single block at once. Hopefully, each block only needs a portion of that memory, in which case you should only load the relevant sections. If every block needs to go over the entire 34 MiB, your best bet is a sliding window technique (load in a chunk, have all threads in the block process it, then move on to the next chunk).
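
A rough sketch of the sliding-window idea (the chunk size and the processing step are placeholders):

```cuda
#define CHUNK 1024  // elements per chunk; CHUNK * sizeof(long) must fit in shared memory

__global__ void slidingWindow(const long* g_data, long numElements)
{
    __shared__ long chunk[CHUNK];

    for (long base = 0; base < numElements; base += CHUNK)
    {
        // Cooperatively load one chunk into shared memory.
        for (long i = threadIdx.x; i < CHUNK && base + i < numElements; i += blockDim.x)
            chunk[i] = g_data[base + i];
        __syncthreads();

        // ... each thread processes the chunk here (placeholder) ...

        __syncthreads();  // make sure everyone is done before the next load
    }
}
```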

Compile with the -cubin or -keep argument and examine the cubin file. It will tell you how many registers per thread are being used by your kernel. This error message might also result from requesting too much shared memory per block in the launch configuration.

Thanks MisterAnderson. I have another question.

I am trying to generate 1,712,304 threads. So for my kernel function call I have:

kernelfunction<<< 3345, 512 >>>(d_count);

This gives me a little over the number of threads that I want, but that is OK because in my kernel function I have the following:

__global__ void kernelfunction(unsigned int* g_count)
{
    unsigned int tid = blockIdx.x*512 + threadIdx.x;

    if (tid < 1712304)
    {
        *g_count = *g_count + 1;
    }
}

This is just test code I put in to see whether I am getting all of the threads that I want. In my host code, d_count is initialized to 0. When I copy the result back from the device, I get 23623 rather than 1712304. It appears that not all of the threads I want are being created. What am I doing wrong?

Thanks again.

You’re having multiple threads read and write the same memory location.

This won’t work… the writes are NOT serialized. The only promise you get is “at least one write will succeed”.

If you want each thread to increment a value, you want atomicInc… it’s useful exactly for these cases. It’s slow, since it serializes, but for debugging it can be useful to count things like in your example code.
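
On hardware that supports global atomics (compute capability 1.1 and up), the counting kernel might look like this sketch, using atomicAdd on the counter:

```cuda
__global__ void kernelfunction(unsigned int* g_count)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < 1712304)
        atomicAdd(g_count, 1u);  // serialized, but every increment lands
}
```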

One other thing: you’ve hardwired your threads-per-block size of 512 into the function. That may be correct now, but it’s going to bite you once you change the kernel invocation. It’s safer to use

const unsigned int tid=threadIdx.x+blockDim.x*blockIdx.x;

Thanks SPWorley.

It seems I cannot use the atomicInc function because I have an 8800 GTS, and its compute capability is only 1.0. Is there an easier way around this problem other than coding my own semaphore?

The best way (not the easiest way, though) is to try to formulate the algorithm in a way to avoid the need for global atomic writes entirely. That’s not always possible, but a software semaphore is likely to be slow and/or tricky to get right when the hardware doesn’t support atomic operations.

The contention-free version of an algorithm might appear to be less efficient (redundant reads, multiple passes), but the massive parallelism of the hardware will help hide that.

You can do the same thing without atomics with a little more hassle… look at the reduction SDK sample. That uses O(log N) passes to boil the results from the threads down to one distilled number. It’s overkill for just a simple increment count, but it’ll work on all hardware and isn’t THAT bad. It’s actually FASTER than atomics when all the threads need to contribute and the threads don’t need to get feedback about “which” number they were.
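
The per-block part of that approach might be sketched like this (a simplified version of the SDK sample; it assumes blockDim.x is a power of two and that shared memory is sized at launch):

```cuda
// Each thread contributes 1 if it is "live"; the block sums those
// contributions in shared memory with a tree reduction, and thread 0
// writes one partial sum per block. A second pass (or a host-side sum)
// combines the per-block partials.
__global__ void countThreads(unsigned int* g_partial, unsigned int limit)
{
    extern __shared__ unsigned int sdata[];

    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (tid < limit) ? 1u : 0u;
    __syncthreads();

    // log2(blockDim.x) halving steps.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        g_partial[blockIdx.x] = sdata[0];  // one count per block
}
```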

But yeah, it’s a lot more complex than the simple one line increment. Luckily you have code examples, and it’s an important, common, and useful technique to know anyway.