cuMemAlloc() How to use in device

Pittsburgh · June 28, 2008, 1:41am

I am trying to use cuMemAlloc in a device function and I am having no success. Here is my code:

struct Combination

{

	long n;

    long k;

	long* data;

};

__device__ void Comb_Construct(long n, long k, Combination* c)

{

  c->n = n;

  c->k = k;

 cuMemAlloc(&c->data, 5*sizeof(long));

}

The error I get is:

comb.cu(13): error:

1> identifier “cuMemAlloc” is undefined

I have this code in a comb.cu file that I include in my kernel.cu file. How is it that cuMemAlloc is undefined? What am I doing wrong? The CUDA programming manual doesn’t provide any clues to my problem.

Thanks.

MisterAnderson42 · June 28, 2008, 2:23am

cuMemAlloc is a host function. It cannot be called on the device.

Pittsburgh · June 28, 2008, 8:31pm

Thanks. I also have a few questions regarding memory allocation.

Is cuMemAlloc() the same as cudaMalloc()?
I am using cudaMalloc on the host code to allocate memory on the device. Does this allocate memory on the global or shared memory of the device?

The reason I ask is because there are many different things I am allocating on the device. In particular, one 96 byte piece of data and another piece of data that is over 34MB. Since the 34MB will not fit into shared memory, I want that to be in global memory and the other 96 byte to be in shared memory for every block (essentially copied).

How do I know how many registers my code is using? Right now either I am exceeding shared memory or my register count because I get the following compile error:

Cuda error: Kernel execution failed in file ‘template.cu’ in line 221 : too many
resources requested for launch.

Thanks. I appreciate any help.

MisterAnderson42 · June 28, 2008, 10:55pm

As far as I know, they are the same.

Global.

Shared memory is per block and limited to 16 KiB. You can allocate shared memory statically by declaring an array of a compile time known size (i.e. shared int sdata[1000]) or dynamically at run time by specifying the third parameter to the kernel call (see the programming guide for details).

You obviously cannot copy all 34MiB to the shared memory of a single block at once. Hopefully, each block only needs a portion of that memory, in which case you should only load the relavent sections. If every block needs to go over the entire 34 MiB, your best bet is a sliding window technique (load in a chunk, have all threads in the block process it then move on to the next chunk).

Compile with the -cubin or -keep argument and examine the cubin file. It will tell you how many registers per thread are being used by your kernel. This error message might also result from requesting too much shared memory per block in the launch configuration.

Pittsburgh · June 28, 2008, 11:06pm

Thanks MisterAnderson. I have another question.

I am trying to generate 1,712,304 threads. So for my kernel function call I have:

kernelfunction<<< 3345, 512 >>>(d_count);

This gives me a little over the number of threads that I want, but that is OK because in my kernel function I have the following:

__global__ void kernelfunction(unsigned int* g_count)

{

   unsigned int tid = blockIdx.x*512 + threadIdx.x;

   if (tid < 1712304)

    {

     *g_count = *g_count + 1;

    }

  __syncthreads();

}

This is just test code I put in to see if I am getting all of the threads that I want. In my host code d_count is initialized to 0. When I copy the result back from the device, I get the result 23623 rather than 1712304. It appears every thread that I want is not created. What am I doing wrong?

Thanks again.

SPWorley · June 28, 2008, 11:32pm

Thanks MisterAnderson. I have another question.

I am trying to generate 1,712,304 threads. So for my kernel function call I have:
kernelfunction<<< 3345, 512 >>>(d_count);
This gives me a little over the number of threads that I want, but that is OK because in my kernel function I have the following:
__global__ void kernelfunction(unsigned int* g_count)

{

   unsigned int tid = blockIdx.x*512 + threadIdx.x;

   if (tid < 1712304)

    {

     *g_count = *g_count + 1;

    }

  __syncthreads();

}

You’re having multiple threads read and write the same memory location.

This won’t work… the writes are NOT serialized. The only promise you get is “at least one write will succeed”.

If you want each thread to increment a value, you want AtomicInc… that’s useful exactly for these cases. It’s slow, since it serializes, but for debugging it can be useful to count things like in your example code.

One other thing, you’ve hardwired your thread-per-block size of 512 into the function. That may be correct now but it’s going to bite you once you change the kernel invocation. It’s safer to use

const unsigned int tid=threadIdx.x+blockDim.x*blockIdx.x;

Pittsburgh · June 29, 2008, 1:55am

Thanks SPWorley.

It seems I cannot use the AtomicInc function because I have an 8800 GTS and the compute capability is only 1.0. Is there an easier way around this problem other than coding your own semaphore?

seibert · June 29, 2008, 2:38am

The best way (not the easiest way, though) is to try to formulate the algorithm in a way to avoid the need for global atomic writes entirely. That’s not always possible, but a software semaphore is likely to be slow and/or tricky to get right when the hardware doesn’t support atomic operations.

The contention-free version of an algorithm might appear to be less efficient (redundant reads, multiple passes), but the massive parallelism of the hardware will help hide that.

SPWorley · June 29, 2008, 3:37am

You can do the same thing without atomics with a little more hassle… look at the reduction SDK sample. That uses log2 N multiple passes to boil down the results from the threads down to one distilled number. It’s overkill for just a simple increment count, but it’ll work on all hardware and isn’t THAT bad. It’s actually FASTER than atomics when all the threads need to contribute and the threads don’t need to get feedback about “which” number they were.

But yeah, it’s a lot more complex than the simple one line increment. Luckily you have code examples, and it’s an important, common, and useful technique to know anyway.

Topic		Replies	Views
Allocate Memory within Device? is it possible? CUDA Programming and Performance	5	3642	November 24, 2008
Allocating space in global memory from device CUDA Programming and Performance	1	1017	May 10, 2009
what kind of memory does cudaMalloc malloc ? CUDA Programming and Performance	2	3077	August 9, 2007
cuMemAlloc questions CUDA Programming and Performance	1	2430	January 29, 2010
cudamalloc not allocating memeory CUDA Programming and Performance	0	1288	May 1, 2012
question about memory allocation CUDA Programming and Performance	1	1632	October 16, 2007
dynamically global memory allocation in __global__ or __device__ function? CUDA Programming and Performance	2	6323	November 17, 2009
Differences between cudaMalloc() and cuMemAlloc() CUDA Programming and Performance	0	2701	June 17, 2009
Question Dynamic Memory Allocation in the kernel function CUDA Programming and Performance	2	3665	November 30, 2009
cudaMemcpy / cudaMalloc CUDA Programming and Performance	1	4762	April 16, 2008

cuMemAlloc() How to use in __device__

Related topics

cuMemAlloc() How to use in device