I am trying to put an entire CFD simulation onto a C870 by declaring all the arrays used in the functions, which are declared as __device__, except for the kernel called from the host. In the kernel a number of arrays are declared, which are passed down and processed by a sequence of functions. When I compiled the code I got the following error message:

“entry function uses too much local memory”

I assume that when a function, either kernel or __device__, allocates memory without any location specified, the default location used is the small local memory rather than global memory, whereas if the same arrays are declared and created from the host, all that data defaults to global memory and it is then up to me to move frequently used data into shared memory.


  1. why do I get that error message?
  2. how can I create the arrays declared in the kernel and __device__ functions so that I don’t get the error message?
  3. should I simply allocate and deallocate all device memory from the host?

Yes, having large statically allocated arrays in your kernel functions is generally not a good idea (these will be put in local memory).

You should allocate the arrays using cudaMalloc in your host code, and then pass pointers to the kernel.
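That pattern looks roughly like the sketch below (the kernel, names, and sizes here are made up for illustration, not taken from the original code):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one element of a global-memory array.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1024;
    float* d_data = 0;

    // Allocate in global memory from the host...
    cudaMalloc((void**)&d_data, n * sizeof(float));

    // ...and pass the device pointer to the kernel as an ordinary parameter.
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

    cudaFree(d_data);
    return 0;
}
```

Each thread then indexes into global memory through the passed-in pointer instead of a per-thread local array.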

so when/can a __device__ variable be declared and used? (see new topic entitled __device__ variables)

A __device__ variable can be declared and used with the same rules as any global scope array in C: it should be declared at file scope and the size needs to be evaluated at compile time. Using cudaMalloc on the host instead allows you to dynamically allocate the correct amount of memory.

So regarding __device__ variables:

  1. they have to be declared as pointers, e.g. you cannot declare them as

type array[N];

but only as

type* array;

with cudaMalloc allocating the array from host and/or device code, with the size provided?

  2. you can declare as

type array[N];

but in order to read the contents of array, e.g. to printf it, you must use cudaGetSymbolAddress? How is that done?

  3. what advantages/disadvantages are there to using __device__ variables compared to the standard cudaMalloc and passing pointers to a kernel?

Thanks in advance.


Well, you can do either, I suppose. But with the *, you will have to cudaMalloc on the host and then copy the pointer value itself over to the device, making even more of a headache. It is easier just to allocate with cudaMalloc and pass the device pointer to the kernel as a parameter.
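To make the headache concrete, the __device__-pointer route would look something like this sketch (the names are invented for illustration; the extra cudaMemcpyToSymbol step is what the parameter-passing approach avoids):

```cuda
#include <cuda_runtime.h>

// A file-scope __device__ pointer, to be filled in from the host.
__device__ float* d_buf;

// Hypothetical kernel that uses the global pointer instead of a parameter.
__global__ void fill(void)
{
    d_buf[threadIdx.x] = 1.0f;
}

int main()
{
    float* p = 0;
    cudaMalloc((void**)&p, 256 * sizeof(float));

    // Extra step: copy the pointer value itself into the __device__ variable.
    cudaMemcpyToSymbol(d_buf, &p, sizeof(p));

    fill<<<1, 256>>>();
    cudaFree(p);
    return 0;
}
```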

Example of a __device__ array:

__device__ int d_array[2000];

__global__ void kernel()
{
    int a = d_array[threadIdx.x];

    // ...
}
The device array is in the GPU’s device memory. If you try to dereference the device pointer on the host you will segfault or run into other weird problems associated with reading random memory.

To copy the contents of the device array to the host, you have to use cudaGetSymbolAddress to get the device memory pointer and then cudaMemcpy from that device pointer. See the programming guide for the syntax (or search the forums). I’ve never done this before.
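Based on the runtime API, the copy described above might look roughly like this (untested sketch, reusing the d_array example from earlier):

```cuda
#include <cuda_runtime.h>

__device__ int d_array[2000];

int main()
{
    int h_array[2000];
    void* d_ptr = 0;

    // Get the device-memory address of the __device__ symbol...
    cudaGetSymbolAddress(&d_ptr, d_array);

    // ...then do an ordinary device-to-host copy from that pointer.
    cudaMemcpy(h_array, d_ptr, sizeof(h_array), cudaMemcpyDeviceToHost);

    return 0;
}
```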

As far as I’m concerned, there are no advantages to using a __device__ variable, and there are many disadvantages. First, they are global variables, and are therefore the root of all evil in OOP. Second, accessing them on the host requires more code and is more error prone than if you manage your own device pointers with cudaMalloc. Finally, __device__ arrays are statically sized, meaning you have to recompile your program whenever your problem size changes, like in old Fortran 77 software. Yuck!