cudaMemcpyToSymbol returns "invalid device symbol"

Dear Forum.

I have the following piece of code:

__device__ int d_nx;



int main(){



 int nx = 256;

 cudaMemcpyToSymbol("d_nx", &nx, sizeof(int), 0, cudaMemcpyHostToDevice);

 return 0;
}




When I compile my code with -arch sm_13 (to enable doubles) I receive “invalid device symbol” from cudaMemcpyToSymbol. If I change all the doubles in my code to floats (and cufftDoubleComplex to cufftComplex) and compile without -arch sm_13, it works!

I can’t understand what is wrong here. Would appreciate help.



I had this problem when I declared a constant in one file and then declared it as extern in another. When I called cudaMemcpyToSymbol from the second file it gave me that error, but when I shifted the declaration into the file making the call, it worked… quite weird.

Definitely not my case. All my code is in one .cu file. Thanks though…



Why do you use cudaMemcpyToSymbol instead of cudaMemcpy?

I believe cudaMemcpyToSymbol is there to send data to the GPU’s constant memory.

I’ve never tried it but…

cudaMemcpy(&d_nx, &nx, sizeof(int), cudaMemcpyHostToDevice);

Don’t copy-paste it, as I wrote it without checking.

The other thing is, do you have any namespaces? If so, you need to add the namespace to your string, e.g. “nameSpace::d_nx”.

Good luck,


Indeed you did, because it won’t work. What the original poster is doing is perfectly correct, which means the actual code he is having problems with is different to what he posted.

A slightly fleshed out version:

#include <stdio.h>

__device__ int d_nx;

__global__ void kernel(void)
{
    printf("%d\n", d_nx);
}

int main(void)
{
    int nx = 256;

    cudaError_t stat = cudaMemcpyToSymbol("d_nx", &nx, sizeof(int), 0, cudaMemcpyHostToDevice);

    printf("%s\n", cudaGetErrorString(stat));

    kernel<<<1,1>>>();

    printf("%s\n", cudaGetErrorString(cudaPeekAtLastError()));

    return cudaThreadExit();
}


will work flawlessly:

avidday@cuda:~$ nvcc -arch=sm_20 -o cudasyms

avidday@cuda:~$ ./cudasyms 

no error

no error


Hey avidday. Thanks for reply.

First – you’re right. What Greg suggested will not work, for a quite obvious reason: &d_nx is not available on the host, since the device memory space is completely separate. Am I right?
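For the record, here is a sketch (untested here, since it needs a CUDA-capable GPU) of two approaches that do work for writing to a device symbol — passing the symbol itself, which recent CUDA releases require anyway, or looking up its device address first and using plain cudaMemcpy:

```cuda
__device__ int d_nx;

int main(void)
{
    int nx = 256;

    // Option 1: pass the symbol itself and let the runtime resolve it.
    // (Older CUDA versions also accepted the string form "d_nx".)
    cudaMemcpyToSymbol(d_nx, &nx, sizeof(int), 0, cudaMemcpyHostToDevice);

    // Option 2: fetch the symbol's device address, then use an ordinary
    // cudaMemcpy to that address.
    void *addr = NULL;
    cudaGetSymbolAddress(&addr, d_nx);
    cudaMemcpy(addr, &nx, sizeof(int), cudaMemcpyHostToDevice);

    return 0;
}
```

Either way, the host never takes &d_nx directly — the runtime translates the symbol into a device address for you.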

Second – I take the code that you provided above as a minimal code to reproduce my problem. Without modifications. Pure C-c C-v. Compiled it and got:

<eugene@pde:cufft> nvcc -arch=sm_20 ./ -o test

<eugene@pde:cufft> ./test 

invalid device symbol

invalid device function

Just for the record. I have the following:

/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce 9400M"

  CUDA Driver Version:                           3.20

  CUDA Runtime Version:                          3.20

  CUDA Capability Major/Minor version number:    1.1

  Total amount of global memory:                 265945088 bytes

  Multiprocessors x Cores/MP = Cores:            2 (MP) x 8 (Cores/MP) = 16 (Cores)

  Total amount of constant memory:               65536 bytes

  Total amount of shared memory per block:       16384 bytes

  Total number of registers available per block: 8192

  Warp size:                                     32

  Maximum number of threads per block:           512

  Maximum sizes of each dimension of a block:    512 x 512 x 64

  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

  Maximum memory pitch:                          2147483647 bytes

  Texture alignment:                             256 bytes

  Clock rate:                                    1.10 GHz

  Concurrent copy and execution:                 No

  Run time limit on kernels:                     Yes

  Integrated:                                    Yes

  Support host page-locked memory mapping:       Yes

  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

  Concurrent kernel execution:                   No

  Device has ECC support enabled:                No

  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce 9400M

my OS= mac OSX 10.6.6

Any suggestions?



What happens if you compile with -arch=sm_11 to compile for the GPU you have?

OK, so your problem is a GPU compatibility issue. You can’t run compute 1.3 code on your GPU – it is a compute 1.1 device. The symbol error isn’t coming from your code; it is coming internally from the CUDA runtime library during context establishment.
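One way to sidestep this class of mismatch (a hypothetical build line — the file names are illustrative) is to embed code for several architectures in one fat binary, so the runtime can pick the version matching whatever GPU it finds:

```shell
# Build a fat binary containing code for both compute 1.1 and compute 2.0
# devices (file names are illustrative, adjust to your project).
nvcc -gencode arch=compute_11,code=sm_11 \
     -gencode arch=compute_20,code=sm_20 \
     -o app app.cu
```

Of course this only helps for features both architectures support — double precision still won’t exist in the sm_11 version.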

If I compile my code with -arch=sm_11 I get:

ptxas /tmp/tmpxft_0000505e_00000000-2_cufft_test_double.ptx, line 80; warning : Double is not supported. Demoting to float

And it seems to work – besides the fact that I get single precision instead of double, of course.



Thanks, avidday.

Does it mean that on my device I cannot use double arithmetic?

I do coding on my laptop but once I’m done I will run this code on Tesla S2050 devices. I just did deviceQuery on the machine that has it and I see:

CUDA Capability Major/Minor version number:    2.0

Thank you,


That is correct. No double precision on your GPU. If you want to write code that will compile and run on both (obviously, within the hardware limits), consider using templates. As an example of what I have open in vi right now:

template<typename Real>

__global__ void elementquad2D4(Real *egeom, Real *elks, Real *elms, const unsigned int N)
{
	volatile unsigned int tidx = threadIdx.x +  blockIdx.x * blockDim.x;

	if (tidx >= N) return;

	Real fun[4], lder[8], gder[8], gdert[8], jac[4], jacin[4], pd[8], ftf[16], dtpd[16];

	Real elk[16], elm[16];

	Real * geom = &egeom[tidx*8];

	Real * abss = reinterpret_cast<Real *>(eabss);

	Real * wght = reinterpret_cast<Real *>(ewght);

	Real * funs  = reinterpret_cast<Real *>(efuns);

	Real * lders = reinterpret_cast<Real *>(elders);

	Real * p = reinterpret_cast<Real *>(ep);

	/* ... remainder of the kernel elided ... */
}


which has the floating point type as a template argument. You can then instantiate either or both single and double precision versions, depending on what you or your hardware requires.

Thank you.

Topic is closed.

Yeah – I have a 17" Macbook Pro, and it has two graphics cards: a “GeForce 9600M GT” (primary, powerhouse) and a “GeForce 9400M” (backup, low-power option). From the control panel I can choose to use the more powerful graphics card and use more energy, or use the lower-performance card and use less power.

Whenever I use the high-performance one, I can use CUDA. With the lower one, my CUDA programs fail. I think it’s because it either isn’t set up correctly or isn’t CUDA-capable. If it’s not CUDA-capable, that would explain why yours doesn’t work.

Hope this helps!