__device__ int d_nx;
...
...
int main()
{
    ...
    ...
    int nx = 256;
    cudaMemcpyToSymbol("d_nx", &nx, sizeof(int), 0, cudaMemcpyHostToDevice);
    ...
    ...
}
When I compile my code with -arch sm_13 (to enable doubles), I receive “invalid device symbol” from cudaMemcpyToSymbol. If I change all the doubles in my code to floats (and cufftDoubleComplex to cufftComplex) and compile without -arch sm_13, it works!
I can’t understand what is wrong here. I would appreciate some help.
I had this problem when I declared a constant in m.cu and then declared it as extern in main.cu. When I called cudaMemcpyToSymbol in main.cu it gave me this error, but when I shifted the declaration into main.cu it worked… quite weird.
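Roughly like this, if it helps (a from-memory sketch, not my actual code, and the names are made up):
// m.cu – where I originally declared the constant
__constant__ int c_val;

__global__ void use_val(int *out)
{
    *out = c_val;
}

// main.cu – the extern declaration plus the copy
#include <stdio.h>

extern __constant__ int c_val;

int main(void)
{
    int val = 42;
    // with the declaration living in m.cu, this call returned “invalid device symbol”;
    // moving the __constant__ declaration into this file made it work
    cudaError_t stat = cudaMemcpyToSymbol("c_val", &val, sizeof(int));
    printf("%s\n", cudaGetErrorString(stat));
    return 0;
}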
Indeed you did, because it won’t work. What the original poster is doing is perfectly correct, which means the actual code he is having problems with is different from what he posted.
A slightly fleshed out version:
#include <stdio.h>

__device__ int d_nx;

// print the value of the device symbol from device code
__global__ void kernel(void)
{
    printf("%d\n", d_nx);
}

int main(void)
{
    int nx = 256;

    // copy the host value to the device symbol and check for errors
    cudaError_t stat = cudaMemcpyToSymbol("d_nx", &nx, sizeof(int), 0, cudaMemcpyHostToDevice);
    printf("%s\n", cudaGetErrorString(stat));

    kernel<<<1,1>>>();
    printf("%s\n", cudaGetErrorString(cudaPeekAtLastError()));

    return cudaThreadExit();
}
will work flawlessly:
avidday@cuda:~$ nvcc -arch=sm_20 cudasyms.cu -o cudasyms
avidday@cuda:~$ ./cudasyms
no error
no error
256
First – you’re right. What Greg mentioned will not work, for a fairly obvious reason: &d_nx is not available on the host, since the device memory space is completely separate. Am I right?
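(As far as I understand, the only way to get at the symbol’s address from the host is through the runtime, e.g. something like the sketch below using the string-based 3.2 API – please correct me if I have that wrong.)
#include <stdio.h>

__device__ int d_nx;

int main(void)
{
    void *dptr = 0;
    // ask the runtime for the device-side address of the symbol,
    // instead of taking &d_nx on the host
    cudaError_t stat = cudaGetSymbolAddress(&dptr, "d_nx");
    printf("%s %p\n", cudaGetErrorString(stat), dptr);
    return 0;
}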
Second – I took the code you provided above as a minimal reproducer for my problem, without modifications, pure C-c C-v. I compiled it and got:
<eugene@pde:cufft> nvcc -arch=sm_20 ./test.cu -o test
<eugene@pde:cufft> ./test
invalid device symbol
invalid device function
Just for the record, I have the following:
/Developer/GPU\ Computing/C/bin/darwin/release/deviceQuery
/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 9400M"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 265945088 bytes
Multiprocessors x Cores/MP = Cores: 2 (MP) x 8 (Cores/MP) = 16 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.10 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce 9400M
OK, so your problem is a GPU compatibility issue. You can’t run compute 1.3 code on your GPU - it is a compute 1.1 device. The symbol error isn’t coming from your code; it is coming internally from the CUDA runtime library during context establishment.
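If you want to catch this sort of thing in your own code, you can query the device’s compute capability at runtime before taking a code path that needs it – something along these lines (just a sketch):
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Compute capability %d.%d\n", prop.major, prop.minor);

    // double precision (and anything built with -arch=sm_13) needs compute >= 1.3
    if (prop.major == 1 && prop.minor < 3)
        printf("This device has no double precision support\n");

    return 0;
}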
That is correct – no double precision on your GPU. If you want to write code that will compile and run on both classes of hardware (within the obvious hardware limits), consider using templates. As an example of what I have open in vi right now:
template<typename Real>
__global__ void elementquad2D4(Real *egeom, Real *elks, Real *elms, const unsigned int N)
{
    volatile unsigned int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tidx >= N) return;

    Real fun[4], lder[8], gder[8], gdert[8], jac[4], jacin[4], pd[8], ftf[16], dtpd[16];
    Real elk[16], elm[16];

    Real * geom  = &egeom[tidx*8];
    Real * abss  = reinterpret_cast<Real *>(eabss);
    Real * wght  = reinterpret_cast<Real *>(ewght);
    Real * funs  = reinterpret_cast<Real *>(efuns);
    Real * lders = reinterpret_cast<Real *>(elders);
    Real * p     = reinterpret_cast<Real *>(ep);
    .....
which has the floating-point type as a template argument. You can then instantiate either or both of the single and double precision versions, depending on what you or your hardware require.
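Instantiating and launching then looks something like this (a deliberately simplified sketch, not the real kernel above):
template<typename Real>
__global__ void scale(Real *x, Real a, const unsigned int N)
{
    unsigned int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tidx < N) x[tidx] *= a;
}

// the precision is chosen at compile time through the template argument;
// the double version will only run on compute capability >= 1.3 hardware
void scale_single(float *x, float a, unsigned int N)
{
    scale<float><<< (N + 127) / 128, 128 >>>(x, a, N);
}

void scale_double(double *x, double a, unsigned int N)
{
    scale<double><<< (N + 127) / 128, 128 >>>(x, a, N);
}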
Yeah – I have a 17" MacBook Pro, and it has two graphics cards: a “GeForce 9600M GT” (the primary, high-performance option) and a “GeForce 9400M” (the backup, low-power option). From the control panel I can choose to use the more powerful graphics card and draw more power, or the lower-performance card and save energy.
Whenever I use the high-performance one, I can use CUDA. With the low-power one, my CUDA programs fail. I think that’s because it either isn’t set up correctly or isn’t CUDA-capable. If it’s not CUDA-capable, that would explain why yours doesn’t work.