cuda.h error message

I am porting a C++ project to the GPU to get some speed-up in a few bottlenecks. I have followed the paradigm set out by NVIDIA of compiling all the .cpp files with the host compiler and then calling the kernel from one of these .cpp files through a wrapper written in a .cu file, which cudaMallocs, cudaMemcpys, and then launches a kernel. The first problem: when run in emulation mode, the kernel and all the software run fine. When I run in device mode, the program gets to my first cudaMalloc call and just hangs. I read that the .cpp file which calls the CUDA wrapper needs the cuda.h header. When I include this I get:

In file included from /home/lamberh/NVIDIA_GPU_Computing_SDK/C/src/exciton09/source/ATOM_SCF1.cpp:26:
/usr/local/cuda/include/cuda.h:547: error: expected ‘,’ or ‘…’ before ‘(’ token
In file included from /home/lamberh/NVIDIA_GPU_Computing_SDK/C/src/exciton09/source/ATOM_SCF1.cpp:26:
/usr/local/cuda/include/cuda.h:697: error: expected ‘,’ or ‘…’ before ‘(’ token

which, in cuda.h, corresponds to:
CUresult CUDAAPI cuDeviceGetAttribute(int *pi, CUdevice_attribute attrib, CUdevice dev);

I’m running CUDA 2.3 on Fedora 11 with gcc 4.4 and the FindCUDA.cmake script.
I’ve attached the .cu file which defines the CUDA wrapper and is called by my .cpp file.

Thanks.

To save a bit of time trying to grok that: the kernel is meant to be very basic. A number is defined on the host, passed to the GPU, assigned to a different variable, and passed back to the host.

There are at least two glaring syntax errors in that file which will prevent it from compiling. When I fix them, it builds fine, which makes me think that the error is somewhere in one of your own include files. Certainly there are no errors in cuda.h, if that is what you are implying.

avid@cuda:~$ /opt/cuda/bin/nvcc -arch=sm_13 -c -I/opt/cuda/include -I$HOME/NVIDIA_GPU_Computing_SDK/C/common/inc -o

avid@cuda:~$ cat

#include <stdio.h>
#include <cutil.h>
#include "cuda_runtime_api.h"
#include "cuda.h"

__global__ void integrals_2e_kernel(double* d_number, double* new_number){
	*new_number = *d_number;
	//printf("new number %f ", *new_number);
}

extern "C" void kernel_call(){
	int deviceCount;
	int dev;
	cudaGetDeviceCount(&deviceCount);
	printf("There are %d devices supporting CUDA", deviceCount);
	for(dev = 0; dev < deviceCount; dev++){
		cudaDeviceProp deviceProp;
		cudaGetDeviceProperties(&deviceProp, dev);
		printf("\nDevice %d: \"%s\"\n", dev, deviceProp.name);
	}

	double number = 4.0;
	size_t size = sizeof(double);
	double* h_number = &number;
	double* d_number;
	double* new_number;
	dim3 dimBlock(1);
	dim3 dimGrid(1);

	CUDA_SAFE_CALL(cudaMalloc((void **)&d_number, size));
	CUDA_SAFE_CALL(cudaMalloc((void **)&new_number, size));
	CUDA_SAFE_CALL(cudaMemcpy(d_number, h_number, size, cudaMemcpyHostToDevice));

	printf("Executing GPU kernel...\n");
	integrals_2e_kernel<<<dimGrid, dimBlock>>>(d_number, new_number);
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	CUDA_SAFE_CALL(cudaMemcpy(h_number, new_number, size, cudaMemcpyDeviceToHost));

	printf("new number %f\n", *h_number);

	CUDA_SAFE_CALL(cudaFree(d_number));
	CUDA_SAFE_CALL(cudaFree(new_number));
}


Sorry, I didn’t mean to imply that there was an error in the cuda.h file.

I have attached a version of the .cu file which I compiled and successfully ran as a standalone file in device mode using the CUDA SDK makefile. It is when I try to integrate this .cu file with the rest of my C++ source that I run into trouble. If I run in device emulation mode, calling from a separate .cpp file, everything goes fine. But when I run in device mode, the program executes until it reaches the first cudaMalloc call and then hangs indefinitely with the CPU spinning at 100%. Also, when I try to include cuda.h in the .cpp file, in either mode I get the aforementioned error. I’m trying to understand why it runs in emulation mode but hangs at the first cudaMalloc call in device mode.

Just as a little more information, I wondered if the problem could stem from the linking stage of my build:

/usr/lib64/ccache/g++ CMakeFiles/exciton09.dir/source/SCF1.cpp.o CMakeFiles/exciton09.dir/source/INTEGRALS1.cpp.o CMakeFiles/exciton09.dir/source/KPOINTS1.cpp.o CMakeFiles/exciton09.dir/source/INPUT_JOB_CTRL.cpp.o CMakeFiles/exciton09.dir/source/MATRIX_UTIL.cpp.o CMakeFiles/exciton09.dir/source/PAIRS_QUADS.cpp.o CMakeFiles/exciton09.dir/source/TOOLS.cpp.o CMakeFiles/exciton09.dir/source/HEADER.cpp.o CMakeFiles/exciton09.dir/source/MAIN1.cpp.o CMakeFiles/exciton09.dir/source/SYMMETRY.cpp.o CMakeFiles/exciton09.dir/source/MEMORY.cpp.o CMakeFiles/exciton09.dir/source/ATOM_SCF1.cpp.o ./ -o bin/exciton09 -rdynamic /usr/local/cuda/lib64/ -lcuda lib/ /usr/lib64/ /usr/lib64/ -lpthread /usr/local/cuda/lib64/ -lcuda -Wl,-rpath,/usr/local/cuda/lib64:/home/lamberh/NVIDIA_GPU_Computing_SDK/C/src/exciton09/build/lib:

Does the fact that g++ isn’t getting passed -fPIC seem very wrong and possibly responsible for the cudaMalloc hanging?

I really appreciate the help.

You are describing two problems - under some circumstances your code doesn’t compile, and under others the resulting program hangs. With CUDA 2.3 on Ubuntu 9.04 I can reproduce neither. As best I can tell, my “fixed” version of the code you originally posted compiles and executes perfectly when linked with a C++ main in another file:

avid@cuda:~$ /opt/cuda/bin/nvcc -arch=sm_13 -c -I/opt/cuda/include -I$HOME/NVIDIA_GPU_Computing_SDK/C/common/inc -o 

avid@cuda:~$ g++ -L/opt/cuda/lib64 -L$HOME/NVIDIA_GPU_Computing_SDK/C/lib -o atoms2.exe -lcutil -lcudart

avid@cuda:~$ LD_LIBRARY_PATH=/opt/cuda/lib64 ./atoms2.exe

There are 2 devices supporting CUDA

Device 0: "GeForce GTX 275"

Device 1: "GeForce GTX 275"

Executing GPU kernel...

new number 4.000000

avid@cuda:~$ cat 

extern "C" {

	void kernel_call();


int main()



	return 0;


And your second posted code also builds and runs perfectly as you noted:

avid@cuda:~$ /opt/cuda/bin/nvcc -arch=sm_13 -I/opt/cuda/include -I$HOME/NVIDIA_GPU_Computing_SDK/C/common/inc -L/opt/cuda/lib64 -L$HOME/NVIDIA_GPU_Computing_SDK/C/lib -o atoms.exe -lcutil -lcudart

avid@cuda:~$ LD_LIBRARY_PATH=/opt/cuda/lib64 ./atoms.exe

There are 2 devices supporting CUDA

Device 0: "GeForce GTX 275"

Device 1: "GeForce GTX 275"

Executing GPU kernel...

new number 5.000000

avid@cuda:~$ uname -a

Linux cuda 2.6.28-15-generic #52-Ubuntu SMP Wed Sep 9 10:48:52 UTC 2009 x86_64 GNU/Linux

Right. My original posts were a little ambiguous. I should identify my major problem as this:

My .cu code runs fine as a standalone, and when integrated with my .cpp files it runs fine in device emulation mode, as we’ve noted. It is when I try to run in device mode that it hangs at the first cudaMalloc call. I am unsure of where to begin debugging this hanging call to cudaMalloc. My suspicion is that there is either a problem with the code itself, with the way I’m linking my files at compile time, or that this is related to the unsupported gcc 4.4.2/Fedora 11 combination, in which case I’ll need either a workaround or to step back to F10.

Are those fair conclusions, or is it asking a bit much to expect people without access to my setup to speculate?


Try building an executable using the two source files and commands I posted and see whether it works as a first step. If it does, then it is either your code, or your build procedure, and you can eliminate your tool chain as a variable.

Great. I followed your advice to eliminate the tool chain as a variable by successfully compiling and linking that small kernel and main program. Then I went through and double-checked my include files so that the CUDA wrapper function was declared extern "C". Finally, I brought the kernel_call out of the .cpp subroutine I had it in and put it front and center in the main file. This allowed me to compile and run successfully. Then I moved this CUDA wrapper further down in the main file and got the same hanging cudaMalloc problem at device run time. After some tinkering I noticed that if I called my kernel before this statement, which opens a file defined on the command line for data output: file.out = fopen(strcat(argv[2], yy), "w"); everything ran fine. But if I called the kernel after that fopen, that's when cudaMalloc hung.

I stripped out the
file.out = fopen(argv[2], "w");

and now I can call that wrapper/kernel invocation from anywhere in my code. I was wondering what might cause that behaviour?

Anyways, thanks very much for the help I was stuck there.

No surprises there. strcat will blindly append onto the end of the argv string. Who knows how much space has been reserved for argv and what lies after it in the process memory map (even if there is theoretically space, if there are no trailing \0 characters it can easily run away). That code snippet is the very definition of a buffer overflow. No doubt something critical to the CUDA context is getting hosed by the strcat.

Thanks, you guys are on the ball.