What would stop a function from being called?

I have been experimenting with CUDA to see if it would be useful in a project; however, I have run into an impasse where my __global__ function is simply not being called in one of my programs. I was hoping that someone here would know what would stop a __global__ function from being called, and/or how to diagnose such a problem.


I am running on a Windows Vista laptop with the 2.0 beta version of CUDA. I managed to get most of the sample programs to work.

The code that has the problem is:

complexptr w;


	w.imagptr=wi_dev; // these variables are CUDA device pointers

	//-----------------------check for errors-----------------

	//-----------------------calling the kernels---------------------

	dim3 threadsize(block_size,block_size,block_size);

	dim3 dimGrid( (wsize0 + threadsize.x - 1) / threadsize.x,   // ceiling division: one extra
	              (wsize1 + threadsize.y - 1) / threadsize.y,   // block whenever the size is not
	              (wsize2 + threadsize.z - 1) / threadsize.z ); // a multiple of the block dimension

	dim3 pass2grid(1,1,1);

	dim3 pass2threadsize(dimGrid.x*dimGrid.y*dimGrid.z/2,1,1);

	//to store the output of the 1st pass

	complex* outbfer;

	CUDA_SAFE_CALL_NO_SYNC(cudaMalloc((void**) &outbfer,dimGrid.x*dimGrid.y*dimGrid.z*sizeof(complex)));

	complex* finaloutput;

	CUDA_SAFE_CALL_NO_SYNC(cudaMalloc((void**) &finaloutput,1*sizeof(complex)));


   sincreduce_3d<<<dimGrid,threadsize,threadsize.x*threadsize.y*threadsize.z*sizeof(complex)>>>(outbfer,w,  R, Bx, By, Bz, wsize0,wsize1,wsize2,dim3(rX[i],rY[j],rZ[k]) );

My __global__ function sincreduce_3d and the structures complex and complexptr are defined as follows:

struct complex
{
	float real;
	float imag;
};

struct complexptr
{
	float *realptr;
	float *imagptr;
};
//NOTE: this was taken from the NVIDIA file reduction_kernel.cu; this MUST be documented


//reduces an input complexptr to an output complex pointer, one for each box

__global__ void sincreduce_3d(complex* out, complexptr w, const float R, const float Bx, const float By, const float Bz, const int nx, const int ny, const int nz, dim3 pointanted)
{
	extern __shared__ complex buffers[];
	// ... (kernel body continues)

However, when I step through the code in VC++ 2005 Express Edition, the global function is not called and the output variables are not changed. This is in contrast to the other programs I have written in CUDA, where the debugger has worked.

First, you have to remember that kernel calls are asynchronous; you need to wait for them to complete, perhaps with a cudaThreadSynchronize() call.

Next, you can't pass host memory pointers to your device and have the device read or fill them; that memory lives on your CPU.
Instead you must pass DEVICE memory pointers to your device, call the kernel, then pull your results back from the device.

Look at the sample projects and see their use of cudaMemcpy to send info from host to device, then call a kernel, then use cudaMemcpy again to pull the results back.
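A minimal sketch of that host-side pattern (kernel and buffer names hypothetical, using the runtime API of this CUDA version):

```cpp
// Sketch only: assumes a trivial kernel "scaleKernel" and host buffers h_in/h_out of n floats.
float *d_data;
cudaMalloc((void**)&d_data, n * sizeof(float));                       // allocate DEVICE memory
cudaMemcpy(d_data, h_in, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device

scaleKernel<<<grid, block>>>(d_data, n);  // the launch itself is asynchronous
cudaThreadSynchronize();                  // wait for the kernel to finish

cudaMemcpy(h_out, d_data, n * sizeof(float), cudaMemcpyDeviceToHost); // device -> host
cudaFree(d_data);
```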

You may do best by starting with a working project like the Template or Simple* projects and modifying it one step at a time… you'll have an easier time getting started.

Each call to a kernel is followed by cudaThreadSynchronize(); as I understand it, this should ensure that the program runs sequentially. Also, I get the bug while running the program in emulation mode, which seems to always run sequentially.

All the arrays I send to the kernel are allocated using cudaMalloc; the only way I access them in the host program is via cudaMemcpy. The other pass-by-value variables (such as const float) are not allocated. (When I first started I made that mistake, but it did not stop the debugger from stepping through the kernel's code, nor did it stop the code in the kernel from running, which suggests that my current problem lies elsewhere.)

Edit: To be safe I have gone over the code; all pointers come from cudaMalloc.

I have been trying to do that, and I have been able to adapt a large amount of code. However, I am at a loss as to how to proceed when a function that should be called is just being skipped over for no apparent reason.

Simple things first:

  1. Are you sure your machine is configured correctly to run CUDA? E.g., deviceQuery actually agrees that there are devices present?

  2. Have you replaced your big kernel call with a very simple memset kernel or something like that, to see whether that gets called? There might be a typo in your kernel that is preventing it from actually writing memory.

  3. Are there any kernel launch errors?
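For point 3, a sketch of the check (kernel name hypothetical, using this CUDA version's runtime API):

```cpp
myKernel<<<grid, block>>>(args);
cudaError_t err = cudaGetLastError();  // catches launch/configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

cudaThreadSynchronize();               // wait, then check for execution errors
err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```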

I should have thought to look for errors after a kernel call. :argh: I guess I have been spoiled by exceptions.

Now that I have the code in place I get the following message:
“invalid configuration argument.”

It looks like CUDA does not like my grid size of (18,18,201) and block size of (4,4,4), though the documentation (A.1) seemed to suggest that grid sizes up to 65535 are allowed. Is that number incorrect, or improperly applied here?

Thank you for your help; I appreciate it.

Grids can only be 2D: (N, M, 1). The 65535 limit applies to the grid's x and y dimensions; gridDim.z must be 1, which is why a grid of (18, 18, 201) gives "invalid configuration argument". (Blocks, by contrast, may be 3D.)