Formal parameter space overflowed kernel launch error

Hi

I have a real problem with this error: “Formal parameter space overflowed in function …”.
I have a structure of pointers wich I want to pass to the device and the program fails to compile with the above error. the excerpt of the code is like this:

//declaring 27 pointers by a struct. each pointer has at least 6500 floats.
typedef struct{
float* f[27];
}distpointer;

global void kernel(distpointer in_dist, distpointer out_dist)
{

//id = fanction of threadid.x , …
for(k=0;K,27;k++){
//do some calculations with (in_dist.f[k])[id] and (out_dist.f[k])[id]
}

}

void main()
{

for(k=0;K<27;k++){
//allocate and copy in_dist.f[k] to device memory
//allocate and copy out_dist.f[k] to device memory
}

 kernel<<<grid, block>>>(in_dist, out_dist)


}

The point is when I reduce the number of pointers to 9, the program runs perfectly!!
I’d be so grateful if any one helps me with this.

27sizeof(float)*2 = 216 bytes… 256 bytes are permitted. So…

Maybe declare the structure packed (i.e. no fill bytes between struct members for alignment). That may get their size down.
Print sizeof() struct for comparison before and after.

You may have to figure out how to declare struct packing (and alignment) by inspecting some of the existing CUDA header files. I think they declare this for the builtin vector types as well.

Christian

There is a limit of 256 bytes for kernel arguments, and your structures require 54 * sizeof(void *) values combined. On 64 bit platforms, that means 432 bytes for the arguments, which is over the limit and probably explains the error you are seeing. Your two choices are to pass the arguments by pointer rather than by value, and for that you will need to construct the structures in device global memory rather than host memory, or use a constant memory symbol for each structure instead.

Good point! I still have a 32 bit world view here.

Ok and thanks alot, My system was 64bit. I run it on a 32 bit systam and it compiles with no error but this advisory where ever I use my pointers inside the kernel:
“Advisory: cannot tell what pointer points to, assuming global memory space”
The program runs. The first 500 iterations are quite fast and then becomes vey slow(but not becoming slower). and the outputs are terriblly wrong!!

What do you exactly mean by passing values instead of pointers. I suppose I’m passing pointers right now!!

The arguments to your kernel function are values, not pointers.

This is passing by value:

__global__ void kernel(distpointer in_dist, distpointer out_dist)

This is passing by pointer:

__global__ void kernel(distpointer *in_dist, distpointer *out_dist)

In one case the argument list is 432 bytes long, in the other it is 16 bytes long. A subtle but important difference…

To be more precise, it’s an array of pointers embedded in a struct type - which is being passed by value. ;)

Well, well, well…

I got the point; array of pointers!!! External Media

So, Is there any furthere changes needed to add in the kernel?

And by the way, does the heavy arguments affect the speed of execution (I mean when they are around 200 byte)?

Ok, I changed the code as below:

typedef struct{

 float* f[27];

}distpointer;

__global__ void kernel(distpointer *in_dist, distpointer *out_dist)

{

	 ...

	 //id = fanction of threadid.x , ...

	 distpointer in   = *in_dist;

	 distpointer out = *out_dist;

	 for(k=0;K,27;k++){

		   //do some calculations with (in.f[k])[id] and (out.f[k])[id] as "float" values

	 }

	 ...

}

void main()

{

...

	 for(k=0;K<27;k++){

			//allocate and copy in_dist.f[k] to device memory

			//allocate and copy out_dist.f[k] to device memory

	 }

	 kernel<<<grid, block>>>(&in_dist, &out_dist)

..

}

the code runs with no error but rediculously very fast and cannot copy back the valuse to host with an error while copying back. Should I cunstruct a structure in device, so how should I do that. you know I’m not such a professional one in programming. would you make it more clear? External Media

I am pretty sure there are errors, you just are not checking for them. The pointers you are passing are host pointers, which are invalid in kernels. If you call cudaGetLastError after the kernel launch I think you will see “invalid device pointer” errors returned and find that the kernel is not running at all.

Yes. Do as you are doing know to make the structure in host memory, then allocate a new structure in device memory, copy the host structure to the device structure and then pass the device structure pointer to your kernel. Something like this might work

distpointer in_dist, * in_dist_, out_dist, * out_dist_;

	 for(k=0;K<27;k++){

			//allocate and copy in_dist.f[k] to device memory

			//allocate and copy out_dist.f[k] to device memory

	 }

	 cudaMalloc((void **)&in_dist_, sizeof(distpointer));

	 cudaMalloc((void **)&out_dist_, sizeof(distpointer));

	 cudaMemcpy(in_dist_, &in_dist, sizeof(distpointer), cudaMemcpyHostToDevice);

	 cudaMemcpy(out_dist_, &out_dist, sizeof(distpointer), cudaMemcpyHostToDevice);

	 

	 kernel<<<grid, block>>>(in_dist_, out_dist_)

Thanks alot. I’ll check it :mellow:

well, now in the kernel, How should I convert the pointer of structure to floats. I’m doing like this now:

__global__ void kernel(distpointer *in_dist, distpointer *out_dist)

{

	 ...

	 //id = fanction of threadid.x , ...

-->	 distpointer in   = *in_dist;

-->	 distpointer out = *out_dist;

	 for(k=0;K,27;k++){

		   //do some calculations with (in.f[k])[id] and (out.f[k])[id] as "float" values. for example:

						float f_local_0 = (in.f[0])[id];

	 }

	 ...

}

And the compiler returns the following error:

“no suitable user-defined conversion from “distPointer” to “distPointer” exists”.

How can I use these structure pointers in the kernel?

use in_dist->f[k] and drop “in”.

Or rename in_dist to in and use in->f[k] which is more concise.

And learn about pointers…

OOOOOOOK, Its workin!!! External Image

Thank both of you guys for being so patient and helpful! you two have already tought me much about pointers today!! :rolleyes:

But can I ask you one more question? why does my program running fast on first 500 iterations and then becomes slow?
I only use 250Mb of my 4Gig Tesla C1060!

Kernel launches are asynchronous

If you time with

t1 = wall_clock()

kernel<<<,,,>>>()

t2 = wall_clock()

t2 - t1 will be approximately 0.

If you put many kernel calls in between time measurements, eventually the queue will fill up and you will start timing the kernel launches.

To properly time cuda kernels with wall clock measurements, you must call cudaThreadSyncrhonize() before every wall clock time reading.

You know, the program becomes very slow on 32bit system compared to when I run it with arguments as values.

Well the first step is always getting it to work, and the second it to make it faster. There might be padding/alignment tricks in the structures to improve things. Your code now has to perform two levels of global memory pointer dereferencing to get to data instead of one. In the original reply I posted to you, I suggested using a constant memory symbol to hold the address of the structure, rather than passing it by reference from global memory. You might also be able to have one thread per block do the read of the structure and populate shared memory with the individual pointer addresses (which is essentially what passing by value does anyway).

Try a few ideas out and see which one is fastest.

Thanks avidday! I’ll try it and tell you the outcome.