Formal parameter space overflowed kernel launch error

Amin_S · April 26, 2010, 8:05am

Hi

I have a real problem with this error: “Formal parameter space overflowed in function …”.
I have a structure of pointers wich I want to pass to the device and the program fails to compile with the above error. the excerpt of the code is like this:

//declaring 27 pointers by a struct. each pointer has at least 6500 floats.
typedef struct{
float* f[27];
}distpointer;

global void kernel(distpointer in_dist, distpointer out_dist)
{
…
//id = fanction of threadid.x , …
for(k=0;K,27;k++){
//do some calculations with (in_dist.f[k])[id] and (out_dist.f[k])[id]
}
…
}

void main()
{
…
for(k=0;K<27;k++){
//allocate and copy in_dist.f[k] to device memory
//allocate and copy out_dist.f[k] to device memory
}

 kernel<<<grid, block>>>(in_dist, out_dist)

…
}

The point is when I reduce the number of pointers to 9, the program runs perfectly!!
I’d be so grateful if any one helps me with this.

cbuchner1 · April 26, 2010, 8:38am

27sizeof(float)*2 = 216 bytes… 256 bytes are permitted. So…

Maybe declare the structure packed (i.e. no fill bytes between struct members for alignment). That may get their size down.
Print sizeof() struct for comparison before and after.

You may have to figure out how to declare struct packing (and alignment) by inspecting some of the existing CUDA header files. I think they declare this for the builtin vector types as well.

Christian

avidday · April 26, 2010, 8:41am

There is a limit of 256 bytes for kernel arguments, and your structures require 54 * sizeof(void *) values combined. On 64 bit platforms, that means 432 bytes for the arguments, which is over the limit and probably explains the error you are seeing. Your two choices are to pass the arguments by pointer rather than by value, and for that you will need to construct the structures in device global memory rather than host memory, or use a constant memory symbol for each structure instead.

cbuchner1 · April 26, 2010, 8:45am

Good point! I still have a 32 bit world view here.

Amin_S · April 26, 2010, 9:49am

Ok and thanks alot, My system was 64bit. I run it on a 32 bit systam and it compiles with no error but this advisory where ever I use my pointers inside the kernel:
“Advisory: cannot tell what pointer points to, assuming global memory space”
The program runs. The first 500 iterations are quite fast and then becomes vey slow(but not becoming slower). and the outputs are terriblly wrong!!

Amin_S · April 29, 2010, 2:03pm

What do you exactly mean by passing values instead of pointers. I suppose I’m passing pointers right now!!

avidday · April 29, 2010, 2:19pm

The arguments to your kernel function are values, not pointers.

This is passing by value:

__global__ void kernel(distpointer in_dist, distpointer out_dist)

This is passing by pointer:

__global__ void kernel(distpointer *in_dist, distpointer *out_dist)

In one case the argument list is 432 bytes long, in the other it is 16 bytes long. A subtle but important difference…

cbuchner1 · April 29, 2010, 2:45pm

To be more precise, it’s an array of pointers embedded in a struct type - which is being passed by value. ;)

Amin_S · April 29, 2010, 3:31pm

Well, well, well…

I got the point; array of pointers!!! External Media

So, Is there any furthere changes needed to add in the kernel?

And by the way, does the heavy arguments affect the speed of execution (I mean when they are around 200 byte)?

Amin_S · April 30, 2010, 9:41am

The arguments to your kernel function are values, not pointers.

This is passing by value:
__global__ void kernel(distpointer in_dist, distpointer out_dist)
This is passing by pointer:
__global__ void kernel(distpointer *in_dist, distpointer *out_dist)
In one case the argument list is 432 bytes long, in the other it is 16 bytes long. A subtle but important difference…

Ok, I changed the code as below:

typedef struct{

 float* f[27];

}distpointer;

__global__ void kernel(distpointer *in_dist, distpointer *out_dist)

{

	 ...

	 //id = fanction of threadid.x , ...

	 distpointer in   = *in_dist;

	 distpointer out = *out_dist;

	 for(k=0;K,27;k++){

		   //do some calculations with (in.f[k])[id] and (out.f[k])[id] as "float" values

	 }

	 ...

}

void main()

{

...

	 for(k=0;K<27;k++){

			//allocate and copy in_dist.f[k] to device memory

			//allocate and copy out_dist.f[k] to device memory

	 }

	 kernel<<<grid, block>>>(&in_dist, &out_dist)

..

}

the code runs with no error but rediculously very fast and cannot copy back the valuse to host with an error while copying back. Should I cunstruct a structure in device, so how should I do that. you know I’m not such a professional one in programming. would you make it more clear? External Media

avidday · April 30, 2010, 10:41am

I am pretty sure there are errors, you just are not checking for them. The pointers you are passing are host pointers, which are invalid in kernels. If you call cudaGetLastError after the kernel launch I think you will see “invalid device pointer” errors returned and find that the kernel is not running at all.

Yes. Do as you are doing know to make the structure in host memory, then allocate a new structure in device memory, copy the host structure to the device structure and then pass the device structure pointer to your kernel. Something like this might work

distpointer in_dist, * in_dist_, out_dist, * out_dist_;

	 for(k=0;K<27;k++){

			//allocate and copy in_dist.f[k] to device memory

			//allocate and copy out_dist.f[k] to device memory

	 }

	 cudaMalloc((void **)&in_dist_, sizeof(distpointer));

	 cudaMalloc((void **)&out_dist_, sizeof(distpointer));

	 cudaMemcpy(in_dist_, &in_dist, sizeof(distpointer), cudaMemcpyHostToDevice);

	 cudaMemcpy(out_dist_, &out_dist, sizeof(distpointer), cudaMemcpyHostToDevice);

	 

	 kernel<<<grid, block>>>(in_dist_, out_dist_)

Amin_S · April 30, 2010, 11:57am

Thanks alot. I’ll check it :mellow:

Amin_S · April 30, 2010, 12:21pm

well, now in the kernel, How should I convert the pointer of structure to floats. I’m doing like this now:

__global__ void kernel(distpointer *in_dist, distpointer *out_dist)

{

	 ...

	 //id = fanction of threadid.x , ...

-->	 distpointer in   = *in_dist;

-->	 distpointer out = *out_dist;

	 for(k=0;K,27;k++){

		   //do some calculations with (in.f[k])[id] and (out.f[k])[id] as "float" values. for example:

						float f_local_0 = (in.f[0])[id];

	 }

	 ...

}

And the compiler returns the following error:

“no suitable user-defined conversion from “distPointer” to “distPointer” exists”.

How can I use these structure pointers in the kernel?

cbuchner1 · April 30, 2010, 12:23pm

use in_dist->f[k] and drop “in”.

Or rename in_dist to in and use in->f[k] which is more concise.

And learn about pointers…

Amin_S · April 30, 2010, 12:53pm

OOOOOOOK, Its workin!!! External Image

Thank both of you guys for being so patient and helpful! you two have already tought me much about pointers today!! :rolleyes:

But can I ask you one more question? why does my program running fast on first 500 iterations and then becomes slow?
I only use 250Mb of my 4Gig Tesla C1060!

MisterAnderson42 · April 30, 2010, 1:21pm

Kernel launches are asynchronous

If you time with

t1 = wall_clock()

kernel<<<,,,>>>()

t2 = wall_clock()

t2 - t1 will be approximately 0.

If you put many kernel calls in between time measurements, eventually the queue will fill up and you will start timing the kernel launches.

To properly time cuda kernels with wall clock measurements, you must call cudaThreadSyncrhonize() before every wall clock time reading.

Amin_S · May 1, 2010, 5:16am

You know, the program becomes very slow on 32bit system compared to when I run it with arguments as values.

avidday · May 1, 2010, 9:02am

Well the first step is always getting it to work, and the second it to make it faster. There might be padding/alignment tricks in the structures to improve things. Your code now has to perform two levels of global memory pointer dereferencing to get to data instead of one. In the original reply I posted to you, I suggested using a constant memory symbol to hold the address of the structure, rather than passing it by reference from global memory. You might also be able to have one thread per block do the read of the structure and populate shared memory with the individual pointer addresses (which is essentially what passing by value does anyway).

Try a few ideas out and see which one is fastest.

Amin_S · May 2, 2010, 10:35am

Thanks avidday! I’ll try it and tell you the outcome.

Topic		Replies	Views
Formal Parameter Space Overflowed error CUDA Programming and Performance	13	6133	October 14, 2011
Pointer as formal parameter in kernel call CUDA Programming and Performance	8	6374	March 13, 2009
Kernel Formal parameter space overflowed CUDA Programming and Performance	4	2580	July 31, 2009
Multiple array inputs to Kernel Formal parameter space overflowed in function Error CUDA Programming and Performance	1	1294	November 3, 2008
Pointers in formal parameter list? I don't understand... CUDA Programming and Performance	6	2685	November 17, 2008
Confusion regarding register use and occupancy calculator CUDA Programming and Performance	22	2935	June 18, 2010
Problem with kernel launch too many paremeters, trying to use a struct CUDA Programming and Performance	4	1085	July 11, 2011
Parameter passing bug in CUDA 2.0 x86_64 CUDA 2.0 compiler, parameter passing CUDA Programming and Performance	8	5127	January 27, 2009
Can a Kernel be too big?? CUDA_ERROR_NO_BINARY_FOR_GPU error 209 CUDA Programming and Performance	11	2980	November 13, 2017
Error: Formal parameter space overflowed Legacy PGI Compilers	3	8792	June 10, 2010

Formal parameter space overflowed kernel launch error

Related topics