I have a real problem with this error: “Formal parameter space overflowed in function …”.
I have a structure of pointers wich I want to pass to the device and the program fails to compile with the above error. the excerpt of the code is like this:
//declaring 27 pointers by a struct. each pointer has at least 6500 floats.
typedef struct{
float* f[27];
}distpointer;
global void kernel(distpointer in_dist, distpointer out_dist)
{
…
//id = fanction of threadid.x , …
for(k=0;K,27;k++){
//do some calculations with (in_dist.f[k])[id] and (out_dist.f[k])[id]
}
…
}
void main()
{
…
for(k=0;K<27;k++){
//allocate and copy in_dist.f[k] to device memory
//allocate and copy out_dist.f[k] to device memory
}
kernel<<<grid, block>>>(in_dist, out_dist)
…
}
The point is when I reduce the number of pointers to 9, the program runs perfectly!!
I’d be so grateful if any one helps me with this.
27sizeof(float)*2 = 216 bytes… 256 bytes are permitted. So…
Maybe declare the structure packed (i.e. no fill bytes between struct members for alignment). That may get their size down.
Print sizeof() struct for comparison before and after.
You may have to figure out how to declare struct packing (and alignment) by inspecting some of the existing CUDA header files. I think they declare this for the builtin vector types as well.
There is a limit of 256 bytes for kernel arguments, and your structures require 54 * sizeof(void *) values combined. On 64 bit platforms, that means 432 bytes for the arguments, which is over the limit and probably explains the error you are seeing. Your two choices are to pass the arguments by pointer rather than by value, and for that you will need to construct the structures in device global memory rather than host memory, or use a constant memory symbol for each structure instead.
Ok and thanks alot, My system was 64bit. I run it on a 32 bit systam and it compiles with no error but this advisory where ever I use my pointers inside the kernel:
“Advisory: cannot tell what pointer points to, assuming global memory space”
The program runs. The first 500 iterations are quite fast and then becomes vey slow(but not becoming slower). and the outputs are terriblly wrong!!
typedef struct{
float* f[27];
}distpointer;
__global__ void kernel(distpointer *in_dist, distpointer *out_dist)
{
...
//id = fanction of threadid.x , ...
distpointer in = *in_dist;
distpointer out = *out_dist;
for(k=0;K,27;k++){
//do some calculations with (in.f[k])[id] and (out.f[k])[id] as "float" values
}
...
}
void main()
{
...
for(k=0;K<27;k++){
//allocate and copy in_dist.f[k] to device memory
//allocate and copy out_dist.f[k] to device memory
}
kernel<<<grid, block>>>(&in_dist, &out_dist)
..
}
the code runs with no error but rediculously very fast and cannot copy back the valuse to host with an error while copying back. Should I cunstruct a structure in device, so how should I do that. you know I’m not such a professional one in programming. would you make it more clear? External Media
I am pretty sure there are errors, you just are not checking for them. The pointers you are passing are host pointers, which are invalid in kernels. If you call cudaGetLastError after the kernel launch I think you will see “invalid device pointer” errors returned and find that the kernel is not running at all.
Yes. Do as you are doing know to make the structure in host memory, then allocate a new structure in device memory, copy the host structure to the device structure and then pass the device structure pointer to your kernel. Something like this might work
distpointer in_dist, * in_dist_, out_dist, * out_dist_;
for(k=0;K<27;k++){
//allocate and copy in_dist.f[k] to device memory
//allocate and copy out_dist.f[k] to device memory
}
cudaMalloc((void **)&in_dist_, sizeof(distpointer));
cudaMalloc((void **)&out_dist_, sizeof(distpointer));
cudaMemcpy(in_dist_, &in_dist, sizeof(distpointer), cudaMemcpyHostToDevice);
cudaMemcpy(out_dist_, &out_dist, sizeof(distpointer), cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(in_dist_, out_dist_)
Thank both of you guys for being so patient and helpful! you two have already tought me much about pointers today!! :rolleyes:
But can I ask you one more question? why does my program running fast on first 500 iterations and then becomes slow?
I only use 250Mb of my 4Gig Tesla C1060!
Well the first step is always getting it to work, and the second it to make it faster. There might be padding/alignment tricks in the structures to improve things. Your code now has to perform two levels of global memory pointer dereferencing to get to data instead of one. In the original reply I posted to you, I suggested using a constant memory symbol to hold the address of the structure, rather than passing it by reference from global memory. You might also be able to have one thread per block do the read of the structure and populate shared memory with the individual pointer addresses (which is essentially what passing by value does anyway).