[coding issue] global cbuffer/texture declarations exhaust constant memory

Hi,

To use a constant buffer or texture fetch, we have to declare global-scope variables, such as:

//kernel1.cu:

__constant__ int cbuf1[2000];

// below are the kernel functions
...

Now I have 10 alternative .cu files, but each main() invocation uses only one alternative and its corresponding kernel. Which kernel is called depends on the command-line arguments:

//main.cu:

#include "alt1.cu"
...
#include "alt10.cu"

int main(int argc, char** argv)
{
    switch (atoi(argv[1]))   // argv[1] is a string, so convert it first
    {
      case 1:
        runAlt1();
        break;
      ...
      case 10:
        runAlt10();
        break;
    }
}

The problem is: I can’t #include alt1.cu through alt10.cu, because the TOTAL constant-buffer size exceeds 64KB. Although I use only one alt per run, the compiler always tries to allocate the global variables for ALL the alts.

Therefore, I have to manually comment the #include’s in and out every time I run a different alt, and rebuild everything. That means I can’t drive the runs from a batch file, which wastes a lot of time.

Any suggestions? Thanks a lot!

My best bet: declare only one buffer (sized for your biggest buffer) and fill it with the appropriate values in your main function.
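As a rough sketch of that idea (the buffer size, names, and the raw-byte layout are my assumptions, not from this thread): declare a single untyped buffer sized for the largest alt, have each alt’s kernel reinterpret it as its own type, and copy in whichever alt’s data was selected before launching:

```cuda
// Hypothetical sketch: one shared constant buffer for all alts.
#define MAX_CBUF_BYTES (60 * 1024)   // stay under the 64KB constant-memory limit

__constant__ unsigned char cbuf[MAX_CBUF_BYTES];

// Each alt interprets the raw bytes as its own type inside its kernel:
__global__ void kernelAlt1(int* out)
{
    const int* table = reinterpret_cast<const int*>(cbuf);
    out[threadIdx.x] = table[threadIdx.x];
}

// Host side: copy the selected alt's data into the shared symbol, then launch.
void runAlt1(const int* hostTable, size_t n, int* devOut)
{
    cudaMemcpyToSymbol(cbuf, hostTable, n * sizeof(int));
    kernelAlt1<<<1, (unsigned)n>>>(devOut);
}
```

Since cudaMemcpyToSymbol copies raw bytes, the declared element type of the shared buffer doesn’t matter to the copy itself; only the kernel-side reinterpretation does.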

thanks, but

  1. Different alts use different data types for the constant buffer; won’t a single untyped buffer cause problems?
  2. Every alt would then reserve the biggest amount of constant memory, including the alts that don’t use it at all. Won’t that degrade performance? By how much? Thanks!

You mean int vs. float? Then I think you can use __int_as_float.
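A minimal sketch of that reinterpretation trick (the buffer name and sizes are invented for illustration): declare the shared buffer as float, copy the int data in bit-for-bit, and reinterpret inside the kernel with the bit-cast intrinsics:

```cuda
// Hypothetical shared buffer, declared once as float.
__constant__ float cbuf_shared[2000];

__global__ void intKernel(int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // The element physically holds int bits; reinterpret, don't convert.
        out[i] = __float_as_int(cbuf_shared[i]);
}

// Host side: cudaMemcpyToSymbol copies raw bytes, so int data survives intact.
void fillWithInts(const int* hostInts, int n)
{
    cudaMemcpyToSymbol(cbuf_shared, hostInts, n * sizeof(int));
}
```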

And the size of the constant buffer is not important; constant memory is as slow as global memory. What makes it fast is the constant cache, which only helps when all threads of a warp (or block, I forget which) access the same element.

If your threads access different values, you are better off using a texture. Which might be even easier for you.

Define in common code:

texture<float, 1, cudaReadModeElementType> const_float;

texture<int, 1, cudaReadModeElementType> const_int;

Then in your main you can bind the one you are going to use to linear memory that you initialize with the right values.
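A hedged sketch of that binding step using the legacy texture-reference API (the host array and launch configuration are my own; the declaration is repeated so the sketch is self-contained):

```cuda
texture<float, 1, cudaReadModeElementType> const_float;

__global__ void useFloatTex(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(const_float, i);   // fetch through the texture cache
}

// Host side: allocate linear memory, fill it, bind, launch.
void setup(const float* hostVals, int n, float* devOut)
{
    float* devVals;
    cudaMalloc(&devVals, n * sizeof(float));
    cudaMemcpy(devVals, hostVals, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaBindTexture(0, const_float, devVals, n * sizeof(float));
    useFloatTex<<<(n + 255) / 256, 256>>>(devOut, n);
}
```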

thanks!

1, I mean structures rather than basic types.

2, I guess that filling constant memory with one big buffer has negative implications for how the constant cache handles the system-default constants such as blockDim.

thanks!

Hmm, for structures I do not really have an idea other than packing them into float4 and the like.
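One possible sketch of that packing idea (the struct and all names are invented for illustration): if the struct fits in 16 bytes, store each instance as a float4 so it comes back in a single fetch, then rebuild the struct in the kernel:

```cuda
// Invented example struct that happens to fit in one float4.
struct Particle { float x, y, z, mass; };

texture<float4, 1, cudaReadModeElementType> particleTex;

__global__ void readParticles(float* outMass, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = tex1Dfetch(particleTex, i);   // one fetch per struct
        Particle p = { v.x, v.y, v.z, v.w };
        outMass[i] = p.mass;
    }
}
```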

I found a thread asking how blockIdx etc. are implemented, and this is what Mark Harris from NVIDIA said:

threadIdx is initially in register R0 of each thread (the x and y components are in the 16 LSBs and MSBs, respectively), not in shared memory. It is placed there by the hardware on invocation of each block. If the kernel doesn’t use threadIdx the compiler may choose to use the register for something else.

blockIdx, blockDim, and gridDim are passed as parameters in shared memory, because all threads in a block will read the same location when these are read.

So filling up your constant memory and not using it should have no impact.

If you define the kernels in different compilation units (.cu files), you won’t run into the 64KB limit. Constant buffers only count against a kernel if they are in the same unit.

Of course, this might cause other trouble, as you can’t share anything between compilation units.
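A minimal sketch of that layout (file and function names are assumptions): each alt lives in its own .cu with its own __constant__ buffer, and main.cu only sees an ordinary function declaration instead of #including the .cu file:

```cuda
// alt1.cu -- compiled as its own unit, so cbuf1 counts only against
// this unit's 64KB of constant memory.
__constant__ int cbuf1[2000];

__global__ void kernelAlt1(int* out)
{
    out[threadIdx.x] = cbuf1[threadIdx.x];
}

// Entry point for main.cu, which declares it as
//   void runAlt1(const int* hostTable, int* devOut);
// instead of #including alt1.cu.
void runAlt1(const int* hostTable, int* devOut)
{
    cudaMemcpyToSymbol(cbuf1, hostTable, sizeof(cbuf1));
    kernelAlt1<<<1, 256>>>(devOut);
}
```

With this split, main.cu can keep the switch on atoi(argv[1]) and link against all ten units at once, so a batch file can drive every alt from one binary.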