I have a kernel that needs more than the 48 KB + 16 KB cache of each SM of a K40

Dear All

 I have to launch a kernel with 128 threads, but the variables I declare across all threads sum to more than the 64 KB available. Which policies should I follow to get the best performance?

Thanks

Luis Gonçalves

If your variables won’t fit in shared memory, then in general you must use global memory.

You should follow the basic global memory optimization rules, such as arranging data and accesses for coalesced loads. There are many webinars on this topic; here is one:

http://on-demand.gputechconf.com/gtc-express/2011/presentations/cuda_webinars_GlobalMemory.pdf
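As a minimal sketch (not the poster's code; the kernel name and parameters are invented for illustration), a coalesced pattern simply means adjacent threads in a warp touch adjacent elements:

```cuda
// Illustrative sketch: adjacent threads read and write adjacent
// double-precision elements, so each warp's accesses coalesce into
// the minimum number of memory transactions.
__global__ void scale(const double *in, double *out, double factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive index per thread
    if (idx < n)
        out[idx] = in[idx] * factor;  // coalesced load and store
}
```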

What happens if I declare more memory space than the 48 KB + 16 KB? Does the code handle that well automatically, or is it better that I take it into account as much as possible?

Thanks

Luis Gonçalves

The 48K on each SM that you are referring to is shared memory. If you declare more than 48K of shared memory, you will get either compile errors (static definition of size) or runtime errors at kernel launch (dynamic definition of shared memory size).
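A small sketch of the two failure modes (assumed kernel names, not the poster's code). A static declaration over the 48 KB per-block limit is rejected at compile time; an oversized dynamic request fails at kernel launch and can be seen with `cudaGetLastError()`:

```cuda
// Static case: 7000 doubles = 56 KB > 48 KB, so ptxas reports an
// "uses too much shared data" error at compile time.
__global__ void too_much_static(void)
{
    __shared__ double buf[7000];
    buf[threadIdx.x] = 0.0;
}

// Dynamic case: the size is only known at launch.
__global__ void dyn_kernel(void)
{
    extern __shared__ double sbuf[];
    sbuf[threadIdx.x] = 0.0;
}

// Host side: requesting 64 KB of dynamic shared memory on a K40
// fails at launch; the error shows up on the next error check.
// dyn_kernel<<<1, 128, 64 * 1024>>>();
// cudaError_t err = cudaGetLastError();  // launch failure reported here
```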

No, I mean: what if inside the kernel I declare more than 64 KB (summing the memory across a block of threads)? Does the compiler map those variables to global memory (GMEM)? Is it better to handle that mapping or allocation myself?

Thanks

Luis Gonçalves

Could you show an example (CUDA source code) that demonstrates what you are currently doing or are trying to accomplish? It is very hard to tell from the vague descriptions above.

__global__ void process1(int nusersvirt, int iter, double *real_codigo, double *imag_codigo,
                         double *fft_out_real, double *fft_out_imag,
                         double *fft_in_real, double *fft_in_imag, int nredund, double *eqin)
{
    int banda1 = threadIdx.x;
    int brwidth = blockDim.x;
    int i, red, form1, INFO = 0;
    double var1[8];
    complex1 var2[6];
    double coef[18];
    complex1 a[171];
    complex1 RO[18];
    int form, z, z1, banda4;
    int banda3, ind2;
    int shift = 0;
    int index1;
    div_t divres;
    int baixo;
    complex1 FR[MAXUSERS*MAXREDUND];


#pragma omp parallel num_threads(16)
{
    process1<<<1,128>>>(z7, z5, real_codigo+2560*(z7*nr), imag_codigo+2560*(z7*nr),
        outreal1, outimag1,
        (double *)(frame1+512+sizeof(double)*(NRSAMPLES*nr*2)),
        (double *)(frame1+512+sizeof(double)*(NRSAMPLES*(nr*2+1))), br, eqin);
}


typedef struct
{
double r;
double i;
} complex1;

MAXUSERS*MAXREDUND = 16*18


The GPU is a K40

Thanks

Luis Gonçalves

These are all pointers to global memory space:

(…, double *real_codigo,double *imag_codigo,double *fft_out_real,double *fft_out_imag,
double *fft_in_real,double *fft_in_imag,… double *eqin)

I wouldn’t make any “automatic” assumptions about behavior there. Follow the principles outlined in the previous presentation I linked.

These are all definitions residing in local memory space:

int banda1=threadIdx.x;
int brwidth=blockDim.x;
int i,red,form1,INFO=0;
double var1[8];
complex1 var2[6];
double coef[18];
complex1 a[171];
complex1 RO[18];
int form,z,z1,banda4;
int banda3,ind2;
int shift=0;
int index1;
div_t divres;
int baixo;
complex1 FR[MAXUSERS*MAXREDUND];

Some of these will end up as “register” variables. There are no obvious further steps I would suggest to optimize access to these:

int banda1=threadIdx.x;
int brwidth=blockDim.x;
int i,red,form1,INFO=0;
int form,z,z1,banda4;
int banda3,ind2;
int shift=0;
int index1;
int baixo;

For the remainder, there is a good chance that some or most will end up backed by on-board (DRAM) storage, the same physical memory that backs ordinary global memory. For those, an underlying access-pattern rule can be applied: a local-memory array access retrieved from on-board memory will coalesce when all threads of a warp access the array at the same index. That means an access like:

double my_double = var1[5];

should be fully coalesced across threads in a warp.
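By contrast (a sketch reusing the `var1` array above), local memory is interleaved per thread, so a uniform index lines up across the warp while a thread-varying index generally does not:

```cuda
// Local memory is interleaved per thread: a uniform index coalesces,
// a per-thread index scatters the accesses.
double a = var1[5];                // same index in every thread: coalesced
double b = var1[threadIdx.x % 8];  // index differs per thread: scattered accesses
```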

While it’s good to acquire knowledge that will guide your coding, you seem to be a beginner. I wouldn’t spend a large amount of time studying local memory access optimization until your code works. Get your code working first, then use analysis-driven optimization to guide your optimization efforts:

http://www.nvidia.com/content/gtc-2010/pdfs/2012_gtc2010.pdf

This isn’t very sensible, from what I can see:

#pragma omp parallel num_threads(16)
{
process1<<<1,128>>>(z7, z5,real_codigo+2560*(z7*nr),imag_codigo+2560*(z7*nr),outreal1,outimag1,(double *)(frame1+512+sizeof(double)*(NRSAMPLES*nr*2)),(double *)(frame1+512+sizeof(double)*(NRSAMPLES*(nr*2+1))),br,eqin);
}

Even though you are spinning up 16 host threads, those kernels are all being launched into the default stream and will run sequentially. Furthermore, launching kernels of one block (<<<1,128>>>) is not a good way to achieve performance, since a single block can occupy at most one SM, leaving the rest of the GPU idle.
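A sketch of one way to address the default-stream issue (the stream array and `omp_get_thread_num()` plumbing are assumptions, not the poster's code): give each host thread its own non-default stream so the launches can overlap.

```cuda
// Assumed setup: create one stream per OpenMP thread so kernels
// launched from different host threads are not serialized on the
// default stream.
cudaStream_t streams[16];
for (int i = 0; i < 16; ++i)
    cudaStreamCreate(&streams[i]);

#pragma omp parallel num_threads(16)
{
    int t = omp_get_thread_num();
    // Kernels in different non-default streams may run concurrently.
    // Note that a one-block launch still uses only one SM; prefer
    // enough blocks to fill the GPU when the problem allows it.
    process1<<<1, 128, 0, streams[t]>>>(/* ... arguments as before ... */);
}
```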