Optimizing a Kernel with a lot of variables - memory allocation.

Hey All,

So I have a kernel which requires a lot of constants to run. What I’ve been doing is defining and initializing them on the host side (some of them need to be calculated), and then passing them into the kernel when I call it, so that I don’t spend time working them out during execution. I.e., my kernel definition looks like this:

__global__ void grating(float* d_input_xr, float* d_input_xi, float* d_mode_xr, float* d_mode_xi,
                        double* d_pitchfork_xr, double* d_pitchfork_xi, double* d_out,
                        int width, int height, double sd, double k, double pitch,
                        int width_half, int height_half, int mode_x_half, int mode_y_half,
                        double focal_length, double tar_g, double time, double tar_s, double tar_l,
                        double mode_offsetx, double mode_offsetz, double int_screen_mult,
                        double screen_offsetz, double screen_offsetx, double u, double v, double zero_time)

These variables, and the elements of the arrays, are all used during the execution of the kernel.

What I’m wondering, however, is where these end up in memory. Does the compiler put them in local memory? After reading through the NVIDIA documentation it seems that local memory can be off-chip as well as held in cache or registers. I want my kernel to be able to access them as quickly as possible, so I was wondering whether, instead of passing them in like this, I should define shared memory variables and calculate them all in the first thread I launch per block, so they’re physically closer to the cores.
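Roughly what I had in mind, as a cut-down sketch with made-up arithmetic rather than my real kernel:

__global__ void grating_shared(const float* d_input_xr, double* d_out, int n,
                               double sd, double k, double pitch)
{
    __shared__ double s_consts[3];
    if (threadIdx.x == 0) {              // first thread of each block works the values out
        s_consts[0] = sd * k;
        s_consts[1] = pitch * 0.5;
        s_consts[2] = sd / pitch;
    }
    __syncthreads();                     // everyone waits until the values are ready

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_input_xr[i] * s_consts[0] + s_consts[1] * s_consts[2];
}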

I tried using Nsight to figure out where the compiler was actually putting them, but I wasn’t able to make much sense of the output; the best I could come up with was that some of them were listed as type “local double”, others as type “double”, and yet others as type “parameter int”.

In any case, any advice on how to handle the glut of constants I need would be much appreciated.

Cheers,

-BBsmitz

I am not quite sure what you are asking. Arguments to __global__ functions are placed in .param space at the PTX level, and on all currently supported GPU architectures this in turn maps to a constant memory bank at the machine code level. This makes the most sense, since access to this data will tend to be uniform across all threads in a thread block.
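As an aside (just a sketch with made-up field names, not anything from your code): since the scalar arguments land in that constant bank either way, you can bundle them into a struct passed by value purely to keep the signature manageable; the struct ends up in the same .param space.

struct GratingParams {
    int    width, height, width_half, height_half;
    double sd, k, pitch, focal_length, tar_g, tar_s, tar_l;
};

__global__ void grating_packed(const float* d_input_xr, double* d_out,
                               int n, GratingParams p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_input_xr[i] * p.sd * p.k + p.pitch;   // p.sd etc. are read from the parameter (constant) space
}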

You can also put constant data into __constant__ variables yourself (these are read-only from the device, but writable from the host), and they should go to a different constant memory bank than the __global__ function arguments.
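A minimal sketch of that (array size and names are made up): compute the values on the host once, copy them up with cudaMemcpyToSymbol, and read them from the kernel.

__constant__ double c_params[8];        // lives in a constant memory bank

__global__ void grating_c(const float* d_input_xr, double* d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_input_xr[i] * c_params[0] + c_params[1];   // same address for every thread in the warp
}

// host side, after working the values out:
// double h_params[8] = { sd, k, pitch, focal_length, tar_g, tar_s, tar_l, zero_time };
// cudaMemcpyToSymbol(c_params, h_params, sizeof(h_params));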

Either way, constant memory has fairly tight size limits (see the CUDA C Programming Guide for details), and is only really efficient if access is mostly uniform (= all threads in a warp access the same address); otherwise, serialization during constant memory accesses will eat into performance.

Larger data should be passed to kernels via pointers to separately allocated global memory, and you would want to use the “const” and “__restrict__” qualifiers with these pointers as appropriate (see the Best Practices Guide).
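For example (a sketch showing only a couple of your arrays): marking the read-only inputs this way tells the compiler there is no aliasing and lets it route the loads through the read-only cache path on hardware that has one.

__global__ void grating_ro(const float* __restrict__ d_input_xr,
                           const float* __restrict__ d_mode_xr,
                           double* __restrict__ d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = (double)d_input_xr[i] * d_mode_xr[i];
}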

Kernel arguments are stored in constant memory:

Programming Guide :: CUDA Toolkit Documentation

What the compiler does with them after that will depend on your actual code.

If those parameters are actually constants unmodified by device code, you might want to mark them with const.

For (POD) parameters that are known at compile time and don’t need to be computed, you could also consider not passing them as kernel parameters at all, but instead defining them as global constants:

const double tar_s = 1.2;

If this is defined at global scope in the same compilation unit as the kernel, it can be used just as if you had passed it via a kernel parameter.
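To illustrate (the values and the arithmetic are placeholders): a const scalar at file scope in the same .cu file as the kernel can be used directly in device code and will typically be folded in as a compile-time constant, so there is nothing to pass at launch time.

const double tar_s = 1.2;
const double pitch = 0.5e-3;      // placeholder value
const int    width = 1024;        // placeholder value

__global__ void grating_cc(double* d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = tar_s * pitch * i / width;   // tar_s, pitch, width are known at compile time
}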

Many machine instructions provide for the direct use of constant memory data. Such references look like “c[0x0][0x48]” in the disassembly generated by “cuobjdump --dump-sass”, where the first index is the constant memory bank and the second index is the offset inside the bank.

I stand corrected and have edited my previous remarks.

Well, the kernel arguments for the arrays are just pointers like d_mode_xr, and the array data themselves sit in global memory anyway; I think the array data are what is most interesting to the topic starter. Those accesses will either be cached in L2 (by default) or in the L1/texture cache if the pointers are declared as “const float* __restrict__”. Note that the L1 cache is only 12-24 KB, while the L2 cache is 1-3 MB, so you may prefer to rely on the L2 cache only.

Thanks all for your comments. They were most helpful.