Optimizing a Kernel with a lot of variables - memory allocation.

Hey All,

So I have a kernel which requires a lot of constants to run. What I’ve been doing is defining and initializing them on the host side (some of them need to be calculated), and then passing them into the kernel when I call it, so that I don’t spend time working them out during execution. I.e., my kernel definition looks like this:

__global__ void grating(float* d_input_xr, float* d_input_xi, float* d_mode_xr, float* d_mode_xi,
                        double* d_pitchfork_xr, double* d_pitchfork_xi, double* d_out,
                        int width, int height, double sd, double k, double pitch,
                        int width_half, int height_half, int mode_x_half, int mode_y_half,
                        double focal_length, double tar_g, double time, double tar_s, double tar_l,
                        double mode_offsetx, double mode_offsetz, double int_screen_mult,
                        double screen_offsetz, double screen_offsetx, double u, double v, double zero_time)

These variables, and the elements of the arrays, are all used during the execution of the kernel.

What I’m wondering, however, is where these end up in memory. Does the compiler put them in local memory? After reading through the NVIDIA documentation it seems that local memory can be off-chip as well as held in cache or registers. I want my kernel to be able to access them as quickly as possible, so I was wondering whether, instead of passing them in like this, I should define shared memory variables and calculate them all in the first thread I launch per block, so they’re physically closer to the cores.
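Roughly what I had in mind, as a cut-down sketch with made-up arithmetic rather than my real kernel:

__global__ void grating_shared(const float* d_input_xr, double* d_out, int n,
                               double sd, double k, double pitch)
{
    __shared__ double s_consts[3];
    if (threadIdx.x == 0) {              // first thread of each block works the values out
        s_consts[0] = sd * k;
        s_consts[1] = pitch * 0.5;
        s_consts[2] = sd / pitch;
    }
    __syncthreads();                     // everyone waits until the values are ready

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_input_xr[i] * s_consts[0] + s_consts[1] * s_consts[2];
}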

I tried using Nsight to figure out where the compiler was actually putting them, but I wasn’t able to make much sense of the output; the best I could come up with was that some of them were listed as type “local double”, others as type “double”, and yet others as type “parameter int”.

In any case, any advice on how to handle the glut of constants I need would be much appreciated.

Cheers,

-BBsmitz

I am not quite sure what you are asking. Arguments to __global__ functions are placed in .param space at the PTX level, and on all currently supported GPU architectures this in turn maps to a constant memory bank at the machine code level. This makes the most sense, since access to this data will tend to be uniform across all threads in a thread block.
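As an aside (just a sketch with made-up field names, not anything from your code): since the scalar arguments land in that constant bank either way, you can bundle them into a struct passed by value purely to keep the signature manageable; the struct ends up in the same .param space.

struct GratingParams {
    int    width, height, width_half, height_half;
    double sd, k, pitch, focal_length, tar_g, tar_s, tar_l;
};

__global__ void grating_packed(const float* d_input_xr, double* d_out,
                               int n, GratingParams p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_input_xr[i] * p.sd * p.k + p.pitch;   // p.sd etc. are read from the parameter (constant) space
}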

You can also put constant data into __constant__ variables yourself (these are read-only from the device, but writable from the host), and they should go to a different constant memory bank than the __global__ function arguments.
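A minimal sketch of that (array size and names are made up): compute the values on the host once, copy them up with cudaMemcpyToSymbol, and read them from the kernel.

__constant__ double c_params[8];        // lives in a constant memory bank

__global__ void grating_c(const float* d_input_xr, double* d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = d_input_xr[i] * c_params[0] + c_params[1];   // same address for every thread in the warp
}

// host side, after working the values out:
// double h_params[8] = { sd, k, pitch, focal_length, tar_g, tar_s, tar_l, zero_time };
// cudaMemcpyToSymbol(c_params, h_params, sizeof(h_params));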

Either way, constant memory has fairly tight size limits (see the CUDA C Programming Guide for details), and is only really efficient if access is mostly uniform (= all threads in a warp access the same address); otherwise, serialization during constant memory accesses will eat into performance.

Larger data should be passed to kernels via pointers to separately allocated global memory, and you would want to use the “const” and “__restrict__” qualifiers with these pointers as appropriate (see the Best Practices Guide).
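For example (a sketch showing only a couple of your arrays): marking the read-only inputs this way tells the compiler there is no aliasing and lets it route the loads through the read-only cache path on hardware that has one.

__global__ void grating_ro(const float* __restrict__ d_input_xr,
                           const float* __restrict__ d_mode_xr,
                           double* __restrict__ d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = (double)d_input_xr[i] * d_mode_xr[i];
}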

Kernel arguments are stored in constant memory:

Programming Guide :: CUDA Toolkit Documentation

What the compiler does with them after that will depend on your actual code.

If those parameters are actually constants unmodified by device code, you might want to mark them with const.

For (POD) parameters that are known at compile time and don’t need to be computed, you could also consider not passing them as kernel parameters at all, but instead defining them as global constants:

const double tar_s = 1.2;

If this is defined at global scope in the same compilation unit as the kernel, it can be used just as if you had passed it via a kernel parameter.
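To illustrate (the values and the arithmetic are placeholders): a const scalar at file scope in the same .cu file as the kernel can be used directly in device code and will typically be folded in as a compile-time constant, so there is nothing to pass at launch time.

const double tar_s = 1.2;
const double pitch = 0.5e-3;      // placeholder value
const int    width = 1024;        // placeholder value

__global__ void grating_cc(double* d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = tar_s * pitch * i / width;   // tar_s, pitch, width are known at compile time
}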

Many machine instructions provide for the direct use of constant memory data. Such references look like “c[0x0][0x48]” in the disassembly generated by “cuobjdump --dump-sass”, where the first index is the constant memory bank and the second index is the offset inside the bank.

I stand corrected and have edited my previous remarks.

Well, the kernel arguments for the arrays are just pointers like d_mode_xr, and the array data themselves sit in global memory anyway; I think the array data are what is most interesting to the topic starter. Those accesses will either be cached in L2 (by default) or in the L1/texture cache if the pointers are declared as “const float* __restrict__”. Note that the L1 cache is only 12-24 KB, while the L2 cache is 1-3 MB, so you may prefer to rely on the L2 cache only.

Thanks all for your comments. They were most helpful.