Persistent Const Registry values or Initializing Registry from Host

I have code that works by copying host memory to device Constant memory and then, within the kernel, transferring Constant memory into kernel const values, which appear to be allocated to registry. This approach produces significantly faster results than continuously reading from Constant memory during the search. Given that I am launching grids with ~millions of blocks, and I have as many as 48 constant kernel values per block, is there a faster way to do this? Or is the compiler recognising that these constants are the same for each block and optimizing the code so that threads reuse the same values as preceding blocks?

Code example:

// Load Registry from Constant Memory
const uint8_t r_CornerTileID = r_corner ? c_TileID[r_Tidx] : 0;
const uint8_t r_Edge1TileID = r_edge ? c_TileID[r_Edge1RotID] : 0;
const uint8_t r_Inner1TileID = c_TileID[r_Inner1RotID];
const uint8_t r_Edge1Top = r_edge ? c_Top[r_Edge1RotID] : 0;
const uint8_t r_Inner1Top = c_Top[r_Inner1RotID];
const uint8_t r_cornerLeft = r_corner ? c_Left[r_Tidx] : 0;
const uint8_t r_edge1Left = r_edge ? c_Left[r_Edge1RotID] : 0;
const uint8_t r_inner1Right = c_Right[r_Inner1RotID];
etc…
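
For reference, the host-side setup looks roughly like this (a sketch: the table size N and the h_* host arrays are placeholders, only the c_* names appear in my code above):

#include <cstdint>

#define N 256   // placeholder table size

// __constant__ declarations matching the c_* arrays used above
__constant__ uint8_t c_TileID[N];
__constant__ uint8_t c_Top[N];
__constant__ uint8_t c_Left[N];
__constant__ uint8_t c_Right[N];

// Host side: copy each lookup table into Constant memory once, before launching
void uploadTables(const uint8_t* h_TileID, const uint8_t* h_Top,
                  const uint8_t* h_Left,   const uint8_t* h_Right)
{
    cudaMemcpyToSymbol(c_TileID, h_TileID, N * sizeof(uint8_t));
    cudaMemcpyToSymbol(c_Top,    h_Top,    N * sizeof(uint8_t));
    cudaMemcpyToSymbol(c_Left,   h_Left,   N * sizeof(uint8_t));
    cudaMemcpyToSymbol(c_Right,  h_Right,  N * sizeof(uint8_t));
}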

I assume you meant “register file” instead of “registry”. The speed of reading from __constant__ memory is only slightly lower than the speed of reading from registers, as long as the access to __constant__ memory is uniform across all threads in a warp. In the general case, data placed in registers provides the highest-performance option for storing data, and the compiler will place scalar data (like the data objects shown in the code snippet) into registers in almost all cases, provided the code is built with optimizations enabled (which implies a release build).

Recent GPU architectures and CUDA versions provide an extended size limit for kernel arguments. As kernel arguments are passed via a designated __constant__ memory bank, you may be able to avoid a separate host → device copy to update __constant__ data. If you can avoid it, you would want to do so for increased efficiency. See:
https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/
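
A sketch of what that could look like (the struct layout, sizes, and kernel name are made up for illustration; the extended parameter limit requires CUDA 12.1+ per the blog post above):

#include <cstdint>

struct LookupTables {      // ~1 KB total: fits comfortably in parameter space
    uint8_t TileID[256];
    uint8_t Top[256];
    uint8_t Left[256];
    uint8_t Right[256];
};

__global__ void search(const LookupTables t)   // passed by value via the parameter
{                                              // __constant__ bank, no explicit copy
    const uint8_t r_TileID = t.TileID[threadIdx.x & 255];
    // ... rest of the search kernel
}

// Host: fill the struct and pass it directly at launch
// LookupTables tables = ...;
// search<<<grid, block>>>(tables);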

const uint8_t r_CornerTileID = r_corner ? c_TileID[r_Tidx] : 0;

Use of integer data types narrower than int can lead to reduced performance by requiring additional instructions for widening or narrowing integer type conversion at various places in the code. At the hardware level a GPU register comprises 32 bits, which is the size of an int. As a rule of thumb, any integer would want to be of type int unless there are good reasons (which may include a demonstrated performance gain) for it to be some other type.
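
A contrived illustration of the conversion overhead (hypothetical functions, shown only to make the point):

#include <cstdint>

__device__ uint8_t add_u8(uint8_t a, uint8_t b)
{
    // a and b are widened to 32 bits for the add (C integer promotion, and
    // GPU registers are 32-bit anyway), then the result is narrowed back
    // to 8 bits: potentially extra conversion instructions
    return (uint8_t)(a + b);
}

__device__ int add_i32(int a, int b)
{
    return a + b;   // a single native 32-bit add, no conversions
}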


Thank you njuffa,

That’s brilliant! I can set registry values from kernel parameters, which up until now I thought were stored in slow Local memory. Although I assume that if parameters use Constant memory, this reduces the amount of Constant memory available for other data structures?

Your assumptions were correct! I should have said register file, not registry, and I’m using release optimizations. I have experimented with 32-bit integers and found modest performance gains when using uint8_t instead; it also allows more data to fit in Constant memory and increases the amount of data that can be transferred in a single read, although during search I am predominately performing broadcast reads from Constant memory.

I do not know how these uint8_t are being used, but if you are primarily interested in the compression aspect of byte-sized data, keeping these bytes packed by four within a 32-bit register and operating on them with CUDA’s SIMD intrinsics may be another approach worth exploring. The most prominent use cases for such packed-byte operations that I am aware of are sequence alignment of genetic data (Smith-Waterman, etc.) and certain image processing applications.
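
As a minimal sketch of the idea (assuming four of your byte-sized values are packed into one 32-bit word; __vcmpeq4 is one of the documented CUDA SIMD intrinsics):

__device__ bool all_four_match(unsigned int tiles_a, unsigned int tiles_b)
{
    // per-byte equality: each result byte is 0xFF where the bytes match
    unsigned int eq = __vcmpeq4(tiles_a, tiles_b);
    return eq == 0xFFFFFFFFu;   // true if all four packed values are equal
}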


I’m using the uint8_t datatype as they only store values 0…32, and their usage is for comparing one value to another. Why using them is faster I don’t know; I had presumed that comparing two 8-bit values was faster than comparing two 32-bit values. Any thoughts?

Without code to look at and run, I can only speak in vague generalities. By comparing statistics collected by the CUDA profiler (in particular pertaining to dynamic instruction count, stall events, and memory subsystem utilization) you should be able to pinpoint the salient differences in the performance characteristics of various design alternatives.

Like other profilers (e.g. Intel’s VTune), the CUDA profiler has become more capable over the years, allowing for more in-depth analysis. At the same time (again, like other profilers) this has made it more complex with an extended learning curve. Therefore my recommendation is to spend some quality time with it to gain hands-on experience on how best to utilize it for improving your code’s performance.


Thank you for your replies; I will experiment with the ideas you put forward.

There are cases when this does happen, i.e. when kernel parameters end up being stored in (slow) local memory:

If you use an array as a kernel parameter and write into it with indices that are not fixed at compile time, or in other words, if you use the kernel parameters like local variables and write into them (not with the intent to return the written values, which does not work, but to read them back locally), then they have to be stored somewhere, and that somewhere is local memory.

To help the optimizer and make the intent clear, you can declare the types of the kernel parameters as const.
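
A sketch of the pattern to avoid versus the intended use (struct and kernel names are made up):

#include <cstdint>

struct Params { uint8_t table[48]; };

__global__ void spills(Params p, int i)
{
    p.table[i] = 0;   // write with a runtime index into a by-value parameter:
                      // the compiler must materialize a copy, typically in
                      // slow local memory
    // ... subsequent reads of p.table go through that local copy
}

__global__ void stays_const(const Params p, int i)
{
    uint8_t v = p.table[i];   // read-only access can be served directly from
                              // the parameter __constant__ bank
    // ...
}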

Normally all arithmetic operations work at the 32-bit word size. There are very few exceptions where a smaller int size has advantages, typically with very special instructions, e.g. SIMD instructions or 8-bit tensor core instructions. Sometimes using 32-bit floating-point can give an improvement depending on your instruction mix, as Nvidia GPUs have more floating-point throughput than integer throughput. But conversions are slow.

In your case, where you only need 5 bits for 0…31, a manual comparison with subtractions (greater/smaller) or XOR (equality comparison) can make sense. The additional available bits act as guard bits that absorb the borrow, so it cannot spill into the neighbouring field.

unsigned int A = 0x1209;   // two 5-bit fields per word: 0x12 (bits 8-12) and 0x09 (bits 0-4)
unsigned int B = 0x1308;   // fields 0x13 and 0x08

unsigned int C = (0x2020 | A) - B;   // 0x2020 sets a guard bit above each field, preventing a
                                     // borrow from crossing into the next field; after the
                                     // subtraction a guard bit is still set iff A's field >= B's field

With 32-bit numbers we can do 4 (or theoretically 5) such 5-bit comparisons in one instruction, although this could still end up slower than doing four separate 32-bit comparisons.
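
Extending the example above to a full 32-bit word (a sketch; here each byte holds one 5-bit value in bits 0-4, with bit 5 of each byte as the guard bit):

unsigned int A = 0x0912011F;              // four 5-bit fields: 9, 18, 1, 31
unsigned int B = 0x08130210;              // four 5-bit fields: 8, 19, 2, 16

unsigned int C  = (0x20202020u | A) - B;  // the guard bit in every byte absorbs that
                                          // field's borrow, so borrows never cross
                                          // byte boundaries
unsigned int ge = C & 0x20202020u;        // 0x20000020: guard set in bytes 3 and 0,
                                          // i.e. 9 >= 8 and 31 >= 16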

Also have a look here:
https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html

But understand that most of these SIMD instructions have been emulated (and therefore slow) for several years now, with the exception of vabsdiff4; they only had full hardware support on early CUDA architectures.
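
For completeness, a sketch of the fast one (using __vabsdiffu4, the unsigned byte variant from the page above):

__device__ unsigned int byte_absdiff(unsigned int a, unsigned int b)
{
    // per-byte |a - b| in a single intrinsic, e.g.
    // a = 0x0912011F, b = 0x08130210  ->  0x0101010F
    return __vabsdiffu4(a, b);
}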


Thanks Curefab for the explanation and for directing me to the math intrinsics; I hope to publish full results of the trials in the near future. Although I had previously tried using int instead of uint8_t with a much slower algorithm, I thought I would re-test with int, since the documentation and forums frequently recommend using int. Testing over 100 problems took on average 3.4662302 seconds with uint8_t but 3.9977064 seconds with int; that’s significant!

Although they are emulated, where the byte-wise variants can be used as intended, they are usually noticeably faster than a discrete per-byte equivalent.