How to unpack/pack datatype on GPU shared memory using CUDA programming

Hi, I’ve encountered an problem while working on GPU processing with reduced precision(i.e. 4-bit integer, 8-bit integer).

I have eight 4-bit integer elements packed on an 32-bit integer element of shared memory array.
However, they are not arranged in a way I want them to be.

[Figure 1] shows data arrangement of a current shared memory array. (Let’s say the size of shared memory array with 32-bit datatype is 64). Each 32-bit element of the shared memory contains eight 4-bit integer data. I want each 4-bit data with the same color to be consecutive on the shared memory array as [Figure 2].

As I’m not really familiar with the implementation of inter-thread communication using CUDA, it is quite difficult to come up with an idea on how to gather and distribute data safely among threads. I wonder if this problem is even solvable with CUDA programming as 32 threads works together as a single warp.

I want the CUDA implementation to be 1)Data-safe, 2)Fairly effective and 3)Well-parallelized.

Generally speaking, in CUDA programming one typically assigns output data to threads. Each thread then gathers input data as needed to generate that output data.

I assume there is enough memory available that the re-ordering of the data does not have to happen in-place. In other words, input array and output array are going to be separate. As a starting point I would suggest that each thread produces one register worth of output, i.e. 32 bits. With that arrangement 64 threads are needed, each of which gathers eight 4-bit input data items and writes out a single 32-bit word of output. This gives you working code and a performance baseline.

In a second step you could examine whether using 128 threads each producing a uint16_t output or 256 threads each producing a uint8_t output are faster arrangements. You may also want to familiarize yourself with the CUDA profiler and keep an eye on shared memory specific performance metrics such as bank conflicts. In a next step you might want to examine the utility of warp-level shuffle primitives for this use case.

Taking one step back to avoid a possible XY-issue, I would suggest avoiding this kind of on-the-fly array-of-structure (AOS) to structure-of-array (SOA) re-arranging and use an SOA arrangement throughout. If this is data that is shared between host and device: SOA is usually preferred for efficient host processing as well.

1 Like

Thank you for your answer! However, I still got few questions with the effectiveness of the implementation.

If I’m using 32-bit int datatype for the output data. In this case, as a single load instruction brings 32-bit words, I should load 32-bit data only for the useful 4-bit data, which leads to 8 times L1 cache bandwidth waste.

For comparison, I could use uint8_t datatype. If a load instruction brings 8-bit data, the L1 cache bandwidth waste could reduce by 1/4(8 times → 2 times). However, I think this is not the case because, as far as I know, the atomic word size for memory load/store of GPU is 32-bit.

My questions are below:

  1. Does unit8_t implementation saves L1 bandwidth usage? Or does it perform better only due to the parallelism? (I wonder if the microarchitectural details may not be open to the public)
  2. If I’m going to reduce Shared Memory bandwidth waste, I guess warp-level shuffle may help because It can perform intra-warp, register-level reduction. Am I right with this?

The CUDA profiler is the appropriate tool to use when pondering questions about bandwidth utilization. It is a powerful tool for a data-driven software optimization process, and larges eliminates the need for theorizing.

1 Like

Thank you for your quick response!