How to unpack/pack datatype on GPU shared memory using CUDA programming

junkyeong.choi · October 1, 2021, 5:44am

Hi, I’ve encountered an problem while working on GPU processing with reduced precision(i.e. 4-bit integer, 8-bit integer).

I have eight 4-bit integer elements packed on an 32-bit integer element of shared memory array.
However, they are not arranged in a way I want them to be.

[Figure 1] shows data arrangement of a current shared memory array. (Let’s say the size of shared memory array with 32-bit datatype is 64). Each 32-bit element of the shared memory contains eight 4-bit integer data. I want each 4-bit data with the same color to be consecutive on the shared memory array as [Figure 2].

As I’m not really familiar with the implementation of inter-thread communication using CUDA, it is quite difficult to come up with an idea on how to gather and distribute data safely among threads. I wonder if this problem is even solvable with CUDA programming as 32 threads works together as a single warp.

I want the CUDA implementation to be 1)Data-safe, 2)Fairly effective and 3)Well-parallelized.

njuffa · October 1, 2021, 7:15am

Generally speaking, in CUDA programming one typically assigns output data to threads. Each thread then gathers input data as needed to generate that output data.

I assume there is enough memory available that the re-ordering of the data does not have to happen in-place. In other words, input array and output array are going to be separate. As a starting point I would suggest that each thread produces one register worth of output, i.e. 32 bits. With that arrangement 64 threads are needed, each of which gathers eight 4-bit input data items and writes out a single 32-bit word of output. This gives you working code and a performance baseline.

In a second step you could examine whether using 128 threads each producing a uint16_t output or 256 threads each producing a uint8_t output are faster arrangements. You may also want to familiarize yourself with the CUDA profiler and keep an eye on shared memory specific performance metrics such as bank conflicts. In a next step you might want to examine the utility of warp-level shuffle primitives for this use case.

Taking one step back to avoid a possible XY-issue, I would suggest avoiding this kind of on-the-fly array-of-structure (AOS) to structure-of-array (SOA) re-arranging and use an SOA arrangement throughout. If this is data that is shared between host and device: SOA is usually preferred for efficient host processing as well.

junkyeong.choi · October 5, 2021, 4:16am

Thank you for your answer! However, I still got few questions with the effectiveness of the implementation.

If I’m using 32-bit int datatype for the output data. In this case, as a single load instruction brings 32-bit words, I should load 32-bit data only for the useful 4-bit data, which leads to 8 times L1 cache bandwidth waste.

For comparison, I could use uint8_t datatype. If a load instruction brings 8-bit data, the L1 cache bandwidth waste could reduce by 1/4(8 times → 2 times). However, I think this is not the case because, as far as I know, the atomic word size for memory load/store of GPU is 32-bit.

My questions are below:

Does unit8_t implementation saves L1 bandwidth usage? Or does it perform better only due to the parallelism? (I wonder if the microarchitectural details may not be open to the public)
If I’m going to reduce Shared Memory bandwidth waste, I guess warp-level shuffle may help because It can perform intra-warp, register-level reduction. Am I right with this?

njuffa · October 5, 2021, 4:22am

The CUDA profiler is the appropriate tool to use when pondering questions about bandwidth utilization. It is a powerful tool for a data-driven software optimization process, and larges eliminates the need for theorizing.

junkyeong.choi · October 5, 2021, 4:24am

Thank you for your quick response!

system · December 4, 2021, 4:25am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Code optimization with CDP and dynamic shared memory allocation CUDA Programming and Performance	18	64	January 13, 2025
How can I configure this problem is it too big to fit in shared memory? CUDA Programming and Performance	7	3756	October 14, 2008
beginner question regarding shared memory CUDA Programming and Performance	4	6922	November 16, 2009
Bank conflict hell: (could a shuffle based sort help)? CUDA Programming and Performance	17	4407	April 17, 2015
Using Shared Memory in CUDA C/C++ Technical Blog	36	1995	October 8, 2020
CUDA Use Cases run serial algorithms on composite data CUDA Programming and Performance	14	4492	October 24, 2008
Usage of uint3 CUDA Programming and Performance	5	31	April 23, 2025
numblocks and threads allocation rule? CUDA Programming and Performance	9	13278	March 5, 2011
CUDA Shuffle Instruction (Warp-level intra register exchange) CUDA Programming and Performance	8	8936	November 29, 2013
Cannot achieve max shared memory bandwith CUDA Programming and Performance	12	814	November 20, 2023

How to unpack/pack datatype on GPU shared memory using CUDA programming

Related topics