Hi everyone,
I come to you with a simple question:
What’s the best way to build accelerated functions with strided arrays?
Just for the sake of simplicity and clarity, I’ll use the following example[^1]:
Let’s assume I have the following array:
float array[4][5];
I’m interested in dealing with a 2-D subset of it, namely a subarray of 2 rows and 3 columns that starts at &array[1][1], as in the following depiction:
+-----+-----+-----+-----+-----+
| 0,0 | 0,1 | 0,2 | 0,3 | 0,4 |
+-----+=====+=====+=====+-----+
| 1,0 [ 1,1 | 1,2 | 1,3 ] 1,4 |
+-----+=====+=====+=====+-----+
| 2,0 [ 2,1 | 2,2 | 2,3 ] 2,4 |
+-----+=====+=====+=====+-----+
| 3,0 | 3,1 | 3,2 | 3,3 | 3,4 |
+-----+-----+-----+-----+-----+
On the host, this can be trivially addressed with:
template <class T>
void cpuModSubArray(T* array, int stride, int rows, int cols, T value)
{
    // stride is the row pitch of the parent array (5 here);
    // rows/cols are the dimensions of the sub-block.
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            array[i * stride + j] = value;
}
and the call:
cpuModSubArray<float>(&array[1][1], 5, 2, 3, 3.1416);
On the device, the traditional approach is to write a CUDA kernel such as:
template <class T>
__global__ void modSubArray(T* array, int stride, int rows, int cols, T value)
{
    size_t j = blockIdx.x * blockDim.x + threadIdx.x;  // column within the sub-block
    size_t i = blockIdx.y * blockDim.y + threadIdx.y;  // row within the sub-block
    if ((i < rows) && (j < cols)) {
        array[i * stride + j] = value;
    }
}
with the kernel wrapped as:
template <class T>
__host__ void gpuModSubArray(T* array, int stride, int rows, int cols, const T value)
{
    dim3 threadsPerBlock(16, 8, 1);
    dim3 numberOfBlocks;
    // ceil-divide so partial tiles at the edges are covered
    numberOfBlocks.x = (cols + threadsPerBlock.x - 1) / threadsPerBlock.x;
    numberOfBlocks.y = (rows + threadsPerBlock.y - 1) / threadsPerBlock.y;
    numberOfBlocks.z = 1;
    modSubArray<<<numberOfBlocks, threadsPerBlock>>>(array, stride, rows, cols, value);
    devErrchk(cudaDeviceSynchronize());
}
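(devErrchk here is my usual CUDA error-check helper; for completeness, it is something along these lines:)
#include <cstdio>
#include <cstdlib>

// Abort with file/line information on any CUDA error.
inline void gpuAssert(cudaError_t code, const char* file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "devErrchk: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}
#define devErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }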
and the call:
gpuModSubArray<float>(&array[1][1], 5, 2, 3, 3.1416); // assumes array is in device or managed memory
In principle, both the host and the device implementations produce the same output. However, the device implementation introduces extra parameters that are device dependent, in this example numberOfBlocks and threadsPerBlock, and they have to be tuned for each application. Furthermore, I’m a little bothered by the fact that we are not using the full power of C++ to write more general and error-proof code.
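The closest thing I know of for automating that tuning is cudaOccupancyMaxPotentialBlockSize, which asks the runtime to suggest a block size for the kernel. But it returns a 1-D block size, so the kernel has to be rewritten with linear indexing. Roughly (untested, like everything else here):
// Same operation, but with a flat index so the block size can be a single int.
template <class T>
__global__ void modSubArray1D(T* array, int stride, int rows, int cols, T value)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < rows * cols)
        array[(k / cols) * stride + (k % cols)] = value;
}

template <class T>
__host__ void gpuModSubArrayAuto(T* array, int stride, int rows, int cols, T value)
{
    int minGridSize = 0, blockSize = 0;
    // Let the runtime pick a block size with good occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, modSubArray1D<T>, 0, 0);
    int n = rows * cols;
    modSubArray1D<<<(n + blockSize - 1) / blockSize, blockSize>>>(array, stride, rows, cols, value);
    devErrchk(cudaDeviceSynchronize());
}
That removes the hard-coded threadsPerBlock, but it is still a lot of ceremony for a simple strided fill.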
For element-wise and/or reduction operations on contiguous data arrays, the Thrust library makes this easy (it takes away the problem of tuning device-dependent parameters, and it is a C++ library); I think this was the objective of creating Thrust in the first place. Furthermore, it helps us build single-source *.cpp implementations. However, for non-contiguous data arrays, such as the subarray in the above example, the equivalent Thrust implementation becomes cumbersome, sometimes to the point of being almost unreadable. See for example this snippet or this other reduction snippet.
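To make the complaint concrete, the pattern I know of is to linearize the sub-block with a counting_iterator and recover the strided address inside a functor, something like this (names are mine, untested):
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

// Map the flat index k of the sub-block back to a strided address.
template <class T>
struct set_strided
{
    T* array;
    int stride, cols;
    T value;

    __host__ __device__ void operator()(int k) const
    {
        int i = k / cols;  // row inside the sub-block
        int j = k % cols;  // column inside the sub-block
        array[i * stride + j] = value;
    }
};

template <class T>
void thrustModSubArray(T* array, int stride, int rows, int cols, T value)
{
    thrust::for_each(thrust::device,
                     thrust::counting_iterator<int>(0),
                     thrust::counting_iterator<int>(rows * cols),
                     set_strided<T>{array, stride, cols, value});
}
It works, but the 2-D structure ends up buried in the functor’s index arithmetic instead of being expressed by the iterators, and for reductions it only gets worse.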
I believe that in CUDA 12 there must be an elegant C++ way to deal with this type of operation on strided / non-contiguous data arrays, either by using (perhaps newer functionalities of) Thrust or any of the other CUDA libraries (but nothing as high-level as Eigen, Blaze, or another matrix library).
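One candidate I have been eyeing is cuda::std::mdspan from libcu++, which as far as I can tell ships with recent CUDA 12.x releases. If I read the documentation correctly, a layout_stride mapping could describe the sub-block and hide the index arithmetic entirely; roughly (untested, and I may be wrong about the exact access syntax pre-C++23):
#include <cuda/std/mdspan>
#include <cuda/std/array>

// A non-owning strided 2-D view: the layout applies the stride for us.
template <class T>
using strided_2d = cuda::std::mdspan<T, cuda::std::dextents<size_t, 2>,
                                     cuda::std::layout_stride>;

template <class T>
__global__ void modSubView(strided_2d<T> sub, T value)
{
    size_t j = blockIdx.x * blockDim.x + threadIdx.x;
    size_t i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < sub.extent(0) && j < sub.extent(1))
        sub(i, j) = value;  // no manual i * stride + j
}

template <class T>
__host__ void gpuModSubView(T* origin, size_t stride, size_t rows, size_t cols, T value)
{
    using ext_t = cuda::std::dextents<size_t, 2>;
    // Extents {rows, cols} with strides {stride, 1} describe the sub-block.
    cuda::std::layout_stride::mapping<ext_t> map{
        ext_t{rows, cols}, cuda::std::array<size_t, 2>{stride, 1}};
    strided_2d<T> sub{origin, map};

    dim3 block(16, 8, 1);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y, 1);
    modSubView<<<grid, block>>>(sub, value);
    devErrchk(cudaDeviceSynchronize());
}
That still leaves the launch-configuration problem, though, which is why I am asking.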
Perhaps someone can enlighten me in this regard?
[^1]: Disclaimer: I have not tested any of the snippets presented above; I just wrote them on the fly.