Hi everyone,

I come to you with a simple question:

What’s the best way to build accelerated functions with strided arrays?

Just for the sake of simplicity and clarity, I’ll use the following example[^1]:

Let's assume I have the following array:

```
float array[4][5];
```

I’m interested in dealing with a 2-D subset of it, namely a subarray of 2 rows and 3 columns that starts at `&array[1][1]`, as in the following depiction:

```
+-----+-----+-----+-----+-----+
| 0,0 | 0,1 | 0,2 | 0,3 | 0,4 |
+-----+=====+=====+=====+-----+
| 1,0 [ 1,1 | 1,2 | 1,3 ] 1,4 |
+-----+=====+=====+=====+-----+
| 2,0 [ 2,1 | 2,2 | 2,3 ] 2,4 |
+-----+=====+=====+=====+-----+
| 3,0 | 3,1 | 3,2 | 3,3 | 3,4 |
+-----+-----+-----+-----+-----+
```

On the **host**, this can be addressed trivially by:

```
template <class T>
void cpuModSubArray(T* array, int stride, int rows, int cols, T value)
{
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            array[i * stride + j] = value;
}
```

and the call:

```
cpuModSubArray<float>(&array[1][1], 5, 2, 3, 3.1416);
```

On the **device**, the traditional approach is to simply write a CUDA kernel such as:

```
template <class T>
__global__ void modSubArray(T* array, int stride, int rows, int cols, T value)
{
    size_t j = blockIdx.x * blockDim.x + threadIdx.x;
    size_t i = blockIdx.y * blockDim.y + threadIdx.y;
    if ((i < rows) && (j < cols)) {
        array[i * stride + j] = value;
    }
}
```

with the kernel wrapped as

```
template <class T>
__host__ void gpuModSubArray(T* array, int stride, int rows, int cols, const T value)
{
    dim3 threadsPerBlock(16, 8, 1);
    dim3 numberOfBlocks;
    numberOfBlocks.x = (cols + threadsPerBlock.x - 1) / threadsPerBlock.x;
    numberOfBlocks.y = (rows + threadsPerBlock.y - 1) / threadsPerBlock.y;
    numberOfBlocks.z = 1;
    modSubArray<<<numberOfBlocks, threadsPerBlock>>>(array, stride, rows, cols, value);
    devErrchk(cudaDeviceSynchronize());
}
```

and with a call:

```
gpuModSubArray<float>(&array[1][1], 5, 2, 3, 3.1416);
```

In principle, both the **host** and **device** implementations produce the same output. However, the device implementation introduces extra parameters that are device-dependent, in this example `numberOfBlocks` and `threadsPerBlock`; they have to be tuned for each application. Furthermore, I’m a little bothered by the fact that we are not utilizing the full power of C++ to write more general and error-proof code.

For *element-wise* and/or *reduction* operations, dealing with *contiguous* data arrays is easy with the Thrust library (because it takes away the problem of tuning device-dependent parameters and is a C++ library); I think this was the objective of creating Thrust in the first place. Furthermore, it helps us build **single-source** *.cpp implementations. However, for *non-contiguous* data arrays, such as the subarray in the example above, the equivalent Thrust implementation becomes cumbersome, sometimes to the point of being almost unreadable. See for example this snippet or this other reduction snippet.

I believe that in CUDA 12 there must be an elegant C++ way to deal with this type of operation on *strided / non-contiguous data arrays*, either by using (perhaps newer functionalities of) Thrust or any of the other CUDA libraries (but not as high-level as Eigen, Blaze, or another matrix library).

Perhaps someone can enlighten my path in this regard?

[^1]: Disclaimer: I have not tested any of the snippets presented above. I just wrote them on the fly.