The error I receive is that in a kernel that uses d_channel_data is a misaligned load for any index into the appropriate input array.
Question: Is there anything wrong with how I am creating the data structure?
Note: Each input array is one slice of the data. The input arrays represent a 3D data structure if they could be combined into one. My experience has been to use boost multi_array objects (which internally store N-dimensions as a 1D array) to push data to and from the GPU. I have never used true 2D or 3D data structures in the GPU. It would be nice to have the data in a 3D data structure.
It is curious that you state that you have the same problem, because from the scant information above I can’t even tell what the problem is. Maybe you could post a minimal, standalone, buildable and runnable example that demonstrates the issue?
Representing a 2D-array as an array of pointers to row (or column) vectors instead of using contiguous storage is usually a bad idea, independent of any GPU programming. For example, it does not allow easy operations on arbitrary sub-matrices. If the host-side data structure cannot be changed, the standard recommendation is to use a contiguously stored matrix on the GPU, and fill it one row (or column) vector at a time, requiring one host->GPU copy operation per vector.
If so desired, a copy of an array-of-pointers-to-vectors data structure can certainly be recreated on the device (I have omitted all error checking for conciseness here; don’t do this in production code):
#include <cstdio>
#include <cstdlib>
#include <complex>
#include "cuComplex.h"
#define N (2)
#define M (3)
typedef std::complex<float> T;
__global__ void print_device_matrix (cuComplex** mat)
{
printf ("matrix on device:\n");
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
printf ("(%f, %f) ", cuCrealf (mat[i][j]), cuCimagf (mat[i][j]));
}
printf ("\n");
}
}
int main (void)
{
/* allocate host "matrix" */
T **mat = (T**)malloc (N * sizeof (mat[0]));
for (int i = 0; i < N; i++) {
mat[i] = (T *)malloc (M * sizeof (mat[0][0]));
}
/* fill in host "matrix" */
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
mat[i][j] = T (float(i)+1, float(j)+1);
}
}
/* print host "matrix" */
printf ("matrix on host:\n");
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
printf ("(%f, %f) ", real(mat[i][j]), imag(mat[i][j]));
}
printf ("\n");
}
/* allocate device "matrix" */
T **tmp = (T**)malloc (N * sizeof (tmp[0]));
for (int i = 0; i < N; i++) {
cudaMalloc ((void **)&tmp[i], M * sizeof (tmp[0][0]));
}
cuComplex **matD = 0;
cudaMalloc ((void **)&matD, N * sizeof (matD[0]));
/* copy "matrix" from host to device */
cudaMemcpy (matD, tmp, N * sizeof (matD[0]), cudaMemcpyHostToDevice);
for (int i = 0; i < N; i++) {
cudaMemcpy (tmp[i], mat[i], M * sizeof (matD[0][0]), cudaMemcpyHostToDevice);
}
free (tmp);
/* print device "matrix" */
print_device_matrix<<<1,1>>> (matD);
/* free host "matrix" */
for (int i = 0; i < N; i++) {
free (mat[i]);
}
free (mat);
/* free device "matrix" */
tmp = (T**)malloc (N * sizeof (tmp[0]));
cudaMemcpy (tmp, matD, N * sizeof (matD[0]), cudaMemcpyDeviceToHost);
for (int i = 0; i < N; i++) {
cudaFree (tmp[i]);
}
free (tmp);
cudaFree (matD);
return EXIT_SUCCESS;
}
Representing a 2D-array as an array of pointers to row (or column) vectors instead of using contiguous storage is usually a bad idea, independent of any GPU programming.
This isnt true. Perhaps each row of the matrix is a time series collected from a sensor. I have this usecase and I often will make an array of pointers to each row of the matrix and then pass pointer array to a thrust for_each to process all the rows for me independently (Say I want to whiten each row).
Maybe I am missing something, but I don’t see what prevents you from reading each time series of sensor data into the rows (or columns) of a contiguously stored matrix? I have worked with seismic sensor data (from oil exploration) stored in a contiguous 2D-matrix before, which would appear to match your use case. Similarly with microscopy data. Pointers to each row (or column) are readily available by simple index arithmetic.
In terms of efficient processing, a contiguously stored matrix will beat array-of-pointers-to-vectors on just about any metric: less overhead for (de-)allocation, trivial to copy, more efficient during element access (one memory access instead of two), convenient access to arbitrary sub-matrices in-place (e.g. for BLAS operations).
To get back on topic: Did my example code address your problem? As I said, I don’t know what the problem actually is, it hasn’t been clearly stated. I was just guessing it has to do with the details of replicating a host array-of-pointers-to-vectors data structure on the device.
I have reworked my example code in #4 to use single-precision complex data, in case the problem (whatever it is) was partially caused by confusion about interoperability between the standard C++ complex type and CUDA’s cuComplex type. The complex data types of C, C++, Fortran, and CUDA have identical layout. It should be noted that there is no full-fledged support for complex data in CUDA, and that cuComplex.h only defines the minimum needed to implement CUBLAS and CUFFT, and was designed prior to CUDA transitioning from C to C++, so it’s a bit clunky.