[Problem] Misaligned loads for any index from a Array of Arrays

Problem: Any load from an array results in a misaligned load.

Input: master array that has pointers to N arrays of input data.

Goal: Load N different 1D streams that represent 2D data into one data structure that can be passed to kernels for processing.

Currently I have the array of arrays defined as:

cuComplex** d_channel_data;

Since I cannot alter GPU memory directly from the CPU I broke up the initialization of this data structure into multiple steps:

  1. Allocate each input array
  2. Save pointer to each input array in an array. (called cuda_channel_array)
  3. Create master array
  4. Copy list of pointers for each input array into the master array.

Here is the initialization code:

cuda_channel_array.resize ( m_channel_count );

for ( unsigned int i = 0; i < m_channel_count; i++ )
	CudaSafeCall ( cudaMalloc ( &cuda_channel_array[i],
		* common::memory::constants::MAX_DOPPLER_BIN
		* sizeof(cuComplex) ) );

// Allocate array of pointers
CudaSafeCall ( cudaMalloc ( (void**)&d_channel_data, m_channel_count * sizeof(cuComplex*) ) );

// Copy channel data pointers
CudaSafeCall ( cudaMemcpyAsync ( d_channel_data,
	m_channel_count * sizeof(std::complex<float>),
	stream ) );

The error I receive is that in a kernel that uses d_channel_data is a misaligned load for any index into the appropriate input array.

Question: Is there anything wrong with how I am creating the data structure?

Note: Each input array is one slice of the data. The input arrays represent a 3D data structure if they could be combined into one. My experience has been to use boost multi_array objects (which internally store N-dimensions as a 1D array) to push data to and from the GPU. I have never used true 2D or 3D data structures in the GPU. It would be nice to have the data in a 3D data structure.

Suggested changing the copying of channel data pointers to:

14.// Copy channel data pointers

15.CudaSafeCall ( cudaMemcpyAsync ( d_channel_data,

16.	&cuda_channel_array[0],

17.	m_channel_count * sizeof(std::complex<float>*),

18.	cudaMemcpyHostToDevice,

19.	stream ) );

This unfortunately still fails with a misaligned load.

I have this same problem. Any resolution?

It is curious that you state that you have the same problem, because from the scant information above I can’t even tell what the problem is. Maybe you could post a minimal, standalone, buildable and runnable example that demonstrates the issue?

Representing a 2D-array as an array of pointers to row (or column) vectors instead of using contiguous storage is usually a bad idea, independent of any GPU programming. For example, it does not allow easy operations on arbitrary sub-matrices. If the host-side data structure cannot be changed, the standard recommendation is to use a contiguously stored matrix on the GPU, and fill it one row (or column) vector at a time, requiring one host->GPU copy operation per vector.

If so desired, a copy of an array-of-pointers-to-vectors data structure can certainly be recreated on the device (I have omitted all error checking for conciseness here; don’t do this in production code):

#include <cstdio>
#include <cstdlib>
#include <complex>
#include "cuComplex.h"

#define N  (2)
#define M  (3)

typedef std::complex<float> T;

__global__ void print_device_matrix (cuComplex** mat)
    printf ("matrix on device:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            printf ("(%f, %f)  ", cuCrealf (mat[i][j]), cuCimagf (mat[i][j]));
        printf ("\n");

int main (void)
    /* allocate host "matrix" */
    T **mat = (T**)malloc (N * sizeof (mat[0]));
    for (int i = 0; i < N; i++) {
        mat[i] = (T *)malloc (M * sizeof (mat[0][0]));
    /* fill in host "matrix" */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            mat[i][j] = T (float(i)+1, float(j)+1);

    /* print host "matrix" */
    printf ("matrix on host:\n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            printf ("(%f, %f)  ", real(mat[i][j]), imag(mat[i][j]));
        printf ("\n");

    /* allocate device "matrix" */
    T **tmp = (T**)malloc (N * sizeof (tmp[0]));
    for (int i = 0; i < N; i++) {
        cudaMalloc ((void **)&tmp[i], M * sizeof (tmp[0][0]));
    cuComplex **matD = 0;
    cudaMalloc ((void **)&matD, N * sizeof (matD[0]));

    /* copy "matrix" from host to device */
    cudaMemcpy (matD, tmp, N * sizeof (matD[0]), cudaMemcpyHostToDevice);
    for (int i = 0; i < N; i++) {
        cudaMemcpy (tmp[i], mat[i], M * sizeof (matD[0][0]), cudaMemcpyHostToDevice);
    free (tmp);

    /* print device "matrix" */
    print_device_matrix<<<1,1>>> (matD);

    /* free host "matrix" */
    for (int i = 0; i < N; i++) {
        free (mat[i]);
    free (mat);
    /* free device "matrix" */
    tmp = (T**)malloc (N * sizeof (tmp[0]));
    cudaMemcpy (tmp, matD, N * sizeof (matD[0]), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        cudaFree (tmp[i]);
    free (tmp);
    cudaFree (matD);

    return EXIT_SUCCESS;

Representing a 2D-array as an array of pointers to row (or column) vectors instead of using contiguous storage is usually a bad idea, independent of any GPU programming.

This isnt true. Perhaps each row of the matrix is a time series collected from a sensor. I have this usecase and I often will make an array of pointers to each row of the matrix and then pass pointer array to a thrust for_each to process all the rows for me independently (Say I want to whiten each row).

Maybe I am missing something, but I don’t see what prevents you from reading each time series of sensor data into the rows (or columns) of a contiguously stored matrix? I have worked with seismic sensor data (from oil exploration) stored in a contiguous 2D-matrix before, which would appear to match your use case. Similarly with microscopy data. Pointers to each row (or column) are readily available by simple index arithmetic.

In terms of efficient processing, a contiguously stored matrix will beat array-of-pointers-to-vectors on just about any metric: less overhead for (de-)allocation, trivial to copy, more efficient during element access (one memory access instead of two), convenient access to arbitrary sub-matrices in-place (e.g. for BLAS operations).

To get back on topic: Did my example code address your problem? As I said, I don’t know what the problem actually is, it hasn’t been clearly stated. I was just guessing it has to do with the details of replicating a host array-of-pointers-to-vectors data structure on the device.

for_each operates off of an iterator range so you can use sub-ranges of a contiguous allocation to process things independently.

I have reworked my example code in #4 to use single-precision complex data, in case the problem (whatever it is) was partially caused by confusion about interoperability between the standard C++ complex type and CUDA’s cuComplex type. The complex data types of C, C++, Fortran, and CUDA have identical layout. It should be noted that there is no full-fledged support for complex data in CUDA, and that cuComplex.h only defines the minimum needed to implement CUBLAS and CUFFT, and was designed prior to CUDA transitioning from C to C++, so it’s a bit clunky.

It is true.