Sorry for posting this issue here but it was suggested in the memchecker forum that I post here.

I am getting a “mid ld” error when running the memchecker. I don’t understand why I am getting this error I get that the data needs to be aligned, but it appears to be. Can anyone shed some light on this?

For reference, here is a sample output of the error:

```
Summary of access violations:
================================================================================
Memory Checker detected 1 access violations.
error = misaligned load (global memory)
gridid = 281511
blockIdx = {2489,0,0}
threadIdx = {0,0,0}
address = 0x1312c3a548
accessSize = 8
================================================================================
CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
CUcontext = 02fcc720
CUstream = 02fc0170
CUmodule = 06fbc390
CUfunction = 070a2d20
FunctionName = _Z25cudaColumnMeanPowerKernelPfP7ComplexIfEjj
GridId = 375246
gridDim = {5880,1,1}
blockDim = {1,1,1}
sharedSize = 256
Parameters:
Parameters (raw):
0x20e00000 0x00000013 0x09e00000 0x00000013
0x00002000 0x000016f8
GPU State:
Address Size Type Mem Block Thread blockIdx threadIdx PC Source
---------------------------------------------------------------------------------------------------------------------------------------------
130e5d7a10 8 mis ld g 2634 0 {2634,0,0} {0,0,0} _Z25cudaColumnMeanPowerKernelPfP7ComplexIfEjj+0000d8
```

Furthermore, I assume turning on “line numbers” when compiling should show me the offending source line. Is this correct?

The offending code kernel is this:

```
__global__ void cudaColumnMeanPowerKernel(FloatData* __restrict__ pOutputVec,
ComplexData* __restrict__ pInputMat,
UInt32 numRows,
UInt32 numCols)
{
FloatData sum = 0.0;
const UInt32 col = blockIdx.x;
for (UInt32 k = 0; k < numRows; ++k)
{
const UInt32 kk = k * numCols + col;
sum += (pInputMat[kk].real * pInputMat[kk].real) + (pInputMat[kk].imag * pInputMat[kk].imag);
}
pOutputVec[col] = sum / numRows;
}
```

I am thinking the problem is with the access to ComplexData which is 2 32-bit floats lined up for a total of 8 bytes. Does this data structure need to be aligned at an 8byte boundary?

ComplexData is aligned as follows:

```
template <typename T>
struct ALIGN(8) Complex
{
T real, imag;
CUDA_PREFIX Complex<T>() : real(0), imag(0) {}
CUDA_PREFIX Complex<T>(T r, T i) : real(r), imag(i) {}
CUDA_PREFIX Complex<T>& operator*=(T rhs) { real *= rhs; imag *= rhs; return *this; }
};
typedef Complex<Float32> ComplexFloat32;
typedef ComplexFloat32 ComplexData;
```

where #define ALIGN(X) **align**(X)