access violation: mis ld

From what I gather reading the manual, a “mis ld” type of error from the cuda memory checker logs indicates access of memory that is not word aligned. Is this correct? For reference, here is a sample output of the error:

Summary of access violations:
================================================================================

Memory Checker detected 1 access violations.
error = misaligned load (global memory)
gridid = 281511
blockIdx = {2489,0,0}
threadIdx = {0,0,0}
address = 0x1312c3a548
accessSize = 8

================================================================================
CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
    CUcontext    = 02fcc720
    CUstream     = 02fc0170
    CUmodule     = 06fbc390
    CUfunction   = 070a2d20
    FunctionName = _Z25cudaColumnMeanPowerKernelPfP7ComplexIfEjj
    GridId       = 375246
    gridDim      = {5880,1,1}
    blockDim     = {1,1,1}
    sharedSize   = 256
    Parameters:
    Parameters (raw):
         0x20e00000 0x00000013 0x09e00000 0x00000013
         0x00002000 0x000016f8
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                                                    PC  Source
---------------------------------------------------------------------------------------------------------------------------------------------
130e5d7a10     8    mis ld    g        2634       0       {2634,0,0}    {0,0,0}  _Z25cudaColumnMeanPowerKernelPfP7ComplexIfEjj+0000d8

Furthermore, I assume turning on “line numbers” when compiling should show me the offending source line. Is this correct?

The offending code kernel is this:

__global__ void cudaColumnMeanPowerKernel(FloatData* __restrict__ pOutputVec,
										  ComplexData* __restrict__ pInputMat,
										  UInt32 numRows,
										  UInt32 numCols)
{
	FloatData sum = 0.0;
	const UInt32 col = blockIdx.x;
	for (UInt32 k = 0; k < numRows; ++k)
	{
		const UInt32 kk = k * numCols + col;
		sum += (pInputMat[kk].real * pInputMat[kk].real) + (pInputMat[kk].imag * pInputMat[kk].imag);
	}
	pOutputVec[col] = sum / numRows;
}

I am thinking the problem is with the access to ComplexData which is 2 32-bit floats lined up for a total of 8 bytes. Does this data structure need to be aligned at an 8byte boundary?

Hi,

mis ld means misaligned access during a memory load,I think you should align your ComplexData, but I’m not very familiar with the CUDA alignment part, I suggest you to ask the guys at the CUDA programming form.

Best Regards
Harry

As far as I can tell, our complex data is aligned. See the code below.

template <typename T>
struct ALIGN(8) Complex
{
	T real, imag;
	CUDA_PREFIX Complex<T>() : real(0), imag(0) {}
	CUDA_PREFIX Complex<T>(T r, T i) : real(r), imag(i) {}
	CUDA_PREFIX Complex<T>& operator*=(T rhs) { real *= rhs; imag *= rhs; return *this; }
};
typedef Complex<Float32> ComplexFloat32;

typedef ComplexFloat32 ComplexData;

Geez, it’s really complicated, sorry I’m not familiar with the CUDA alignment part, I think you can ask the guys in CUDA Programming and Performance.

https://devtalk.nvidia.com/default/board/57/cuda-programming-and-performance/