access violation: mis ld

Sorry for posting this issue here but it was suggested in the memchecker forum that I post here.

I am getting a “mid ld” error when running the memchecker. I don’t understand why I am getting this error I get that the data needs to be aligned, but it appears to be. Can anyone shed some light on this?

For reference, here is a sample output of the error:

Summary of access violations:

Memory Checker detected 1 access violations.
error = misaligned load (global memory)
gridid = 281511
blockIdx = {2489,0,0}
threadIdx = {0,0,0}
address = 0x1312c3a548
accessSize = 8

CUDA Memory Checker detected 1 threads caused an access violation:
Launch Parameters
    CUcontext    = 02fcc720
    CUstream     = 02fc0170
    CUmodule     = 06fbc390
    CUfunction   = 070a2d20
    FunctionName = _Z25cudaColumnMeanPowerKernelPfP7ComplexIfEjj
    GridId       = 375246
    gridDim      = {5880,1,1}
    blockDim     = {1,1,1}
    sharedSize   = 256
    Parameters (raw):
         0x20e00000 0x00000013 0x09e00000 0x00000013
         0x00002000 0x000016f8
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                                                    PC  Source
130e5d7a10     8    mis ld    g        2634       0       {2634,0,0}    {0,0,0}  _Z25cudaColumnMeanPowerKernelPfP7ComplexIfEjj+0000d8

Furthermore, I assume turning on “line numbers” when compiling should show me the offending source line. Is this correct?

The offending code kernel is this:

__global__ void cudaColumnMeanPowerKernel(FloatData* __restrict__ pOutputVec,
										  ComplexData* __restrict__ pInputMat,
										  UInt32 numRows,
										  UInt32 numCols)
	FloatData sum = 0.0;
	const UInt32 col = blockIdx.x;
	for (UInt32 k = 0; k < numRows; ++k)
		const UInt32 kk = k * numCols + col;
		sum += (pInputMat[kk].real * pInputMat[kk].real) + (pInputMat[kk].imag * pInputMat[kk].imag);
	pOutputVec[col] = sum / numRows;

I am thinking the problem is with the access to ComplexData which is 2 32-bit floats lined up for a total of 8 bytes. Does this data structure need to be aligned at an 8byte boundary?

ComplexData is aligned as follows:

template <typename T>
struct ALIGN(8) Complex
	T real, imag;
	CUDA_PREFIX Complex<T>() : real(0), imag(0) {}
	CUDA_PREFIX Complex<T>(T r, T i) : real(r), imag(i) {}
	CUDA_PREFIX Complex<T>& operator*=(T rhs) { real *= rhs; imag *= rhs; return *this; }
typedef Complex<Float32> ComplexFloat32;

typedef ComplexFloat32 ComplexData;

where #define ALIGN(X) align(X)

The compiler may indeed be attempting to do an 8-byte load. If so, it would need to be aligned to an 8-byte boundary.

I’m not sure what the issue is, since you haven’t provided a complete code, but the pInputMat pointer in the kernel tells us nothing about the alignment of the pointer that was passed to the kernel call. If you are doing something unnatural with the argument passed to the kernel for the pInputMat parameter, it could certainly create this observation.

compiling with -lineinfo and running with cuda-memcheck should show the offending source line. I don’t remember if/how this maps into VS with the memchecker built into VS, but it should certainly be possible to do this via a standalone cuda-memcheck run on windows.

If you provide a short, complete example, I’m sure someone can explain the issue.