Help removing misaligned access on a C2070 (Fermi system)

After digging into the errors I’ve been getting, running cuda-memcheck let me narrow the problem down to misaligned memory accesses on a Fermi card (C2070). The problem does not occur on older architectures.

By commenting out code, I’ve narrowed the problem to lines 30-42 of the attached file. The idea of the program is to build a hash table in global memory that holds the aggregated data from the input vector.

I’m new to developing on the Fermi and would appreciate any information regarding what causes the misaligned access and how to solve it.

If more information is required or if you have any questions/suggestions, please feel free to ask/suggest away.
gpuDB_kernels.cu (2.26 KB)

Bump!

I’ve been looking into this and can’t seem to figure out a solution. Does anyone have information on what causes misaligned access errors on Fermi?

From my understanding, misaligned accesses are handled better on Fermi than on older architectures. So why does this run on a GTS 250 and fail on a C2070?
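In case it helps anyone reading along: from what I’ve gathered, a “misaligned address” means a load or store of some type from an address that isn’t a multiple of that type’s size. A purely hypothetical illustration (not code from my kernel) would be casting a byte-offset pointer to a wider type:

// Hypothetical illustration only, not code from my kernel.
// Reading a 4-byte int from an address that is not a multiple of 4
// is exactly the kind of access memcheck flags as misaligned.
__global__ void misalignedExample(char *buf)
{
    // buf itself is aligned (it came from cudaMalloc), but buf + 1 is not
    // a multiple of sizeof(int), so dereferencing it is a misaligned load.
    int *badPtr = (int *)(buf + 1);
    int value = *badPtr;              // misaligned 4-byte read

    // The safer pattern: keep the pointer typed to the element and index
    // in whole elements, so every access stays naturally aligned.
    int *okPtr = (int *)buf;
    int value2 = okPtr[1];            // aligned 4-byte read at byte offset 4

    // Dummy use so the compiler doesn’t optimize the loads away.
    if (value + value2 == 42) buf[0] = 0;
}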

Please help!

Another bump, hopefully with some better info this time.

Attached are the latest files (and a bit more) that should let you run this on pretty much any system. As before, it works on a 9500M and a GTS 250, but not on a C2070.

The general idea is to build a hash table in global memory whose entries hold aggregates of the input tuples. The design is based on the hash table design from CUDA by Example.
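For anyone who doesn’t want to open the zip, here is a heavily simplified stand-in for what the kernel does (this is not the attached gpuDB_kernels.cu, just the general shape of the approach; Entry and aggregateKernel are made-up names, and keys are assumed non-negative):

// Simplified stand-in for the approach, not the attached code.
// Each table entry aggregates (count/sum) all input tuples that hash to it.
struct Entry {
    int key;
    int count;
    int sum;
};

__global__ void aggregateKernel(const int *keys, const int *values,
                                int n, Entry *table, int tableSize)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;                 // guard the partially filled last block

    int slot = keys[tid] % tableSize;     // trivial hash, as in CUDA by Example

    // Aggregate straight into global memory; atomics avoid races between
    // threads whose tuples land in the same slot.
    atomicAdd(&table[slot].count, 1);
    atomicAdd(&table[slot].sum, values[tid]);
    table[slot].key = keys[tid];          // last writer wins, fine for a sketch
}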

I’ve compiled in debug mode and run it under cuda-gdb with “set cuda memcheck on”, getting the following errors in three different runs:

[Launch of CUDA Kernel 0 (computeAggregation) on Device 0]

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.

[Switching to CUDA Kernel 0 (<<<(19,0),(64,0,0)>>>)]

0x0000000000eb4d80 in computeAggregation ()

[Launch of CUDA Kernel 0 (computeAggregation) on Device 0]

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.

[Switching to CUDA Kernel 0 (<<<(0,0),(66,0,0)>>>)]

0x00000000007dbd80 in computeAggregation ()

[Launch of CUDA Kernel 0 (computeAggregation) on Device 0]

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.

[Switching to CUDA Kernel 0 (<<<(4,0),(162,0,0)>>>)]

0x0000000001cb5d80 in computeAggregation ()

According to the debugging manual, Lane Illegal Address is caused by an “illegal (out of bounds) global address”. Stepping through the logic, I can’t figure out where it could go out of bounds, nor do I understand why it produces correct results on older (less strict) architectures but errors out on Fermi. I also removed the use of pointers and instead used a plain index to determine the location within the arrays. I know that Fermi has a more unified address space; could this be the cause?
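To show the kind of thing I’ve been checking for (a generic example, not claiming this is my actual bug): the usual way a kernel steps out of bounds is a rounded-up grid with no guard on the extra threads of the last block.

// Generic out-of-bounds pattern, not my actual kernel.
__global__ void scaleKernel(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Without this guard, the extra threads of the last (partially filled)
    // block read and write past data[n - 1], which memcheck reports as an
    // illegal address.
    if (tid < n)
        data[tid] = data[tid] * 2.0f;
}

// Launch: the grid is rounded up, so some threads in the last block
// can have tid >= n and must be guarded as above.
// int threads = 64;
// int blocks  = (n + threads - 1) / threads;
// scaleKernel<<<blocks, threads>>>(d_data, n);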

I’ve tried to debug further with breakpoints, but since a different thread causes the error on every run, do you have any suggestions for a better way to debug? Perhaps some specific commands; I was never too good with gdb.

Thanks in advance for any help, or simply making it this far in reading!

CUDA Toolkit: 3.2

Device: C2070

Compile options: sm_20
src.zip (4.9 KB)

Out of bounds access in shared memory.

But I’m not using any shared memory, at least not in the latest version of the code that’s uploaded.