Memory checker reports unaligned 8-byte access at 0x209820590

Symptom
I am observing a CUDA memory checker failure complaining about an unaligned load where I believe the load is properly aligned. The error message looks like the address is actually aligned.
Log

GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                                          PC  Source
-----------------------------------------------------------------------------------------------------------------------------------
 209820590     8    mis ld    g          36     224         {36,0,0}  {224,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     225         {36,0,0}  {225,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     226         {36,0,0}  {226,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     227         {36,0,0}  {227,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     228         {36,0,0}  {228,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     229         {36,0,0}  {229,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     230         {36,0,0}  {230,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     231         {36,0,0}  {231,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     232         {36,0,0}  {232,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     233         {36,0,0}  {233,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     234         {36,0,0}  {234,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     235         {36,0,0}  {235,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     236         {36,0,0}  {236,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     237         {36,0,0}  {237,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     238         {36,0,0}  {238,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     239         {36,0,0}  {239,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     240         {36,0,0}  {240,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     241         {36,0,0}  {241,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     242         {36,0,0}  {242,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     243         {36,0,0}  {243,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     244         {36,0,0}  {244,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     245         {36,0,0}  {245,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     246         {36,0,0}  {246,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     247         {36,0,0}  {247,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     248         {36,0,0}  {248,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     249         {36,0,0}  {249,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     250         {36,0,0}  {250,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     251         {36,0,0}  {251,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     252         {36,0,0}  {252,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     253         {36,0,0}  {253,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     254         {36,0,0}  {254,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     255         {36,0,0}  {255,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30

Configuration
I am using the following software:

  • Microsoft Visual Studio Community 2015 (version 14.0.25431.01 Update 3)
  • NVIDIA Nsight Visual Studio Edition 5.2.0.16223
  • NVIDIA CUDA compiler v8.0
  • NVIDIA driver 378.92 (WHQL)
  • Windows 10

The hardware is a single-GPU machine with a Titan X (Pascal).
Source code
This program is my attempt at a solution to Project Euler problem 10, implementing a prime sieve for the numbers up to 2,000,000. It’s a work in progress.

euler10.cu

#include "cuda_stdlib.h"
#include "cuda_sum.h"
#include <assert.h>

#include <algorithm>
#include <tuple>

template <typename PrimeType>
__global__ void sieve_kernel(
	PrimeType sieve_base,
	uint8_t *sieve,
	PrimeType sieve_size,
	const PrimeType *primes,
	PrimeType num_primes) {
	const PrimeType sieve_offset = cuda::get_thread_id();

	if (sieve_offset >= sieve_size) {
		// Deal with sieve sizes that aren't an exact multiple of the block size
		// This can cause thread deviation in warps in the last block
		return;
	}

	const PrimeType candidate = sieve_base + sieve_offset;

	bool is_prime = true;

	for (PrimeType prime_idx = 0;
		prime_idx < num_primes;
		++prime_idx) {
		const PrimeType candidate_divisor = primes[prime_idx];
		const PrimeType remainder = candidate % candidate_divisor;
		is_prime &= (remainder != 0);
	}

	sieve[sieve_offset] = (is_prime ? 1 : 0);
}

template <typename PrimeType>
__global__ void sieve_to_counts_kernel(
	PrimeType *counts,
	PrimeType num_counts,
	const uint8_t *sieve,
	PrimeType sieve_size,
	PrimeType sieve_entries_per_count
) {
	const PrimeType count_index = cuda::get_thread_id();
	if (count_index > num_counts) {
		// Deal with num_counts not an exact multiple of threads_per_block
		return;
	}

	PrimeType count = 0;

	const uint8_t *start_ptr = &sieve[count_index * sieve_entries_per_count];
	const uint8_t *end_ptr = &sieve[(count_index + 1) * sieve_entries_per_count];
	if (end_ptr >= &sieve[sieve_size]) {
		// Deal with sieve sizes that aren't an exact multiple of the block size
		// This can cause thread deviation in warps in the last block
		end_ptr = &sieve[sieve_size];
	}
	for (const uint8_t *sieve_ptr = start_ptr; sieve_ptr < end_ptr; ++sieve_ptr) {
		count += ((PrimeType)*sieve_ptr ? 1 : 0);
	}
	counts[count_index] = count;
}

template <typename PrimeType>
__global__ void counts_to_starts_kernel(
	PrimeType *counts, // modified in place
	PrimeType num_counts
) {
	PrimeType start = 0;
	for (PrimeType *count_ptr = &counts[0]; count_ptr < &counts[num_counts + 1]; ++count_ptr) {
		PrimeType start_for_this_count = start;
		start += (PrimeType)*count_ptr;
		*count_ptr = start_for_this_count;
	}
}

template <typename PrimeType>
__global__ void sieve_to_primes_kernel(
	PrimeType *primes,
	const PrimeType *starts,
	PrimeType num_starts,
	PrimeType sieve_entries_per_count,
	const uint8_t *sieve,
	PrimeType sieve_size,
	PrimeType sieve_base
) {
	const PrimeType start_index = cuda::get_thread_id();

	if (start_index > num_starts) {
		return; // when num_starts isn't a multiple of threads_per_block
	}

	PrimeType *prime_ptr = &primes[starts[start_index]];
	PrimeType prime_value = sieve_base + (start_index * sieve_entries_per_count);

	for (const uint8_t *sieve_ptr = &sieve[start_index * sieve_entries_per_count];
		sieve_ptr < &sieve[(start_index + 1) * sieve_entries_per_count] &&
		sieve_ptr < &sieve[sieve_size];
		++sieve_ptr, ++prime_value) {
		if (*sieve_ptr == 1) {
			*prime_ptr = prime_value;
			++prime_ptr;
		}
	}
}

#define DEBUGGING

template <typename PrimeType>
std::pair<cuda::ptr<PrimeType>, PrimeType> sieve_primes(PrimeType max) {
	struct Mark {
	private:
		PrimeType m;
	public:
		Mark(PrimeType m) { this->m = m; }
		PrimeType max() const { return m; }
		PrimeType size() const { return m + 1; }
	} mark(2);

	cuda::ptr<PrimeType> primes(std::vector<PrimeType>{ 2 });
	PrimeType num_primes = 1;

	while (mark.max() < max) {
		const Mark new_mark((max / mark.max() >= mark.max()) ? (mark.max() * mark.max()) : max);
		const PrimeType sieve_base = mark.max() + 1;
		const PrimeType sieve_size = (new_mark.size() - mark.size());
		cuda::ptr<uint8_t> sieve(cuda::array_sizeof<uint8_t>(sieve_size));

		{
			const std::pair<dim3, dim3> dims = cuda::get_dimensions(sieve_size);

			sieve_kernel<PrimeType> <<<dims.first, dims.second>>>(
				sieve_base,
				sieve.get(),
				sieve_size,
				primes.get(),
				num_primes);

			if ((cuda::maybe_report_error("cudaDeviceSynchronize failed: ", cudaDeviceSynchronize()) != cudaSuccess) ||
				(cuda::maybe_report_error("addKernel launch failed: ", cudaGetLastError()) != cudaSuccess)) {
				exit(-1);
			}

		}

		mark = new_mark;

		const PrimeType sieve_entries_per_count = 16;
		const PrimeType num_counts = (sieve_size + sieve_entries_per_count - 1) / sieve_entries_per_count;

		// We allocate an extra count so that, when we convert to starts,
		// the first element is 0 and last element contains the # of primes.
		cuda::ptr<PrimeType> counts(cuda::array_sizeof<PrimeType>(num_counts + 1));

		{
			const std::pair<dim3, dim3> dims = cuda::get_dimensions(num_counts);

			sieve_to_counts_kernel<PrimeType> <<<dims.first, dims.second>>> (
				counts.get(),
				num_counts,
				sieve.get(),
				sieve_size,
				sieve_entries_per_count
				);
		}

#ifdef DEBUGGING
		std::vector<PrimeType> primes_local(num_primes);
		cuda::memcpy(primes_local.data(), primes, cuda::array_sizeof<PrimeType>(num_primes));

		std::vector<uint8_t> sieve_local(sieve_size);
		cuda::memcpy(sieve_local.data(), sieve, cuda::array_sizeof<uint8_t>(sieve_size));

		std::vector<PrimeType> counts_local(num_counts);
		cuda::memcpy(counts_local.data(), counts, cuda::array_sizeof<PrimeType>(num_counts));
#endif

		{
			counts_to_starts_kernel<PrimeType> <<<1, 1>>> (
				counts.get(),
				num_counts
				);
		}

#ifdef DEBUGGING
		std::vector<PrimeType> starts_local(num_counts);
		cuda::memcpy(starts_local.data(), counts, cuda::array_sizeof<PrimeType>(num_counts));
#endif

		PrimeType num_additional_primes = 0;

		cuda::maybe_report_error("Couldn't memcpy count from device: ",
			cudaMemcpy(&num_additional_primes, &counts.get()[num_counts], sizeof(PrimeType), cudaMemcpyDeviceToHost));

		const PrimeType new_num_primes = num_primes + num_additional_primes;
		cuda::ptr<PrimeType> new_primes(cuda::array_sizeof<PrimeType>(new_num_primes));

		cuda::memcpy(new_primes, primes, cuda::array_sizeof<PrimeType>(num_primes));

		{
			const std::pair<dim3, dim3> dims = cuda::get_dimensions(num_counts);

			sieve_to_primes_kernel<PrimeType> <<<dims.first, dims.second >>> (
				&new_primes.get()[num_primes],
				counts.get(),
				num_counts,
				sieve_entries_per_count,
				sieve.get(),
				sieve_size,
				sieve_base
				);
		}

#ifdef DEBUGGING
		std::vector<PrimeType> new_primes_local(new_num_primes);
		cuda::memcpy(new_primes_local.data(), new_primes, cuda::array_sizeof<PrimeType>(new_num_primes));
#endif

		primes = std::move(new_primes);
		num_primes = new_num_primes;
	}
	
	return std::make_pair(std::move(primes), num_primes);
}

int main()
{
	cuda::session session;

	typedef uint64_t PrimeType;
	PrimeType num_primes;
	cuda::ptr<PrimeType> primes;
	
	std::tie(primes, num_primes) = sieve_primes<PrimeType>(2000000);

	PrimeType sum = cuda::sum<PrimeType>(primes, (size_t)num_primes);

	return 0;
}

cuda_stdlib.h
So I don’t have to write as much boilerplate to allocate memory.

#ifndef CUDA_STDLIB_H
#define CUDA_STDLIB_H

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <vector>

namespace cuda {

	void report_error(const char *str, const cudaError_t error) {
		std::cerr << str << cudaGetErrorString(error);
	}

	cudaError_t maybe_report_error(const char *str, const cudaError_t error) {
		if (error != cudaSuccess) {
			report_error(str, error);
		}
		return error;
	}

	struct session {
		bool v = false;
	public:
		session(int device = 0) {
			v = (maybe_report_error("Couldn't select CUDA device (doesn't exist?): ",
				cudaSetDevice(device)) == cudaSuccess);
		}

		bool valid() {
			return v;
		}

		~session() {
			// cudaDeviceReset must be called before exiting in order for profiling and
			// tracing tools such as Nsight and Visual Profiler to show complete traces.
			maybe_report_error("Couldn't reset CUDA: ", cudaDeviceReset());
		}
	};

	template <typename T> struct ptr {
	private:
		T *p = nullptr;
	public:
		ptr(size_t size = sizeof(T)) {
			if (maybe_report_error("Couldn't malloc on device: ",
				cudaMalloc(&p, size)) != cudaSuccess) {
				p = nullptr;
			}
		}

		ptr(ptr<T> &&rhs) {
			p = rhs.p;
			rhs.p = nullptr;
		}

		ptr(const std::vector<T> &init) {
			if (maybe_report_error("Couldn't malloc on device: ",
				cudaMalloc(&p, init.size() * sizeof(T))) != cudaSuccess) {
				p = nullptr;
				return;
			}
			if (maybe_report_error("Couldn't memcpy to device: ",
				cudaMemcpy(p, init.data(), init.size() * sizeof(T), cudaMemcpyHostToDevice)) != cudaSuccess) {
				cudaFree(p);
				p = nullptr;
			}
		}

		ptr<T> &operator=(ptr<T> &&rhs) {
			if (p) {
				cudaFree(p);
				p = nullptr;
			}
			p = rhs.p;
			rhs.p = nullptr;
			return *this;
		}

		~ptr() {
			if (p) {
				cudaFree(p);
				p = nullptr;
			}
		}

		T *get() {
			return p;
		}

		const T *get() const {
			return p;
		}
	};

	template <typename T> cudaError_t memcpy(T *dst, const ptr<T> &src, size_t size) {
		return maybe_report_error("Couldn't memcpy from device: ",
			cudaMemcpy(dst, src.get(), size, cudaMemcpyDeviceToHost));
	}

	template <typename T> cudaError_t memcpy(ptr<T> &dst, const T *src, size_t size) {
		return maybe_report_error("Couldn't memcpy to device: ",
			cudaMemcpy(dst.get(), src, size, cudaMemcpyHostToDevice));
	}

	template <typename T> cudaError_t memcpy(ptr<T> &dst, const ptr<T> &src, size_t size) {
		return maybe_report_error("Couldn't memcpy on device: ",
			cudaMemcpy(dst.get(), src.get(), size, cudaMemcpyDeviceToDevice));
	}

	template <typename T>
	size_t array_sizeof(size_t num_items) {
		return sizeof(T) * num_items;
	}

	const size_t threads_per_block = 512;

	std::pair<dim3, dim3> get_dimensions(size_t num_threads) {
		const dim3 blockDim = num_threads > threads_per_block ? (unsigned int)((num_threads + threads_per_block - 1) / threads_per_block) : 1;
		const dim3 threadDim = num_threads > threads_per_block ? threads_per_block : (unsigned int)num_threads;
		return std::make_pair(blockDim, threadDim);
	}

	__device__ size_t get_thread_id() {
		return ((size_t)blockIdx.x * (size_t)blockDim.x) + threadIdx.x;
	}
}

#endif //CUDA_STDLIB_H

cuda_sum.h
This in particular is a WIP… I intend to make this a parallel sum but I have a placeholder for debugging purposes.

#ifndef CUDA_SUM_H
#define CUDA_SUM_H

#include "cuda_stdlib.h"

template <typename T> __global__ void sum_kernel(T *work, uint8_t *overflow, const T *inputs, size_t input_size) {
	size_t tid = cuda::get_thread_id();

	if (tid == 0) {
		T sum = 0;
		*overflow = 0;

		for (size_t i = 0; i < input_size; ++i) {
			T new_sum = sum + inputs[i];
			if (new_sum < sum) {
				*overflow = 1;
			}
			sum = new_sum;
		}

		work[0] = sum;
	}
}


namespace cuda {

	template <typename T> T sum(const ptr<T> &inputs, const size_t num_inputs) {
		ptr<T> work(array_sizeof<T>(num_inputs));
		ptr<uint8_t> overflow(sizeof(uint8_t));

		std::pair<dim3, dim3> dimensions = get_dimensions(num_inputs);

		sum_kernel <<<dimensions.first, dimensions.second>>> (work.get(), overflow.get(), inputs.get(), num_inputs);

		T ret;
		cuda::memcpy(&ret, work, sizeof(T));

		uint8_t overflow_local = 0;
		cuda::memcpy(&overflow_local, overflow, sizeof(uint8_t));

		return ret;
	}

}

#endif

Please find the full log below, including all the other Nsight messages during startup.

CUDA context created : 1c1bfc834a0
CUDA module loaded:   1c1cd097730 euler10.cu
CUDA grid launch failed: CUcontext: 1931657884832 CUmodule: 1931880265520 Function: _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_
CUDA context created : 19d8a34a940
CUDA module loaded:   19d9f0a07d0 euler10.cu
CUDA grid launch failed: CUcontext: 1776140200256 CUmodule: 1776489727952 Function: _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_
CUDA context created : 17e40d85de0
CUDA module loaded:   17e4e36d620 euler10.cu
CUDA grid launch failed: CUcontext: 1641765428704 CUmodule: 1641989723680 Function: _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_
CUDA context created : 23526a25540
CUDA module loaded:   23533e05bd0 euler10.cu
CUDA grid launch failed: CUcontext: 2427304695104 CUmodule: 2427526863824 Function: _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_
CUDA context created : 1d8b63738d0
CUDA module loaded:   1d8c3895130 euler10.cu
================================================================================
CUDA Memory Checker detected 32 threads caused an access violation:
Launch Parameters
    CUcontext    = 1d8b63738d0
    CUstream     = 1d8b8423270
    CUmodule     = 1d8c3895130
    CUfunction   = 1d8cb292c60
    FunctionName = _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_
    GridId       = 31
    gridDim      = {3779,1,1}
    blockDim     = {512,1,1}
    sharedSize   = 256
    Parameters:
        sieve_base = 65537
        sieve = 0x0000000209a00000  1 ''
        sieve_size = 1934464
        primes = 0x0000000209818000  2
        num_primes = 6542
    Parameters (raw):
         0x00010001 0x00000000 0x09a00000 0x00000002
         0x001d8480 0x00000000 0x09818000 0x00000002
         0x0000198e 0x00000000
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                                          PC  Source
-----------------------------------------------------------------------------------------------------------------------------------
 209820590     8    mis ld    g          36     224         {36,0,0}  {224,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     225         {36,0,0}  {225,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     226         {36,0,0}  {226,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     227         {36,0,0}  {227,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     228         {36,0,0}  {228,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     229         {36,0,0}  {229,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     230         {36,0,0}  {230,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     231         {36,0,0}  {231,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     232         {36,0,0}  {232,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     233         {36,0,0}  {233,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     234         {36,0,0}  {234,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     235         {36,0,0}  {235,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     236         {36,0,0}  {236,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     237         {36,0,0}  {237,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     238         {36,0,0}  {238,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     239         {36,0,0}  {239,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     240         {36,0,0}  {240,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     241         {36,0,0}  {241,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     242         {36,0,0}  {242,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     243         {36,0,0}  {243,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     244         {36,0,0}  {244,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     245         {36,0,0}  {245,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     246         {36,0,0}  {246,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     247         {36,0,0}  {247,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     248         {36,0,0}  {248,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     249         {36,0,0}  {249,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     250         {36,0,0}  {250,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     251         {36,0,0}  {251,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     252         {36,0,0}  {252,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     253         {36,0,0}  {253,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     254         {36,0,0}  {254,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
 209820590     8    mis ld    g          36     255         {36,0,0}  {255,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30

Summary of access violations:
c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu(30): error MemoryChecker: #misaligned=32  #invalidAddress=0
================================================================================

Memory Checker detected 32 access violations.
error = misaligned load (global memory)
gridid = 31
blockIdx = {36,0,0}
threadIdx = {224,0,0}
address = 0x209820590
accessSize = 8

The contents of local variables at the first site that the Nsight debugger showed are

@flatBlockIdx	36	long
		@flatThreadIdx	224	long
+		_ZTVN10__cxxabiv117__class_type_infoE	0x0000000209400000  {0}	const long long[1] __device__
+		_ZTVN10__cxxabiv120__si_class_type_infoE	0x0000000209400100  {0}	const long long[1] __device__
+		_ZTVSt9exception	0x0000000209400200  {0, 0, 0, 0, ...}	const long long[5] __device__
+		_ZTVSt13runtime_error	0x0000000209400300  {0, 0, 0, 0, ...}	const long long[5] __device__
+		_ZTVSt14error_category	0x0000000209400400  {0, 0, 0, 0, ...}	const long long[9] __device__
+		_ZTVSt13_System_error	0x0000000209400500  {0, 0, 0, 0, ...}	const long long[5] __device__
+		_ZTVSt12system_error	0x0000000209400600  {0, 0, 0, 0, ...}	const long long[5] __device__
+		_ZTVSt23_Generic_error_category	0x0000000209400700  {0, 0, 0, 0, ...}	const long long[9] __device__
+		_ZTVSt24_Iostream_error_category	0x0000000209400800  {0, 0, 0, 0, ...}	const long long[9] __device__
+		_ZTVNSt8ios_base7failureE	0x0000000209400900  {0, 0, 0, 0, ...}	const long long[5] __device__
+		threadIdx	{x = 224, y = 0, z = 0}	const uint3
+		blockIdx	{x = 36, y = 0, z = 0}	const uint3
+		blockDim	{x = 512, y = 1, z = 1}	const dim3
+		gridDim	{x = 3779, y = 1, z = 1}	const dim3
		@gridId	31	const long long
		candidate_divisor	'candidate_divisor' has no value at the target location.	
		remainder	'remainder' has no value at the target location.	
		prime_idx	4274	unsigned long long
		sieve_offset	18656	unsigned long long
		candidate	84193	unsigned long long
		is_prime	false	bool
		sieve_base	65537	__parameter__ unsigned long long
+		sieve	0x0000000209a00000  1 ''	__device__ unsigned char* __parameter__
		sieve_size	1934464	__parameter__ unsigned long long
+		primes	0x0000000209818000  2	__device__ const unsigned long long* __parameter__
		num_primes	6542	__parameter__ unsigned long long

In particular, 0x209820590 is 34192 bytes into the primes array, which is to say primes[4274] (which agrees with the value shown for prime_idx and is less than num_primes).

In case you suspect an out-of-bounds access, lines 198-199 and 222-223 in euler10.cu show that primes is always adequately sized to hold num_primes elements. I’ve found and fixed other out-of-bounds accesses with the help of the memory checker; I know it reports those correctly.

Any chance these are really out-of-bounds accesses that are mis-reported as misaligned accesses?

Thanks for your reply, njuffa! I think it’s helped me narrow down the odd behavior. The short answer is:

  • Yes, I'm sure I allocated enough space in primes (see below for how I enforced that);
  • However, I get inconsistent errors. Now I get an "access violation";
  • The faulting address appears to have an odd truncation, which might be the problem!

The long answer:

I looked into your question and added some guards. Specifically, I ensured that the size of the primes allocation was always stored with the allocation itself, by modifying the ptr class. Now its definition is as follows. Note the built-in error checking (which was there before) and the new size and count methods.

template <typename T> struct ptr {
	private:
		T *p = nullptr;
		size_t s;
	public:
		ptr(size_t size = sizeof(T)) : s(size) {
			if (maybe_report_error("Couldn't malloc on device: ",
				cudaMalloc(&p, size)) != cudaSuccess) {
				p = nullptr;
				s = 0;
			}
		}

		ptr(ptr<T> &&rhs) {
			p = rhs.p;
			s = rhs.s;
			rhs.p = nullptr;
			rhs.s = 0;
		}

		ptr(const std::vector<T> &init) {
			s = init.size() * sizeof(T);
			if (maybe_report_error("Couldn't malloc on device: ",
				cudaMalloc(&p, s)) != cudaSuccess) {
				p = nullptr;
				s = 0;
				return;
			}
			if (maybe_report_error("Couldn't memcpy to device: ",
				cudaMemcpy(p, init.data(), s, cudaMemcpyHostToDevice)) != cudaSuccess) {
				cudaFree(p);
				s = 0;
				p = nullptr;
			}
		}

		ptr<T> &operator=(ptr<T> &&rhs) {
			if (p) {
				cudaFree(p);
				p = nullptr;
				s = 0;
			}
			p = rhs.p;
			s = rhs.s;
			rhs.p = nullptr;
			rhs.s = 0;
			return *this;
		}

		~ptr() {
			if (p) {
				cudaFree(p);
				p = nullptr;
				s = 0;
			}
		}

		T *get() {
			return p;
		}

		const T *get() const {
			return p;
		}

		size_t size() const {
			return s;
		}

		size_t count() const {
			return s / sizeof(T);
		}
	};

This means that the call to the kernel now looks like this

sieve_kernel<PrimeType> <<<dims.first, dims.second>>>(
				sieve_base,      // sieve_base
				sieve.get(),     // sieve
				sieve_size,      // sieve_size
				primes.get(),    // primes
				primes.count()); // num_primes

So you see that the inner loop of the kernel

for (size_t prime_idx = 0;
		prime_idx < num_primes;
		++prime_idx) {
		const PrimeType candidate_divisor = primes[prime_idx]; // line 30, the fault is here
		const PrimeType remainder = candidate % candidate_divisor;
		is_prime &= (remainder != 0);
	}

should not walk outside of the array.

I tried another run with this setup, and got a new log.

CUDA context created : 2080037e100
CUDA module loaded:   2080bbb58e0 euler10.cu
CUDA grid launch failed: CUcontext: 2233386656000 CUmodule: 2233579821280 Function: _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_
CUDA context created : 26a18830680
CUDA module loaded:   26a26203da0 euler10.cu
CUDA grid launch failed: CUcontext: 2654701028992 CUmodule: 2654929436064 Function: _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_
CUDA context created : 1d945982eb0
CUDA module loaded:   1d950d36a10 euler10.cu
CUDA grid launch failed: CUcontext: 2032687132336 CUmodule: 2032875563536 Function: _Z12sieve_kernelIyEvT_PhS0_PKS0_S0_
CUDA context created : 1c72f90da80
CUDA module loaded:   1c73d1021a0 euler10.cu
================================================================================
CUDA Memory Checker detected 31 threads caused an access violation:
Launch Parameters
    CUcontext    = 1c72f90da80
    CUstream     = 1c7319a1150
    CUmodule     = 1c73d1021a0
    CUfunction   = 1c7449bdb90
    FunctionName = _Z12sieve_kernelIyEvT_PhS0_PKS0_y
    GridId       = 31
    gridDim      = {3779,1,1}
    blockDim     = {512,1,1}
    sharedSize   = 256
    Parameters:
        sieve_base = 65537
        sieve = 0x0000000b09600000  1 ''
        sieve_size = 1934464
        primes = 0x0000000b09818000  2
        num_primes = 6542
    Parameters (raw):
         0x00010001 0x00000000 0x09600000 0x0000000b
         0x001d8480 0x00000000 0x09818000 0x0000000b
         0x0000198e 0x00000000
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx                                        PC  Source
---------------------------------------------------------------------------------------------------------------------------------
  09824ae8     8    adr ld    g           6     385          {6,0,0}  {385,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     386          {6,0,0}  {386,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     387          {6,0,0}  {387,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     388          {6,0,0}  {388,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     389          {6,0,0}  {389,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     390          {6,0,0}  {390,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     391          {6,0,0}  {391,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     392          {6,0,0}  {392,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     393          {6,0,0}  {393,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     394          {6,0,0}  {394,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     395          {6,0,0}  {395,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     396          {6,0,0}  {396,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     397          {6,0,0}  {397,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     398          {6,0,0}  {398,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     399          {6,0,0}  {399,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     400          {6,0,0}  {400,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     401          {6,0,0}  {401,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     402          {6,0,0}  {402,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     403          {6,0,0}  {403,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     404          {6,0,0}  {404,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     405          {6,0,0}  {405,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     406          {6,0,0}  {406,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     407          {6,0,0}  {407,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     408          {6,0,0}  {408,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     409          {6,0,0}  {409,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     410          {6,0,0}  {410,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     411          {6,0,0}  {411,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     412          {6,0,0}  {412,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     413          {6,0,0}  {413,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     414          {6,0,0}  {414,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30
  09824ae8     8    adr ld    g           6     415          {6,0,0}  {415,0,0}  _Z12sieve_kernelIyEvT_PhS0_PKS0_y+000630  c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu:30

Summary of access violations:
c:\users\spyffe\documents\visual studio 2015\projects\projecteuler\euler10\euler10.cu(30): error MemoryChecker: #misaligned=0  #invalidAddress=32
================================================================================

Memory Checker detected 31 access violations.
error = access violation on load (global memory)
gridid = 31
blockIdx = {6,0,0}
threadIdx = {385,0,0}
address = 0x09824ae8
accessSize = 8

Note that if not for this odd truncation the failing address (0x09824ae8, which I assume is truncated from 0xb09824ae8) minus primes (0xb09818000) would be 0xcae8 – divided by sizeof(uint64_t) and converted to decimal, we get prime_idx or 6493, which is less than num_primes (6542).

What’s odd is that this truncation doesn’t always take place. Observe in the log from my first post, the faulting address is reported as 0x209820590, which is larger than 32 bits!

(I did consider that my code could be the one doing the truncating, but the log from this run shows primes = 0x0000000b09818000 in the argument list, which is not truncated.)