Error: "unaligned memory accesses not supported"?

I get this error when I try to use some shared memory:

extern __shared__ char shared_mem[];
unsigned long** sptr = (unsigned long**)shared_mem;
sptr[0] = NULL;

I have a lot more code, but this suffices to trigger the error. What is this error “unaligned memory accesses not supported” and why do I get it? Even more strangely, if I do the following, the kernel compiles and runs.

if (threadIdx.x == 0) sptr[threadIdx.x] = NULL;

Even more strangely…my real-life problem: I have implemented a small subset of std::vector as bit-vectors, which I want to put into shared memory for fast (read) access. However, for design reasons I cannot modify the backend that uses these values, so the change has to be transparent as far as the calling functions are concerned. What I have to do is copy all bit-vectors to shared memory, back up the pointers, let the pointers point to shared memory and, after the function is done, put the backed-up pointers back in their original locations.

So the kernel code is something like . Now the very, very strange thing is what happens when I restore more than a half-warp's worth of bit-vectors. If the second threadIdx.x bound is > 15 then the launch fails… even if I use only 16 threads and have each thread back up/restore two elements with an offset of 16. And, as above, if I directly do the backup/restore with indexes, the kernel doesn’t compile:

if (threadIdx.x < VECTOR_COUNT) sptr[threadIdx.x] = structure[threadIdx.x];

if (threadIdx.x < 16) structure[threadIdx.x] = sptr[threadIdx.x];
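The backup/restore pattern described above might look roughly like the sketch below. This is my own reconstruction under stated assumptions (VECTOR_COUNT vectors of `words_per_vec` 32-bit words each; all names hypothetical), not the poster's actual kernel:

```cuda
__global__ void remap_to_shared(uint32_t** structure, uint32_t words_per_vec)
{
    extern __shared__ uint32_t shared_pool[];
    uint32_t* backup = NULL;

    /* 1. Back up this thread's global pointer and copy its vector
          into shared memory. */
    if (threadIdx.x < VECTOR_COUNT) {
        backup = structure[threadIdx.x];
        for (uint32_t w = 0; w < words_per_vec; ++w)
            shared_pool[threadIdx.x * words_per_vec + w] = backup[w];
        /* 2. Redirect the pointer so the unmodified backend reads
              from shared memory transparently. */
        structure[threadIdx.x] = &shared_pool[threadIdx.x * words_per_vec];
    }
    __syncthreads();

    /* ... call the unmodified backend here ... */

    __syncthreads();
    /* 3. Restore the original global-memory pointers. */
    if (threadIdx.x < VECTOR_COUNT)
        structure[threadIdx.x] = backup;
}
```

Note that step 2 makes a global-memory pointer array point into shared memory, which is exactly the shared/global pointer aliasing the replies below warn about.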

My bit-vector is as follows (heavily borrowed from std::vector, and only the minimum is supported). It is a small iterator, a length value and a pointer to a large pool where all bit-vectors are sequentially stored, into which we index:

template <> class myIterator<bool> {
public:
	typedef std::forward_iterator_tag iterator_category;
	typedef uint32_t          value_type;
	typedef uint32_t          size_type;
	typedef value_type*       pointer;
	typedef myIterator<bool>  iterator;

	enum {VBYTES = sizeof(value_type), VBITS = 8 * VBYTES};

	__device__ myIterator() {}
	__device__ myIterator(pointer val): _myval(val), _offset(0) {}

	__device__ inline operator bool() const {return (*_myval & _mask()) != 0;}

	__device__ inline iterator operator=(bool val) {
		if (val)
			*_myval |= _mask();
		else
			*_myval &= ~_mask();
		return *this;
	}

	__device__ inline iterator operator+(size_type off) {
		_offset += off;
		_myval  += _offset / VBITS;
		_offset %= VBITS;
		return *this;
	}

	__device__ inline size_type _mask() const {return (size_type)(1 << _offset);}

	size_type _offset;
	pointer _myval;
};

template<> class mycudaArray<bool>: public myIterator<bool> {
public:
	typedef bool              value_type;
	typedef myIterator<bool>  iterator;
	typedef myIterator<bool>  const_iterator;
	typedef uint32_t*         pointer;
	typedef uint32_t          size_type;

	mycudaArray(size_type len = 0, value_type* data = NULL): len(len), data(reinterpret_cast<pointer>(data)) {}

	size_type len;
	pointer data; /* contiguous data array, should reside on GPU */

	__device__ inline size_type size() const {return len;}
	__device__ inline bool empty() {return data == NULL;}
	__device__ inline iterator operator[](size_type i) {return begin() + i;}
	__device__ inline const_iterator operator[](size_type i) const {return begin() + i;}
	__device__ inline iterator begin() {return iterator(data);}
	__device__ inline const_iterator begin() const {return const_iterator(data);}
	__host__ __device__ static inline size_type size_of(size_type i) {return (i / 8) + ((i % 8) != 0);}
};
Note: Obviously we cannot dynamically allocate vectors of any size on the device, but luckily, once they’re set up, these vectors remain the same length. So I scan all bit-vectors, allocate memory for them in a single block, copy over the values and use that.

Can someone shed light on my problem? :ph34r: What the hell is going on?

Just declare the shared_mem as

extern __shared__ long shared_mem[];

This sounds like trouble, because mixing pointers to shared memory and pointers to global memory can cause the compiler to assume a pointer refers to global memory. Dereferencing a shared-memory address through such a pointer will then cause problems. This may or may not be your problem.

This may also be a problem:

__device__ inline iterator operator=(bool val) {
	if (val)
		*_myval |= _mask();
	else
		*_myval &= ~_mask();
	return *this;
}

You may have a race condition when updating *_myval, since multiple threads can be modifying a single element. Even though they use different indexes, those indexes translate into the same underlying element.

Yes, thanks, I did that. The compile error went away; the kernel still failed, however. Though if I did not index the shared memory directly but through threadIdx.x, as posted before, it compiles :wacko:

Jamie, thank you very much for your reply. I did not even realise this potential race-condition problem; so focused was I on getting a smaller size which’ll fit in shared memory. However, as this’ll be used only for reading and not writing, I can assume it’ll work correctly. Even if it worked incorrectly, it shouldn’t crash, right? I am assuming again, but by the time it starts to read data the compiler knows it’s pointing to shared memory… but who knows.

Though I now have to revisit my design decisions, as writing is potentially a big hazard; is guarding with atomic instructions, etc. worth the space saved in global memory? I’m thinking not, but we’ll see.

However, for interested parties: I spent all of yesterday figuring out what the actual problem was, “debugging” the kernel as well as possible. Really annoying not to be able to debug on the GPU; emulation mode is not the same, but that’s beside the point :)

So. Indeed, the crash happens when the bit-vector is reassigned to shared memory and a value there is accessed. As I seem to have found something pretty interesting, actually getting the kernel NOT to crash, I’ll post code snippets below.

The original code is shown below. We access the third bit of the fourth element of a structure, after reassigning the data to shared memory:

structure[4] = &shared[0];

bool test = structure[4].fires[3];

Internally, operator[] returns begin() + index, which is an iterator. The operator+ offsets from the base, and finally, since we store the result in a bool, operator bool() returns the actual boolean value. See my original post, or wade through std::vector :).

Now let’s see what doesn’t crash:

structure[4] = &shared[0];

myIterator<bool> it(structure[4]);

bool test = it.operator+(3);

Yes, :blink: mind-boggling, isn’t it? I make a few steps explicit and it works?? Replace begin() with creating a temporary iterator, call the + operator, and all’s fine? I really wonder what is going on here. Now, even more surprisingly, I actually thought I was done, so I put this code back into the operator, and voilà, crash again… ookay :w00t: Let’s create a custom function inside mycudaArray:

__device__ inline bool get(size_type i) const {
	iterator it(data);
	bool ret = it.operator+(i);
	return ret;
}

bool test = structure[4].fires.get(3);

Yes, we crash again :). I think I’m going to give up. I am starting to suspect NVCC compiler problems, not-fully-supported C++ device code, etc. Either way, this bit-vector is full of race conditions, so for anything other than reading it’ll be too expensive to use. I love wasting a few days on these things :P


You can’t have the same pointer point to shared memory as well as global memory at different points in time. It will always confuse the compiler. It is better to stay away from that.

Try this trick, if you would like:

  1. Declare a volatile pointer, like
volatile int *p;
  2. Assign it explicitly to a shared memory variable… say
__shared__ volatile int k; p = &k;

This is a hint you give the compiler that p will actually point to shared memory in the code that follows. So do this before assigning p to something in shared memory. For such accesses, the compiler will (may) generate shared-memory access code.

Similarly, do the same before accessing global memory.

We are just giving clues to the compiler. I have seen such a thing work before. Check it out.
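Put together, the two steps might look like the sketch below. This is my reading of the trick (names are mine), and there is no guarantee every compiler version honours the hint:

```cuda
__global__ void hint_example(int* gptr)
{
    __shared__ volatile int k;
    volatile int* p;

    p = &k;     /* assign a known shared-memory address first...       */
    *p = 1;     /* ...so the compiler (hopefully) emits a shared store */

    p = gptr;   /* same pattern before the global-memory accesses      */
    *p = 1;     /* ...and this should become a global store            */
}
```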

Thank you, Sarnath, for the clues. I tried using volatile, but still no luck. It seems the code is too “complex” for the compiler to figure out :). A very simple example worked… but there I did not even have to use the volatile keyword.

Oops… However, I feel we can solve it if we fool around a bit with the code :-)


We have been asking nVIDIA for “pointer” qualification for a long time…

Where a pointer resides and what type of memory it points to are two different things. And the latter can keep varying at run-time.

So, we need a dynamic way to specify where a pointer points, to avoid clumsy stuff like this.