atomicCAS() on two fields at once

akavo · July 18, 2011, 3:18pm

I need to do atomic compare and swap operations on two fields at once, a pointer and a boolean.

Sort of like what AtomicMarkableReference offers in Java.

However, I am well aware that CUDA does not provide any atomic multi CAS operations.

One way to get around this is to use the last bit of the pointer as a mark bit assuming that it is unused because the pointers to allocated memory are aligned in a certain way.

Here is what it says in a paper that I was reading:

In many modern architectures, a 32-bit word that stores a pointer has two unused bits. One of those can be used to store the mark bit....

So my questions are:

Is this safe to assume for CUDA too? If not, can I use the atomicCAS(long long…) version to accomplish this?
How can I use bitwise operations to set the last bit of a 32-bit pointer to 0 or 1? I haven’t used them before so I’m kind of lost.

Thanks for any suggestions.

akavo · July 18, 2011, 3:47pm

OK, so I looked at a couple of things about bitwise operations.

So can someone confirm this is correct:
Given a pointer (ptr), ptr XOR 1 would set the last bit to 0;
ptr OR 1 would set the last bit to 1.

tera · July 18, 2011, 4:12pm

WARNING: dirty hacks ahead!

It’s probably cleanest to use atomicCAS(long long) on a structure containing the pointer and the boolean; however, that would limit your code to 32-bit mode, because the pointer and the boolean together have to fit into 64 bits.

Pointers in current versions of CUDA are always aligned, so that trick should work. Pre-Fermi hardware even masks off the unused lowest bits (I can’t remember off the top of my head what Fermi does), although I would recommend explicitly masking off the bits before using the variable as a pointer.

You’ll have to cast the pointer to size_t before being able to apply bitops. Note that by casting to size_t and back the nature of the pointer will get lost and the compiler will always assume it points to global memory. If you want to avoid that, you can cast the pointer to char*, add or subtract as necessary to change the relevant bits, and then cast back to the original pointer type.

Please don’t take this info as a recommendation.

tera · July 18, 2011, 4:15pm

ptr XOR 1 only clears the least significant bit if you know it was set before. ptr AND ~1ul works in the general case.
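
A minimal sketch of these bit operations in CUDA C++, for illustration only. The Node type and the helper names (with_mark, without_mark, is_marked, toggle_mark) are hypothetical, and the HD macro is only there so the same helpers also build as plain host C++. The mask is built from std::uintptr_t so that it is as wide as the pointer; a literal such as ~1ul would be too narrow on platforms where unsigned long is 32 bits but pointers are 64 bits.

```cpp
#include <cstdint>

// Compile the helpers for host and device under nvcc, and as plain C++ otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical list node; its alignment leaves the lowest pointer bit free.
struct Node {
    int   key;
    Node* next;
};

// Set the mark: OR with 1 forces the lowest bit to 1.
HD inline Node* with_mark(Node* p) {
    return reinterpret_cast<Node*>(
        reinterpret_cast<std::uintptr_t>(p) | std::uintptr_t{1});
}

// Clear the mark: AND with ~1 forces the lowest bit to 0, whatever it was before.
HD inline Node* without_mark(Node* p) {
    return reinterpret_cast<Node*>(
        reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t{1});
}

// Test the mark.
HD inline bool is_marked(Node* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & std::uintptr_t{1}) != 0;
}

// XOR with 1 only toggles the bit, so it clears the mark only if the mark was set.
HD inline Node* toggle_mark(Node* p) {
    return reinterpret_cast<Node*>(
        reinterpret_cast<std::uintptr_t>(p) ^ std::uintptr_t{1});
}
```

Only the value returned by without_mark() is meant to be dereferenced; the marked value is just a tagged word that happens to have pointer type.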

akavo · July 18, 2011, 5:24pm

Of course, I’ll experiment with this and won’t take it with blind faith. However, I’m trying to think about the details and your tips are very helpful.

So how would I use atomicCAS(long long) with a struct? Should I just make a struct{unsigned long ptr, unsigned long mark} and typecast it to long long? I’d still like to try the 32-bit version of atomicCAS if possible, because I’m worried about the performance of the 64-bit version. This operation will be called fairly often and it’ll be the linearization point, so I’d like to remove any dead weight.
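
One possible reading of the struct idea, as a sketch rather than a recommendation: pack the two fields into a single 64-bit word by hand instead of type-punning a struct. The 64-bit overload of atomicCAS() takes unsigned long long operands. The names pack, packed_ptr, packed_mark and cas_pair, and the layout (pointer value in the low 32 bits, mark in bit 32), are assumptions made for illustration. The scheme only applies when device pointers fit in 32 bits.

```cpp
#include <cstdint>

// Compile the helpers for host and device under nvcc, and as plain C++ otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Assumed layout: bits 0..31 hold the 32-bit pointer value, bit 32 holds the mark.
HD inline unsigned long long pack(std::uint32_t ptr_bits, bool mark) {
    return (static_cast<unsigned long long>(mark) << 32) | ptr_bits;
}

HD inline std::uint32_t packed_ptr(unsigned long long word) {
    return static_cast<std::uint32_t>(word);  // keep the low 32 bits
}

HD inline bool packed_mark(unsigned long long word) {
    return (word >> 32) != 0;
}

#ifdef __CUDACC__
// Succeeds only if both the pointer and the mark still equal the expected pair.
__device__ inline bool cas_pair(unsigned long long* addr,
                                std::uint32_t expected_ptr, bool expected_mark,
                                std::uint32_t new_ptr, bool new_mark) {
    unsigned long long expected = pack(expected_ptr, expected_mark);
    unsigned long long desired  = pack(new_ptr, new_mark);
    // atomicCAS returns the old value; equality with `expected` means the swap happened.
    return atomicCAS(addr, expected, desired) == expected;
}
#endif
```

Because the comparison covers the whole 64-bit word, a concurrent change to either field makes the swap fail, which is the behaviour AtomicMarkableReference provides in Java.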

Thanks for the tip.

akavo · July 18, 2011, 5:26pm

Oh and another thing. Am I correct to assume that such a pointer with the least significant bit set to 1 should not be dereferenced or it will point to the wrong location?

tera · July 18, 2011, 6:02pm

Better not to dereference it. It actually works with the expected result on pre-Fermi, but I can’t recall whether Fermi’s better error-checking catches the misaligned access or not.

tera · July 18, 2011, 6:04pm

Future hardware might well just perform the misaligned access, which would of course spoil the whole scheme. So better clear the bits before dereferencing the pointer.
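
A sketch of how the marked pointer might be used with that rule in place, assuming a 64-bit build in which each link is stored as a raw unsigned long long word. Link, Node, make_link, link_marked, link_target and try_mark are hypothetical names. The CAS compares the whole word, so it fails if either the pointer or the mark has changed, and readers go through link_target() so the mark is always stripped before a dereference.

```cpp
#include <cstdint>

// Compile the helpers for host and device under nvcc, and as plain C++ otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

struct Node;

// A link is a raw 64-bit word: pointer bits plus the mark in bit 0.
typedef unsigned long long Link;
static_assert(sizeof(Node*) == sizeof(Link), "sketch assumes 64-bit pointers");

// Hypothetical list node.
struct Node {
    int  key;
    Link next;
};

// Build a link from an unmarked (aligned) pointer and a mark flag.
HD inline Link make_link(Node* p, bool mark) {
    return reinterpret_cast<std::uintptr_t>(p) | (mark ? 1ull : 0ull);
}

HD inline bool link_marked(Link l) {
    return (l & 1ull) != 0;
}

// Strip the mark; only the result of this function should be dereferenced.
HD inline Node* link_target(Link l) {
    return reinterpret_cast<Node*>(static_cast<std::uintptr_t>(l & ~1ull));
}

#ifdef __CUDACC__
// Atomically set the mark on node->next, but only if the link still points
// at expected_succ and is still unmarked.
__device__ inline bool try_mark(Node* node, Node* expected_succ) {
    Link expected = make_link(expected_succ, false);
    Link desired  = make_link(expected_succ, true);
    return atomicCAS(&node->next, expected, desired) == expected;
}
#endif
```

Storing the link as an integer word rather than as a Node* makes it harder to dereference a marked value by accident, at the cost of a cast on every read.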

akavo · July 18, 2011, 6:24pm

When you point out the differences between pre-Fermi and Fermi, it naturally leads to a good point, which is that it would be unwise to rely on hardware behavior that might be subject to change. So I’d better reset the bit to 0.

Thanks for your help.