Concurrent writing to a global variable

Hello,

I am working on a program to compute the minimal distance of a 3D object (represented by a sphere tree) to a point cloud (represented as a 1D array of points).
I decided to parallelize over the number of points, so each thread runs a device function that computes the smallest distance from the sphere tree to its point. The tree lets me check the distance of inner nodes against the global minimum, so I can discard huge chunks of data without checking everything.

To avoid the side effects of concurrent writes to the global minimum, I decided to introduce a global lock variable (since atomicCAS does not work on float and double). It is set while a thread compares the distance from its point to a leaf and then updates the global minimum if the new value is smaller.

Sadly, my lock mechanism always ends up in an infinite loop, which crashes my computer.

This is how I implemented the locking and unlocking:
(m_dev_lock is a global device int initialized with 0)

__device__ void dev_lockGlobal() {
	/* Spin until this thread atomically swaps the lock from 0 to 1 */
	while(atomicCAS(&m_dev_lock, 0, 1) != 0);
}

__device__ void dev_unlockGlobal() {
	/* Release the lock by writing 0 back */
	atomicExch(&m_dev_lock, 0);
}

The device function that uses the lock:

__device__ void ParallelDistance::dev_minDistanceNode_recursive(SphereNode *node, const Vec3f &point) {
	/* Current node is a leaf */
	if(node->m_isLeaf) {
		double new_dist = dual_calcMinDistance(node, point);

		/* Lock */
		dev_lockGlobal();

		/* > Critical Section */
		/* If this leaf is nearer, its distance becomes the new global minimum */
		if(new_dist < m_dev_minimum_global) {
			m_dev_minimum_global = new_dist;
		}

		/* Release */
		dev_unlockGlobal();
	}

	/* Inner node: recursion into the children, omitted in this post */
}

The device function is called from this kernel:

__global__ void kernel_minDistanceTree(const SphereTree tree, const Vec3f *points, const unsigned int point_count) {
	/* Declarations, static allocation */
	unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

	/* If too many threads were started, the excess ones exit here */
	if(tid < point_count) {
		/* This grid-stride loop lets fewer threads than points process all points */
		do {
			/* Calculate the minimal distance */
			ParallelDistance::dev_minDistanceNode_recursive(tree.m_dev_rootNode, points[tid]);

			/* Increment step */
			tid += blockDim.x * gridDim.x;
		} while(tid < point_count);
	}
}

Does anyone have a solution to my problem?
I’m not even sure if global memory is the right way to go, but it seems better than shared memory to me, since many tree branches can be discarded if all threads see the smallest distance found by every other block. But that wouldn’t solve the underlying problem anyway, since there would still be concurrent writes to a double.

I use the same locking/unlocking code, except that I explicitly cudaMalloc my global spinlock and cudaMemset it to 0 after allocation. I do not need the address-of operator & in atomicCAS()/atomicExch() because I pass the lock to the kernel (and into the lock/unlock functions) as a pointer:

int *m_dev_lock;
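A minimal sketch of that setup on the host side (error checking omitted):

int *m_dev_lock = NULL;
cudaMalloc((void **)&m_dev_lock, sizeof(int));
cudaMemset(m_dev_lock, 0, sizeof(int));	/* lock starts out released */
/* then pass m_dev_lock to the kernel as an extra parameter */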

One more thing: I only use the first thread in each warp for locking/unlocking. In my implementation, individual warps (not individual threads) compete for the lock.

32 threads (one warp) always operate synchronously, per hardware design, i.e. there is only one program counter (PC) per warp. Different threads in one warp cannot acquire the lock at the same time! If you tried, you would get a deadlock.

What happens if you have two or more threads in a warp that both need your lock? You will then have to serialize the lock acquisition, thread by thread, e.g. along the lines of the sketch below.
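A minimal sketch of that serialization, assuming the one-PC-per-warp model described above (the helper name and parameters are my own placeholders; m_dev_minimum_global is your global minimum):

__device__ void dev_updateMinimumSerialized(int *lock, double new_dist) {
	/* Each lane of the warp takes its turn: in every iteration only one
	   lane competes for the lock, so the warp cannot deadlock on itself */
	for(int lane = 0; lane < warpSize; ++lane) {
		if((int)(threadIdx.x & (warpSize - 1)) == lane) {
			while(atomicCAS(lock, 0, 1) != 0);	/* acquire */
			if(new_dist < m_dev_minimum_global)	/* critical section */
				m_dev_minimum_global = new_dist;
			atomicExch(lock, 0);	/* release */
		}
	}
}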

I remember having seen a lock() function posted in these forums that would return the thread ID currently holding the lock, and only that single thread would be allowed to access the protected resource. The implementation would spin until all threads in a warp (or block) had been served. I cannot provide a link to the forum posting though, sorry.

I think the locking is unnecessary anyway. You could do without the locks and use atomicMin() in place of this code:

if(new_dist < m_dev_minimum_global) {
    m_dev_minimum_global = new_dist;
}

If you’re on older CUDA hardware that doesn’t offer atomicMin/Max for floats, there are workarounds that provide this feature.

Thanks for the reply,

I currently use 256 threads per block. How do I know which of those threads belongs to which warp, and isn’t that GPU-specific? How would I make sure only one thread of a warp locks the mutex?

Regarding atomicMin, I don’t see a float version of it in the CUDA Programming Guide. Or do you mean the 64-bit version for sm_35 and above? I think I could get a suitable GPU if that solves the issue.
By workaround, do you mean converting to int, comparing, and converting back?

The built-in variable warpSize gives you the warp size; so far it has always been 32. In a one-dimensional block, you can use threadIdx.x / warpSize to get the warp index.
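For example:

unsigned int warp = threadIdx.x / warpSize;	/* warp index within the block */
unsigned int lane = threadIdx.x % warpSize;	/* position within the warp */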

Floats can be reinterpreted as integers using two intrinsics: __float_as_int() and __int_as_float(). The binary representation of the floating point number is mapped directly to an integer (note that this is bit-preserving, not value-preserving).

This holds true for non-negative floats (which your distances are):

float a, b;	/* both >= 0 */
if a < b then __float_as_int(a) < __float_as_int(b)

Non-negative floating point numbers can be sorted numerically by sorting their binary representations as integers instead. That’s a really useful property of the IEEE 754 floating point representation. So you can use the integer atomicMin/Max to reliably find the smallest floating point value.
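A minimal sketch of such a wrapper (the helper name is my own; it is valid as long as the stored values are non-negative, which distances are):

__device__ float atomicMinNonNegFloat(float *addr, float value) {
	/* For non-negative floats, integer ordering of the bit patterns
	   matches the float ordering, so the integer atomicMin suffices */
	int old = atomicMin((int *)addr, __float_as_int(value));
	return __int_as_float(old);	/* previous minimum */
}

The global minimum would then have to be a float rather than a double; a 64-bit variant would need the long long atomicMin available on sm_35 and above.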

Your problem (finding the smallest value among results generated by individual threads) is also related to the task of performing a “parallel reduction”. There are some very efficient implementations available, most of them using shared memory. Atomics aren’t very fast on older GPUs, but they have been made much more efficient on Kepler devices. Only benchmarking your application will tell which approach wins.
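For reference, a minimal sketch of a block-level min reduction in shared memory (kernel name and buffers are my own placeholders; assumes blockDim.x is a power of two):

#include <cfloat>	/* FLT_MAX */

__global__ void kernel_blockMin(const float *dist, float *block_min, unsigned int n) {
	extern __shared__ float s[];
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

	/* One distance per thread; out-of-range threads load a sentinel */
	s[tid] = (i < n) ? dist[i] : FLT_MAX;
	__syncthreads();

	/* Tree reduction: halve the number of active threads each step */
	for(unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
		if(tid < stride)
			s[tid] = fminf(s[tid], s[tid + stride]);
		__syncthreads();
	}

	/* Thread 0 writes this block's minimum; combine the per-block results
	   in a second pass or with the atomicMin workaround above */
	if(tid == 0)
		block_min[blockIdx.x] = s[0];
}

Launched as kernel_blockMin<<<blocks, threads, threads * sizeof(float)>>>(dist, block_min, n).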

Thanks a lot.
I will try out what works fastest for me.

I tried using __float_as_int with atomicMin, and from a few test runs it seemed to work. The downside is that I currently do a lot of conversions, which shouldn’t really be necessary (I think) if I settle on this method.

I also tried updating my lock so that only the thread with index 0 in its warp acquires and later releases the lock, but I still get the same error when running with a full grid of <<<256, 256>>> (the screen freezes because the GPU is stuck in an infinite loop).
It does however work if the workload doesn’t exceed one block =/

My additional implementation is basically

if((threadIdx.x & (warpSize-1)) != 0)
		return;

at the beginning of both the lock and the unlock function.

EDIT: I tried using shared memory, but I get roughly 1.5-2x longer execution times. As I expected, it’s just more efficient when all blocks work together, even over slow memory.

__float_as_int() and __int_as_float() have zero overhead as far as I know. There is no type conversion involved, just a reinterpretation of a 32-bit register’s contents.

I still need to hold the value in an integer device variable, but I need to translate it back to float before using it on the host. Is there an equivalent host function to __int_as_float()?

int host_val = 0x12345678;
float test = *((float*)&host_val);

This code effectively does an __int_as_float() on the host for the value stored in host_val.
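If the compiler warns about the pointer cast (strict aliasing), a memcpy-based variant of my own does the same thing safely:

#include <string.h>

float int_bits_as_float(int v) {
	float f;
	memcpy(&f, &v, sizeof(f));	/* copy the raw bits, no value conversion */
	return f;
}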

Oh, that’s easy, thanks for that.