Multiple blocks writing to the same location: consistency during memory address collisions

I have a peculiar reduction problem. I want to find the FIRST occurrence of a float32 above a given threshold.

One crazy thought I had was to simply create a small number of global device ints and write to them whenever I found values above the threshold.

So, for example, suppose I am recording indices into two buckets: those below the halfway index and those above it.

So I have a code snippet in my device function that looks like this:

if (mxvals[0] > thresh) {
	if (r < r2) {
		ixrc[0] = r;        // hit with time index below r2
		ixrc[1] = iarr[0];
	} else {
		ixrc[2] = r;        // hit with time index at or above r2
		ixrc[3] = iarr[0];
	}
}

ixrc[0] and ixrc[1] are for values I find whose time indices are less than r2, and ixrc[2] and ixrc[3] are for indices greater than or equal to r2.

The 4-element ixrc array is in global memory, and thus there will be multiple write collisions to this memory. I initialize the entries to invalid values prior to running the kernel.
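
For reference, the initialization step might look roughly like this on the host (a minimal sketch: d_ixrc and the use of cudaMalloc here are assumptions, not my actual code; the -1 sentinel matches what I use):

int h_init[4] = { -1, -1, -1, -1 };                // "invalid" sentinel values
int *d_ixrc;
cudaMalloc(&d_ixrc, 4 * sizeof(int));
cudaMemcpy(d_ixrc, h_init, 4 * sizeof(int), cudaMemcpyHostToDevice);
// ... launch the search kernel, then copy d_ixrc back and inspect it ...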

Now, this algorithm doesn’t care about the collisions as long as one of them wins and the result is consistent. Is there any guarantee that this will be the case?

First, will I get valid data in ixrc, and second, do I have any assurance that the value in ixrc[0] is properly associated with the one in ixrc[1]?

My simple experiments with small arrays suggest that the memory assignments are consistent, and that the last block’s indices seem to get written last.

I suspect I may have to fix it so that I don’t require consistency between ixrc[0] and ixrc[1].
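
One possible workaround, sketched here only as an idea (the packed variable and helper are made-up names, not part of my code): pack the two 32-bit values into a single 64-bit word and store it with one atomic exchange, so the pair can never be observed half-updated.

__device__ unsigned long long ixrc_packed;          // hypothetical packed slot, set to a sentinel before launch

__device__ void record_pair(int r, int idx)
{
	unsigned long long packed =
		((unsigned long long)(unsigned int)r << 32) | (unsigned int)idx;
	atomicExch(&ixrc_packed, packed);           // one indivisible write; last writer wins
}

The host would then unpack the high and low halves after the kernel finishes.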

My algorithm will use any values it finds over threshold to bound the problem and set ranges for further runs, in a kind of divide and conquer fashion. Statistically the values above threshold will be quite rare.

ixrc[0] and ixrc[1] will not be correlated, because they are not atomic writes. However, the values written to them will be valid.

Check atomic operations in CUDA – they are very slow, so you have to be prudent. But since the frequency is low, you can give them a try.

You may need to implement a global lock and then update ixrc[0] and ixrc[1] before releasing the lock. There are other threads in the forum that discuss how to implement a lock (which is a bit tricky in CUDA).
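
As a rough sketch of what such a lock could look like (illustrative names only; beware that naive spin locks can deadlock if several threads of the same warp contend for them):

__device__ int ixrc_lock = 0;                       // 0 = free, 1 = held

__device__ void record_pair_locked(int *ixrc, int r, int idx)
{
	while (atomicCAS(&ixrc_lock, 0, 1) != 0)
		;                                   // spin until the lock is acquired
	ixrc[0] = r;                                // both halves of the pair are updated
	ixrc[1] = idx;                              // while no other thread can interleave
	__threadfence();                            // make the stores visible before releasing
	atomicExch(&ixrc_lock, 0);                  // release the lock
}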

By consistency, do you mean that you are expecting the same thread (e.g. the last thread in the last block, the first thread in the second block, etc.) to do the writing?

From version 3.1 of the programming guide, section 4.1, last paragraph: “If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Sections G.3.2, G.3.3, G.4.2, and G.4.3) and which thread performs the final write is undefined.” I don’t think your method is recommended if you are hoping for a particular thread to always be the winner.
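
If a well-defined winner does matter, one option (not from your post, just a sketch with illustrative names) is to reduce with atomicMin on the index, so the smallest, i.e. earliest, qualifying index always wins regardless of scheduling:

__device__ int first_index = 0x7FFFFFFF;            // INT_MAX sentinel, reset before each launch

__global__ void find_first(const float *data, int n, float thresh)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n && data[i] > thresh)
		atomicMin(&first_index, i);         // earliest index above threshold survives
}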

I don’t care if a particular thread is the winner. All I care about is that my integer values do not get trashed in some way because there are multiple writes to the same location from multiple threads. I would also care if the scheme were particularly slow, though even then only a very few values are expected to pass the threshold, maybe 1 in 10,000.

So far my scheme seems to be working, though I haven’t really done any performance benchmarks. My idea is to create two or more buckets (I decided on 8 in my code) that divide the search space into equal partitions. The integer values in ixrc are initialized to -1. If something is found that passes the threshold, its index is written into ixrc[k], where k is the bucket index (k = 0…7). I pull ixrc back from the device and check for the first non-negative index, if one exists. If I have found something, my search space is reduced by a factor of 8 and a little more, since I can always upper-bound my indices by the smallest one found so far that passes the threshold.

Thus I don’t care about write collisions as long as the data that finally gets written is valid. So far it seems to be valid. Any write to a location tells me that at least one point was found that passes threshold.
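
For what it’s worth, the bucket idea boils down to something like the following sketch (only the bucket count of 8, the -1 initialization, and the ixrc array come from my description above; the rest is schematic):

#define NBUCKETS 8

__global__ void bucket_search(const float *data, int n, float thresh, int *ixrc)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n && data[i] > thresh) {
		int k = i / ((n + NBUCKETS - 1) / NBUCKETS);   // which of the 8 partitions i falls in
		ixrc[k] = i;   // plain store: colliding writers are fine, any valid index will do
	}
}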

This problem is kind of funny to parallelize. I initially tried more of a classic reduction scheme, but I ended up with some kind of memory fault in my index array that I could never debug. cuPrintf was insufficient to the task, and Ocelot was showing memory faults in areas that seem to run just fine under CUDA. So I just ended up doing it this way.
