Multiple blocks writing to the same location: consistency during memory address collisions

I have a peculiar reduction problem. I want to find the FIRST occurrence of a float32 above a given threshold.

One crazy thought I had was to simply create a small number of global device ints and write to them whenever I found values above the threshold.

So, for example, suppose I am recording indices into two buckets: those below the halfway index and those above it.

So I have a code snippet in my device function that looks like this:

if (mxvals[0] > thresh) {
	if (r < r2) {
		ixrc[0] = r;        // hit with time index below r2
		ixrc[1] = iarr[0];
	} else {
		ixrc[2] = r;        // hit with time index at or above r2
		ixrc[3] = iarr[0];
	}
}

ixrc[0] and ixrc[1] are for values I find whose time indices are less than r2, and ixrc[2] and ixrc[3] are for indices greater than or equal to r2.

The 4-element ixrc array is in global memory, and thus there will be multiple write collisions to this memory. I initialize the entries to invalid values prior to running the kernel.
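
For reference, the initialization step might look roughly like this on the host (a minimal sketch: d_ixrc and the use of cudaMalloc here are assumptions, not my actual code; the -1 sentinel matches what I use):

int h_init[4] = { -1, -1, -1, -1 };                // "invalid" sentinel values
int *d_ixrc;
cudaMalloc(&d_ixrc, 4 * sizeof(int));
cudaMemcpy(d_ixrc, h_init, 4 * sizeof(int), cudaMemcpyHostToDevice);
// ... launch the search kernel, then copy d_ixrc back and inspect it ...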

Now, this algorithm doesn’t care about the collisions as long as one of them wins and the result is consistent. Is there any guarantee that this will be the case?

First, will I get valid data in ixrc, and second, do I have any assurance that the value in ixrc[0] is properly associated with the one in ixrc[1]?

My simple experiments with small arrays suggest that the memory assignments are consistent, and that the last block’s indices seem to get written last.

I suspect I may have to fix it so that I don’t require consistency between ixrc[0] and ixrc[1].
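
One possible workaround, sketched here only as an idea (the packed variable and helper are made-up names, not part of my code): pack the two 32-bit values into a single 64-bit word and store it with one atomic exchange, so the pair can never be observed half-updated.

__device__ unsigned long long ixrc_packed;          // hypothetical packed slot, set to a sentinel before launch

__device__ void record_pair(int r, int idx)
{
	unsigned long long packed =
		((unsigned long long)(unsigned int)r << 32) | (unsigned int)idx;
	atomicExch(&ixrc_packed, packed);           // one indivisible write; last writer wins
}

The host would then unpack the high and low halves after the kernel finishes.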

My algorithm will use any values it finds over threshold to bound the problem and set ranges for further runs, in a kind of divide and conquer fashion. Statistically the values above threshold will be quite rare.

ixrc[0] and ixrc[1] will not be correlated, because they are not atomic writes. However, the values written to them will be valid.

Check atomic operations in CUDA – they are very slow, so you have to be prudent. But since the frequency is low, you can give them a try.

You may need to implement a global lock and then update ixrc[0] and ixrc[1] before releasing the lock. There are other threads in the forum that discuss how to implement a lock (which is a bit tricky in CUDA).
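
As a rough sketch of what such a lock could look like (illustrative names only; beware that naive spin locks can deadlock if several threads of the same warp contend for them):

__device__ int ixrc_lock = 0;                       // 0 = free, 1 = held

__device__ void record_pair_locked(int *ixrc, int r, int idx)
{
	while (atomicCAS(&ixrc_lock, 0, 1) != 0)
		;                                   // spin until the lock is acquired
	ixrc[0] = r;                                // both halves of the pair are updated
	ixrc[1] = idx;                              // while no other thread can interleave
	__threadfence();                            // make the stores visible before releasing
	atomicExch(&ixrc_lock, 0);                  // release the lock
}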

By consistency, do you mean that you are expecting the same thread (e.g. the last thread in the last block, the first thread in the second block, etc.) to do the writing?

From version 3.1 of the programming guide, section 4.1, last paragraph: “If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Sections G.3.2, G.3.3, G.4.2, and G.4.3) and which thread performs the final write is undefined.” I don’t think your method is recommended if you are hoping for a particular thread to always be the winner.
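
If a well-defined winner does matter, one option (not from your post, just a sketch with illustrative names) is to reduce with atomicMin on the index, so the smallest, i.e. earliest, qualifying index always wins regardless of scheduling:

__device__ int first_index = 0x7FFFFFFF;            // INT_MAX sentinel, reset before each launch

__global__ void find_first(const float *data, int n, float thresh)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n && data[i] > thresh)
		atomicMin(&first_index, i);         // earliest index above threshold survives
}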

I don’t care if a particular thread is the winner. All I care about is that my integer values do not get trashed in some way because there are multiple writes to the same location from multiple threads. I would also care if the scheme were particularly slow, though even then only a very few values are expected to pass the threshold, maybe 1 in 10,000.

So far my scheme seems to be working, though I haven’t really done any performance benchmarks. My idea is to create two or more buckets (I decided on 8 in my code) that divide the search space into equal partitions. The integer values in ixrc are initialized to -1. If something is found that passes the threshold, its index is written into ixrc[k], where k is the bucket index (k = 0…7). I pull ixrc back from the device and check for the first non-negative index, if one exists. If I have found something, my search space is reduced by a factor of 8 and a little more, since I can always upper-bound my indices by the smallest one found so far that passes the threshold.

Thus I don’t care about write collisions as long as the data that finally gets written is valid. So far it seems to be valid. Any write to a location tells me that at least one point was found that passes threshold.
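
For what it’s worth, the bucket idea boils down to something like the following sketch (only the bucket count of 8, the -1 initialization, and the ixrc array come from my description above; the rest is schematic):

#define NBUCKETS 8

__global__ void bucket_search(const float *data, int n, float thresh, int *ixrc)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n && data[i] > thresh) {
		int k = i / ((n + NBUCKETS - 1) / NBUCKETS);   // which of the 8 partitions i falls in
		ixrc[k] = i;   // plain store: colliding writers are fine, any valid index will do
	}
}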

This problem is kind of funny to parallelize. I initially tried more of a classic reduction scheme, but I ended up with some kind of memory fault in my index array that I could never debug. cuPrintf was insufficient to the task, and Ocelot was showing memory faults in areas that seem to run just fine under CUDA. So I just ended up doing it this way.
