Question about AtomicExch()

As an exercise I wrote a CUDA version of the standard subset problem in which a bit mask is used to brute force enumerate all the different subsets out of the 2^n possibilities. The kernel takes in the set size (in this case 30) and the subset size (say 15). The kernel counts the bits in the current index, and if it is equal to the target subset size, then it calculates the sum of values associated with that specific subset
(using a global constant array[30] with the values associated with each element index).

Rather than the usual shared memory method I like to use, I instead attempted to used AtomicMax() to return the best answer with AtomicExch() to store the bitmask pattern associated with that optimal answer.

The overall answer for the optimal value has always been correct(when compared to the CPU result), but the mask answer is sometimes off by 1-3 bits.

I really have not used Atomics much before, so I probably am missing some obvious error probably with the AtomicExch() call.

here are the relevant parts of the code;

#define THREADS 256

__constant__ int d_cost[30]={344,12,55,3,2,6,5,11,2232,44,77,12,31,41,51,66,777,888,1,22,3,71,12,333,44,66,77,98,70,200};//global device cost array

__device__ int d_cntbt(int n){//count the bits of an int on the device
	return (((n+(n>>4))&0x0F0F0F0F)*0x01010101)>>24;

__global__ void _gpu_subset_test(const int set_size, const int subset_size,int *local_max,int *global_max,int *result_mask,int local_sz){

	int idx=threadIdx.x+blockDim.x*blockIdx.x;
	int i2=idx;
	if(idx>=(1<<set_size))return;//out of range

	int d_temp=0,local_idx=threadIdx.x;
	for(int j=0;j<set_size;j++){
	int old_best=atomicMax(&local_max[local_idx],d_temp);//
		atomicMax(global_max,d_temp);//update the value
		atomicExch(result_mask,idx);//update the mask

int main(){...

int num_blocks=(1<<setSize)/THREADS;

_gpu_subset_test<<<num_blocks,THREADS>>>(setSize,subsetSize,d_loc,d_mx,d_rez_mask,THREADS);//launch kernel

if(err!=cudaSuccess){printf("%s in %s at line %d

..then copy back results to host

the local_max array is of size THREADS. I do use cudaMemset(0) for all the pointers/variables before the call.


Since in that global constant array there were a few repeats of the same value, I tested on an array with only unique values, and that did not seem to make a difference.

I realized that there could be more than one combination which resulted in the same optimal value, so I checked if that was the issue, and it appears that is not the case.

The serial cpu version goes in sequential order, and will only update the result mask when there is a better answer for the value of the combination.
As I understand it the GPU will go through the same range of numbers(referring to the value in the kernel of idx ), but not necessarily in sequential order. Because of that I thought it may be seeing first some other optimal combination, and storing that mask with that value.
That does not seem to be that case.

While the CPU and GPU version both generate that optimal answer, the mask for the GPU version is usually incomplete(1-3 less bits set), and not summing to that optimal value.

Just in case I did not explain the code well, here is the cpu implementation which I have been comparing the results of the GPU against;

int _cpu_subset_test(const int set_size, const int subset_size,int &result_mask){//take in result mask by reference
	if(set_size>30){cout<<"Too Large!
";return -1;}
	int ans=0;
	for(int i=0;i<(1<<set_size);i++){
			int temp=0;
			for(int j=0;j<set_size;j++){
	return ans;

So the main issue I having with the GPU is the recording of the integer value which has the bit pattern associated with the optimal subset. The AtomicExch() must be out of sync with the optimal value, because the GPU version returns the right optimal value, but the incorrect bitmask.

I should also mention that the GPU version is 15-50 times faster than the CPU version, but until I get past this recording of the bitmask it is irrelevant.