# Using atomicAdd to step through an array

Hey Everyone,

I have a matrix in compacted form (sorted by column) that I want to step through column by column. I have a counter to keep track of what place I am in within the array, and I want to use atomicAdd to keep track of that place / increment the counter. The code I have in mind would look something like this:

``````
int tid = threadIdx.x;

int num = 0;                 // num is the current column of the N*N matrix
__shared__ int placeholder;  // counts how many elements match the current column, num

while (num < N){
    tid = placeholder + threadIdx.x;  // offset this thread past the elements already counted

    if (a[tid].x == num){             // does this element's column match num?
        atomicAdd(&placeholder, 1);   // one increment per thread that matches
    }

    num++;
}
``````

However, when all is said and done, placeholder ends up equal only to N (the matrix dimension), when in reality there could be up to N*N elements. Any thoughts on what I’m doing wrong?

Any help would be greatly appreciated!

One thing to take note of: __syncthreads() only does block-wide synchronization.
Is a still the sparse matrix? Are you still launching as many threads as there are elements in a?
As I said in your last post, this really depends on the sparsity of your matrix, so you might want to try the advice I gave there first. I understand your original code already comes with the b[r] expanded.

At least these are the things that are definitely beneficial:

1. Do not use atomicAdd; use a reduction.
2. Only launch as many threads as there are elements in the current column/row. Or, if you want to eliminate the multiple launches, launch as many threads as the maximum number of elements in a column/row (perhaps making it a multiple of 32 would be good for the sake of the reduction, though if you follow 3. then the reduction becomes unnecessary as well). Launching N threads does make some sense sometimes, as it avoids a bit of bank conflict when accessing b, but that only helps when fewer than 31/32 of the elements in a column are zero.
3. It is fine to keep all the elements in the same array a, but you need to keep a count, perhaps in another counter array, of the offset of the starting address of each column/row at the time a is produced, so that the a[tid].x == num test can be eliminated and 2. becomes easier to do. It would also be better to align the starting address of each column to 128 bytes, through padding, when a is produced, so that there is less memory conflict when accessing a.

you could also put part of b in shared memory.
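The counter/offset idea in 3. can be sketched on the host side. This is only an illustration, assuming a is kept sorted by column and each element records its column index; the names (Elem, buildColumnOffsets, colStart) are made up:

```cpp
#include <cassert>
#include <vector>

// Hypothetical packed element: x = column index, y = row index.
struct Elem { int x, y; };

// Build colStart so that the elements of column c occupy
// a[colStart[c]] .. a[colStart[c+1]-1]. Computed once, when a is produced,
// so the kernel never needs the a[tid].x == num test or an atomic counter.
std::vector<int> buildColumnOffsets(const std::vector<Elem>& a, int N) {
    std::vector<int> count(N, 0);
    for (const Elem& e : a) count[e.x]++;   // per-column non-zero count
    std::vector<int> colStart(N + 1, 0);
    for (int c = 0; c < N; c++)             // exclusive prefix sum
        colStart[c + 1] = colStart[c] + count[c];
    return colStart;
}
```

With colStart in hand, the kernel for column num can read its elements directly from a[colStart[num]] onward, with colStart[num+1] - colStart[num] threads doing useful work.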

I do think you should familiarize yourself with CUDA first, perhaps by following the examples in a book (CUDA by Example?).

BTW, I’m just curious. You don’t go through each element N times in your CPU code, do you? I still tend to believe that this kind of algorithm is not well-suited for GPU.

Hey, hyqneuron - thanks for your response to this thread as well. I’ve been meaning to update my old thread to keep you in the loop. This algorithm definitely isn’t greatly suited for parallelization… It has too many data dependencies and as such should probably be kept on the CPU. However, I feel it is a good challenge.

With the code posted in the other thread, I was setting up one thread per matrix element. In this code, I’m launching just one block with 1024 threads. What I’m using the above code for is stepping through the entire matrix in column order: I use placeholder to see how many elements in a given iteration of the while loop match “num”, so after that iteration I can move “placeholder” elements forward and do the same thing again. Since it’s one block, I don’t have to worry about out-of-order block execution or inter-block communication. Surprisingly, this is significantly faster than my old method: the old method took about 250 ms for a 6000x6000 matrix, while the method shown above takes about 33 ms. The biggest bottleneck was the CPU. As for sparsity, the matrices I’m generating are generally pretty sparse (about 1% filled), so I think the matrix I just mentioned has about 174,000 elements.

With regards to your suggestions - I’m not using atomicAdd to sum the elements. In fact, in the algorithm shown in my previous thread, there isn’t a summation. I understand the code is probably a bit hard to follow, and that is my fault.

In the CPU code, each element of the matrix is touched just once. The problem with the CPU code is that once N gets large, the CPU can’t do these calculations fast enough for the simulator. The code I’ve posted is actually just the first part of the algorithm. The second part is the trickier part: it does include a summation (for which I’ve modified the reduction kernel from the SDK), and it has the same restrictions as the first half, but goes in row order starting at N-1.

I was able to get the code above to work, below is what it looks like:

``````
__global__ void L(cuDoubleComplex *a, cuDoubleComplex *b, cuDoubleComplex *c, int N, int ne){
    // num refers to the particular column of the matrix; N = vector size
    // a[] elements carry their matrix coordinates in .px (column) and .py (row),
    // and operator* is assumed to be overloaded for the complex type (e.g. via cuCmul)
    int tid = threadIdx.x;
    int num = 0;
    __shared__ int placeholder;   // how many elements have matched a column so far

    if (threadIdx.x == 0)
        placeholder = 0;          // shared memory is not zero-initialized
    __syncthreads();

    while (num < N){
        tid = placeholder + threadIdx.x;  // advance tid past the elements already consumed
        __syncthreads();                  // everyone reads placeholder before anyone bumps it

        if (tid < ne){
            if (a[tid].px == num && a[tid].py == num){
                if (num == 0){                // hardcoded special case where num == 0
                    c[num] = a[num] * b[num]; // element on the diagonal for num = 0
                } else {
                    c[num] = a[tid] * b[num]; // element on the diagonal
                }
                atomicAdd(&placeholder, 1);   // count the diagonal element
            }

            if (a[tid].px == num && a[tid].py > num){
                b[a[tid].py] = b[a[tid].py] + a[tid] * b[num];
                atomicAdd(&placeholder, 1);   // count elements below the diagonal
            }

            if (a[tid].px == num && a[tid].py < num)
                atomicAdd(&placeholder, 1);   // count elements above the diagonal
        }

        __syncthreads();  // make the updated placeholder visible before the next column
        num++;
    }
}
``````

1. Use an array to record the number of non-zero elements in each column when a is produced, so that you can do away with the atomic operation.

2. Put the entirety of b in shared memory. This should speed things up a little, since the L1 cache can’t possibly hold all of b.

3. Move the ifs out of your loop, and put the special case on a different code path (a different warp altogether). This could be done like this:

``````
int offset = 0;

if (tid < 32)  // on the diagonal: give an entire warp its own path to avoid branching
{
    for (int i = 0; i < N; i++)
    {
        c[i] = a[offset] * b[i];  // even though there are 32 operations in this warp,
                                  // it's effectively a single one since it's the same
                                  // for all 32 threads
        offset += counter[i];
    }
}
else
{
    tid -= 31;  // make up for the gap introduced above; anyway you'll have enough
                // threads to handle the work given such a high degree of sparsity

    for (int i = 0; i < N; i++)
    {
        if (tid < counter[i])  // below the diagonal; I'm assuming only elements
                               // on and below the diagonal go into a
        {
            int y = a[offset + tid].py;
            b[y] += a[offset + tid] * b[i];
        }

        offset += counter[i];
        __syncthreads();  // once all threads have advanced by the same number
                          // of elements, execution resumes
    }
}
``````

May have some errors in the code I gave… hope you get the gist
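For what it’s worth, the intended result can be checked against a plain CPU version of the same forward sweep, using a per-column counter array instead of atomicAdd. This is only a sketch: it assumes the diagonal element is stored first in each column and only on-and-below-diagonal elements go into a; std::complex stands in for cuDoubleComplex, and the names (Elem, forwardSweep) are hypothetical:

```cpp
#include <cassert>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Hypothetical packed element: row index plus the value.
struct Elem { int row; cd val; };

// Forward sweep over a column-sorted, lower-triangular sparse matrix:
// c[i] = a_diag(i) * b[i], then b[row] += a(row, i) * b[i] for rows below i.
// counter[i] = number of stored elements in column i (diagonal first).
void forwardSweep(const std::vector<Elem>& a, const std::vector<int>& counter,
                  std::vector<cd>& b, std::vector<cd>& c) {
    int offset = 0;
    for (std::size_t i = 0; i < counter.size(); i++) {
        c[i] = a[offset].val * b[i];             // diagonal element
        for (int k = 1; k < counter[i]; k++) {   // elements below the diagonal
            const Elem& e = a[offset + k];
            b[e.row] += e.val * b[i];
        }
        offset += counter[i];                    // jump to the next column
    }
}
```

The counter array replaces both the a[tid].px == num test and the atomic increment: the offset of each column is known in advance.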

You’re saying the second half is the half above the diagonal? Isn’t that more or less the same as L? (You can check my last post in your previous thread. You’ll see that when you expand the b[i], you get your current, column-dependent algorithm; when you don’t expand it, you get what I had written there, which is row-dependent and which I thought might not be optimal, depending on the filling pattern of the matrix.) If it is indeed as I imagine, then you can do the calculation for both sections together, which should improve efficiency a bit further.

There’s another optimization that could be done:

Use cuComplex, not cuDoubleComplex: if you use the counter, the x coordinate becomes unnecessary, and you can just store the y coordinates in a separate array. This also saves some bandwidth. Of course, to see whether this will pay off you’ll have to check whether the compiler uses ld64 instead of ld. If it’s using ld, then I guess you’ll have to stop using cuComplex altogether and use one array for the real part and another for the imaginary part.
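A host-side sketch of that layout change, splitting the combined struct into a y-coordinate array plus separate real and imaginary arrays (all names here are hypothetical, for illustration only):

```cpp
#include <cassert>
#include <vector>

// AoS element as in the kernel: coordinates packed together with the value.
struct ElemAoS { int px, py; double re, im; };

// SoA layout: the x coordinate is implied by the per-column counter array,
// so only py plus the real and imaginary parts need to be stored.
struct ElemsSoA {
    std::vector<int>    py;
    std::vector<double> re, im;
};

ElemsSoA toSoA(const std::vector<ElemAoS>& a) {
    ElemsSoA s;
    s.py.reserve(a.size());
    s.re.reserve(a.size());
    s.im.reserve(a.size());
    for (const ElemAoS& e : a) {   // px dropped: recoverable from counter[]
        s.py.push_back(e.py);
        s.re.push_back(e.re);
        s.im.push_back(e.im);
    }
    return s;
}
```

On the device, three flat arrays like these give fully coalesced loads regardless of how the compiler handles wide struct loads.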

And you can try this one, though I’m not sure if it will help, since it introduces a bit of overhead in the calculation of the offsets, which is only effectively amortized when the matrix is less sparse: use padding to ensure that the second element in any column is aligned to 128 bytes. This makes coalesced access easier.
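The padding could be computed once, when a is produced, along these lines. This is a sketch under the assumption of 16-byte elements, so 128 bytes holds 8 of them; the names (paddedOffsets, elemsPerLine) are made up:

```cpp
#include <cassert>
#include <vector>

// Compute padded column start offsets so that the *second* element of each
// column (the first below-diagonal one) lands on a 128-byte boundary.
// elemsPerLine: elements per 128 bytes, e.g. 8 for a 16-byte double-complex.
std::vector<int> paddedOffsets(const std::vector<int>& count, int elemsPerLine) {
    std::vector<int> start(count.size());
    int offset = 0;
    for (std::size_t c = 0; c < count.size(); c++) {
        // round (offset + 1) up to the next multiple of elemsPerLine, then
        // step back one so the diagonal sits just before the boundary
        int second = ((offset + 1 + elemsPerLine - 1) / elemsPerLine) * elemsPerLine;
        start[c] = second - 1;
        offset = start[c] + count[c];   // next column begins after this one
    }
    return start;
}
```

The slots between columns are wasted storage, which is why this only pays off when columns are long enough relative to the padding.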

174,000 / (6,000 * 6,000) ≈ 0.48% fill; that’s 174,000 / 6,000 = 29 non-zero elements per column on average.

Also, I think launching maybe 256 threads should be more than enough, if the non-zero values are uniformly distributed.

Anyway, given the low effective occupancy your code has, I’m very surprised that your GPU code still runs faster than your CPU code!!! (Only one effective warp on the entire device most of the time… memory access not coalesced… inefficient use of atomicAdd… OMG, I don’t believe it.) How much is your current speedup? I’m really tempted to think your CPU code has some really big problems.

Thanks for the code - I will certainly try to implement this when I get home tonight. It does seem much more elegant to avoid branching by using an entire warp. I bet your code will squeeze even more performance out.

The second half of the algorithm is similar to L, except that instead of going column-wise, we go row-wise. It’s also reversed, in that we start at row N-1 and work our way to zero, and we stay to the right of the diagonal since it’s row-wise.

I plan to implement your suggestion of an array recording how many non-zero elements are produced per column/row. That does make more sense than using atomicAdd.

Thanks again!

You intentionally wrote the CPU code in an extremely inefficient manner to show your boss how great GPGPU can be, didn’t you?

lol - no, I didn’t. But it should be said that this work is for a paper I’m writing, not for an actual implementation. I have been working with a company to develop a copy of their CPU algorithm, and then obviously parallelize it with CUDA.

That said - I implemented some of the suggestions you made, as well as a few other things I found, and things are looking pretty good. I’m actually achieving about a 17x speedup over the CPU. For an algorithm that is barely parallelizable due to the data dependencies and whatnot, I am pretty proud of that.

Using cudaprof, it looks like I’m not as inefficient as one might think. I’m getting about 67% occupancy, which according to the manual isn’t all that bad (they state anything above 50% is pretty good).

There is still some more work to be done, but I will likely have to hold off - I only have a month or so to write the paper.

Again, thanks for your help - I very much appreciate it!

Well, as long as things work out for you, that’s good!