Use an array to record the number of non-zero elements in each column when a is produced, so that you can do away with the atomic operation.
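For instance, a minimal host-side sketch of what I mean (names like dense, numRows, numCols and the Elem struct are my own; I'm assuming a is packed column by column, keeping only the elements on & below the diagonal, diagonal first):

#include <cuComplex.h>

struct Elem { cuDoubleComplex val; int py; };      // value + its y (row) coordinate

// build the packed array a and the per-column counter in a single pass;
// the kernel can then derive every column's offset by summing counter[],
// so it never needs atomicAdd to find a slot
int nnz = 0;
for (int col = 0; col < numCols; col++) {
    counter[col] = 0;
    for (int row = col; row < numRows; row++) {    // on & below the diagonal only
        cuDoubleComplex v = dense[row + col * numRows];   // column-major dense input
        if (cuCreal(v) != 0.0 || cuCimag(v) != 0.0) {
            a[nnz].val = v;
            a[nnz].py  = row;
            nnz++;
            counter[col]++;
        }
    }
}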
Put the entirety of b in shared memory. This should speed things up a little, since the L1 cache can't possibly hold all of b.
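Something along these lines inside the kernel (a sketch, assuming N is known at compile time and 16 * N bytes of b fit in shared memory):

__shared__ cuDoubleComplex b_sh[N];                // N must be a compile-time constant here
for (int i = threadIdx.x; i < N; i += blockDim.x)  // cooperative, coalesced copy
    b_sh[i] = b[i];
__syncthreads();                                   // everyone sees the full copy before use
// ...then read/write b_sh instead of b in the loops below,
// and copy it back to global memory at the end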
Move the ifs out of your loop. Put the special case on a different code path (a different warp altogether). This could be done like this:
int offset = 0;
if (tid < 32) {                       // on the diagonal: give a whole warp its own path to avoid branching
    for (int i = 0; i < N; i++) {
        c[i] = a[offset].val * b[i];  // even though 32 threads run this, it's effectively a single
                                      // operation 'cos it's the same for all 32 of them
        offset += counter[i];
    }
}
__syncthreads();                      // here's the trick (see the note at the bottom)
tid -= 31;                            // make up for the gap introduced above: threads 32+ map onto
                                      // tid 1+, the below-diagonal slots of each column; with such a
                                      // high degree of sparsity there are enough threads for the work
offset = 0;
for (int i = 0; i < N; i++) {
    if (tid >= 1 && tid < counter[i]) {     // below the diagonal -- I'm assuming only elements on & below
        int y = a[offset + tid].py;         // the diagonal go into a, with the diagonal stored first
        b[y] += a[offset + tid].val * b[i]; // (with cuDoubleComplex you'd spell these with cuCmul/cuCadd)
    }
    offset += counter[i];
    __syncthreads();   // a guy did this test before, and the result was positive:
                       // as long as all threads in the block reach __syncthreads()
                       // the same number of times, execution resumes correctly
}
There may be some errors in the code I gave… hope you get the gist.
You're saying the second half is the half above the diagonal? Isn't that more or less the same as L? (You can check my last post in your previous thread: when you expand the b[i], you get your current algorithm, which is column-dependent; when you don't expand it, you get what I had written there, which is row-dependent and which I thought might not be optimal, depending on the filling pattern of the matrix.) If it is indeed as I imagine, then you can do the calculation for both sections together, which should improve efficiency a bit further.
There’s another optimization that could be done:
Use cuComplex, not cuDoubleComplex. Since you're using the counter, the x coordinate becomes unnecessary, so you can store just the y coordinates in a separate array; this also saves some bandwidth. Of course, to see whether this pays off you'll have to check whether the compiler uses ld64 instead of ld. If it's using ld, then I guess you'll have to drop cuComplex altogether and use one array for the real parts and another for the imaginary parts.
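Roughly this layout (array names are mine):

// instead of an array of structs carrying value + coordinates:
cuComplex *a_val;    // values only, 8 bytes each -- ideally one 64-bit load per
                     // element, half the bandwidth of cuDoubleComplex
int       *a_py;     // y coordinates in their own array; x isn't stored at all,
                     // since counter[] already tells you where each column starts

// fallback, if the PTX shows plain 32-bit ld instead of a 64-bit load:
float *a_re;         // real parts
float *a_im;         // imaginary parts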
And you can try this one, though I'm not sure if it will help, since it introduces a bit of overhead into the calculation of offset that only really pays off at a lower degree of sparsity: use padding to ensure that the second element of every column is aligned to a 128-byte boundary. That makes coalesced access easier.
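Something like this for the offsets (a sketch, assuming 8-byte cuComplex elements, so 128 bytes = 16 elements, and the diagonal element stored first in each column):

// choose column start offsets so that offset + 1 (the first below-diagonal
// element) is a multiple of 16, i.e. lands on a 128-byte boundary
int next = 15;                             // element 16 of column 0 is aligned
for (int i = 0; i < N; i++) {
    offsets[i] = next;
    next += counter[i];                    // one past the end of column i
    next = ((next + 16) / 16) * 16 - 1;    // round up to the next value that is 15 mod 16
}
// note the offsets are no longer a plain running sum of counter[],
// which is the extra overhead mentioned above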
174,000 / 100,000,000 = 0.00174 (the density, for a 10,000 × 10,000 matrix); 0.00174 × 10,000 = 17.4 non-zero elements per column on average.
Also, I think launching 256 threads should be more than enough if the non-zero values are uniformly distributed (that's only about 17 of them per column on average).
Anyway, given the low effective occupancy your code has, I'm very surprised that your GPU code still runs faster than your CPU code!!! (Only 1 effective warp for most of the time on the entire device… memory access not coalesced… inefficient use of atomicAdd… OMG, I don't believe it.) How much is your current speedup? I'm really tempted to think that your CPU code has some really big problems.