Bottlenecks in SpBlockMatrix * DeVector: ideas on how to remove bottlenecks.

Sorry, had to remove it due to some corporate policies.

I was going to say something about the float4’s (all computations are scalar on G80), but realized they do you some good because they align your memory accesses.
However, there is little need to use them if you implement the memcpy’s from global into shared correctly. Once data is in shared memory, float4 possibly hurts performance because, unless the compiler/hardware does something smart, it’s an automatic bank conflict. Also, float4 means an automatic (er, manual?) unrolling of loops. Using scalars might save registers.
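To make the bank-conflict claim concrete, here is a small host-side model (my own sketch, not from the original posts) of G80’s shared memory banking: 16 banks, 4 bytes wide each.

```cpp
#include <cassert>

// Host-side model of G80 shared memory banking: 16 banks, each 4 bytes
// wide, so the 4-byte word at index w lives in bank w % 16.
int bank_of_word(int wordIndex) { return wordIndex % 16; }

// Scalar access: thread t of a half-warp reads word t, hitting 16
// distinct banks -> conflict-free.
int scalar_bank(int t) { return bank_of_word(t); }

// float4 access: thread t's first word is 4*t, so threads 0, 4, 8, 12
// all start in bank 0 -> a 4-way conflict (likewise for other banks).
int float4_first_bank(int t) { return bank_of_word(4 * t); }
```

For example, `float4_first_bank(0)` and `float4_first_bank(4)` land in the same bank, which is the conflict being described.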

float4 m = matBlock[ts]; // This could be removed if using registers instead of shared mem for m

it seems m IS using registers

  1. Can one check if reads are being coalesced? No, not directly. But you can change your code to break the rules and see the performance drop.
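For reference, the rule this “break it and measure” experiment exercises can be written as a predicate. This is a simplified host-side model of the G80 coalescing requirement for 32-bit loads (the 64-byte alignment and strict thread order are my assumptions about the 1.x rule, not something stated in the posts):

```cpp
// Simplified G80 coalescing check for a half-warp of 32-bit loads:
// thread t must read byte address base + 4*t, with base 64-byte aligned.
// Shifting the base by even one word breaks it, which is the cheap
// experiment suggested above.
bool coalesced(const unsigned addr[16]) {
    if (addr[0] % 64 != 0) return false;              // segment alignment
    for (unsigned t = 1; t < 16; ++t)
        if (addr[t] != addr[0] + 4 * t) return false; // contiguous, in order
    return true;
}
```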

  2. If it’s a small performance boost, then fight the urge to optimize. Remember, the real performance gains probably lie elsewhere. However, please explain why using the register becomes worse for large N’s.

  3. Why don’t you load the vector into shared mem so the random reads hurt much less? Textures or constants, with their caches, should also help. (But don’t use textures for matData or anything else that’s perfectly coalesced and not reused.) To be honest, I don’t understand why the iVec reads should be random.

  4. I think yes. In general, it seems each thread is doing too little work. Think about how you can fatten up each thread and take advantage of data reuse (e.g., if you do one or two blocks per thread, you’ll read iVec fewer times). Also, you can spend less time on the reduction, which looks slow (possibly the bulk of your present kernel). Do more ops between syncing threads (use a bigger radix) and don’t operate on float4’s, because you’re wasting time on the .w’s. (P.S. Isn’t your reduction incorrect when (3*BLOCK_SIZE) is not a Po2, and shouldn’t the test condition be “divider>=1”?)
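Regarding the Po2 question in point 4, a serial simulation of the tree reduction makes both failure modes visible. The helper below is my own sketch; the `stop` parameter models the loop’s termination test (`stop = 1` corresponds to the suggested “divider>=1”).

```cpp
#include <cstddef>
#include <vector>

// Serial model of the shared-memory tree reduction: in each pass,
// "thread" ts (ts < divider) does dotBlock[ts] += dotBlock[ts + divider],
// then divider is halved. The loop runs while divider >= stop.
float tree_reduce(std::vector<float> v, std::size_t stop) {
    for (std::size_t divider = v.size() / 2; divider >= stop; divider /= 2)
        for (std::size_t ts = 0; ts < divider; ++ts)
            v[ts] += v[ts + divider];
    return v[0];
}
```

With `stop = 1` and a power-of-two size the sum is exact; with `stop = 2` (i.e., a “divider>1” test) the final pairwise add is skipped, and with a non-Po2 size some elements are silently dropped.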

  5. You’ll have to ask God that one (or the cubin). The compiler often does weird, dumb things, but your branches, although frequent, are very simple.

oh, one more thing about the reduction:
I think if you change

    if (ts < divider)
        dotBlock[ts] += dotBlock[ts + divider];

to

    if (ts >= divider)
        dotBlock[ts] += dotBlock[ts + divider];

it’ll be better.

Thank you Alex, your hints are most welcome!

First off, I will try to clarify some parts that were described a bit vaguely.

My point was that it is possible to remove the variable “matBlock” and use four registers as a float4 per thread instead. I introduced the variable “m” to make the code easier to read, sorry for making it more confusing instead =). I have added a comment to the code on how this could be done without “m”.

Currently, each thread uses ~5 ints (tx,ty,…) and, in the case of using regs instead of shared, 2*float4 (m,v) = 8 floats, giving a total of ~13 regs / thread. One block consists of 128x3 threads in this case, which gives a total of 128x3x13 ~ 5000 registers / MP. If I then increase the block dimension to its maximum (512 threads) => 170x3x13, I will get closer to the maximum available registers / MP. However, this is not a problem for my example, but it might become an issue later. I am considering sorting the matrix so that adjacent rows will have a high count of intersecting vector indices, making it possible to issue fewer reads to global mem.
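Spelling out the arithmetic above as a sanity check (using the paragraph’s own estimates, and ignoring the compiler temporaries that a later reply points out):

```cpp
// Per-thread estimate from the text: ~5 ints plus two float4s (m, v).
constexpr int kRegsPerThread   = 5 + 2 * 4;   // ~13 registers / thread
constexpr int kThreadsPerBlock = 128 * 3;     // current block shape
constexpr int kRegsPerBlock    = kThreadsPerBlock * kRegsPerThread; // ~5000
```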

Simply because it will not fit into shared mem. The shared mem is 16k, which allows a maximum of N=1024. I want to be able to use this algorithm for N>3000; however, a large N will not affect the number of elements per row in the matrix (it will still be [Nx128]; my test data is generated only to give the matrix a similar structure).
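The N=1024 limit follows directly if each vector element is stored as one float4 (16 bytes), which is my reading of the code under discussion; a quick check of that assumption:

```cpp
constexpr int kSharedBytes  = 16 * 1024;         // G80 shared memory per MP
constexpr int kBytesPerElem = 4 * sizeof(float); // one float4 per element (assumed)
constexpr int kMaxN = kSharedBytes / kBytesPerElem; // 16384 / 16 = 1024
```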

BLOCK_WIDTH will always be set to a Po2; if not, I will not be able to perform the sum-reduction in this manner.

ptx can’t use shared memory as a parameter to an opcode, so it all gets converted to registers anyway.

No, you can’t do that. First of all, you’ve forgotten all the temporary variables for address calculations and all the extra registers ptxas is going to use for optimization. Plus, registers are allocated only when they are needed and dropped immediately afterward. To find out register usage, you have to get a cubin (nvcc -keep) and look at what it says inside.

Ok, you’ll have to use textures then. However, you still didn’t explain why the vector is being randomly indexed. Matrix multiplication is usually completely regular.