How to optimize this code (Whirlpool hash)? Independent parallel hashing with Whirlpool is way too slow

Hi

The attached code is a heavily stripped-down version of an application that needs an iterated Whirlpool hash for a lot of inputs.

Inside the code, each thread basically hashes its own private input data (actually it only performs the hash compression, without finalization).
There is no dependency on the results of other threads.

The hash code itself has been ported from a 32-bit version of this hash algorithm.

Each thread hashes input data by calling the hash compression function many times (typically a few thousand).

The code measures the real time used for processing the task on GPU and on CPU.

Typical input data I have translates to the command line:
./whirlpool 3000 48 4000

On my GTS250 I get about 4.8 seconds on the GPU and 23.2 seconds on the CPU, yielding a speedup of less than 5.
On my Tesla C1060 I get a speedup of about 6.

Similar code for other hash algorithms (SHA* family) gives speedups of at least 35 on the GTS250 and 50 on the Tesla.

Speeding up the function whirlpool_trafo seems especially important to me.

I have tried everything I could think of, but nothing improved the efficiency of this code.
I tried preloading data into registers, ordering the byte accesses in different schemes, using shared memory for L, accumulating results instead of using one big instruction, using the input data K word-wise, and other things.

Since I am just at the beginning of my CUDA career, I want to learn from this.
So please tell me how you would go about improving the code, which tools you would use, and why you would change which code!

Other parts of my application use shared memory for caching. This is already reserved in the kernel launch.

If this matters at all: I work on Linux.

Thanks for any help on this in advance!

First of all, run the profiler on your code to see exactly what is slowing it down.

My random guesses:

  1. You are using 8 tables of 1 KB each; try using constant memory instead of textures. You could also try using shared memory for the lookups.

  2. Check your local memory usage. Your code seems to be heavy on registers for all the temps you are using, so you might have a lot of temps spilled into local memory. You could use shared mem for some of your local vars to reduce the pressure on local memory.

  3. Try not doing byte loads (in function whirlpool_trafo you are byte-addressing array K); load 4 bytes as one unsigned int and do manual shifts (see the sketch after this list).

  4. You are using three 64-byte tables of per-thread data, but at any point in time you seem to work with only two of them, so maybe there is a way to fuse all those whirlpool_trafo calls and cut 1/3 of your per-thread temps.

  5. Not a significant issue in your case, but the memory accesses through the ctx variable are not coalesced.
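
For point 3, something like this (untested sketch; the byte order and the actual use of the bytes are from your code, little-endian layout assumed here):

__device__ void use_word(const unsigned int *K, unsigned int *L)
{
    unsigned int w = K[0];                  // one 4-byte load instead of four byte loads
    unsigned int b0 =  w        & 0xff;     // was ((unsigned char *)K)[0]
    unsigned int b1 = (w >>  8) & 0xff;     // was ((unsigned char *)K)[1]
    unsigned int b2 = (w >> 16) & 0xff;     // was ((unsigned char *)K)[2]
    unsigned int b3 = (w >> 24) & 0xff;     // was ((unsigned char *)K)[3]
    L[0] ^= b0 ^ b1 ^ b2 ^ b3;              // placeholder use of the bytes
}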

Sergey.

Hello Sergey,

thanks for your reply.

I tried that → it got much slower (by about a factor of 10). As far as I understand it, constant memory is only a win when all threads in a (half-?)warp read the same address. That is definitely not the case here.

Tried that, too → it slows the application down about 6 times.

OK, I will see what I can do. It does need some time, though.

As I mentioned in my first post, I tried that, too. If I remember correctly, the result was about two times slower.

Yes, but it seems that all three are needed and cannot be reused.

Yes, I hope that’s really not significant here. Maybe it becomes important once the code gets faster.

Thanks for your reply!

Regarding 4: Yes, you technically need 3 of them, but as far as I can see you use 2 tables as input to write 1 table as output. I’d try writing the output into the space occupied by the 2 input tables, since on the next step you are going to use the output table and 1 (out of 2) of the original input tables. That means you could get by with 2 tables plus a small extra for shuffling the temps around.
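
In code, the idea looks roughly like this (untested sketch; round_step is a placeholder for one whirlpool_trafo-style pass, not your actual function):

// placeholder round: stands in for one whirlpool_trafo-style pass
__device__ void round_step(unsigned int *out, const unsigned int *in,
                           unsigned int rc)
{
    for (int i = 0; i < 16; ++i)
        out[i] = in[i] ^ rc;                 // real code: table lookups etc.
}

__device__ void rounds_two_buffers(unsigned int *state)
{
    unsigned int bufA[16], bufB[16];         // two buffers instead of three
    unsigned int *in = bufA, *out = bufB;
    for (int i = 0; i < 16; ++i) in[i] = state[i];
    for (int r = 0; r < 10; ++r) {           // Whirlpool has 10 rounds
        round_step(out, in, r);              // reads 'in', writes 'out'
        unsigned int *t = in; in = out; out = t;  // swap roles for next round
    }
    for (int i = 0; i < 16; ++i) state[i] = in[i];
}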

Regarding 3: According to the generated assembly, each byte load is an actual load from local memory, not a register read. I would recommend comparing the generated assembly with that of code doing one 4-byte load.

For every step you try, you need to understand where the real bottleneck is (and why the result became slower). Always profile each of your attempts, and always, always look at the generated assembly, because the compiler is not as smart as you might think. For example, I’m surprised that you got it slower by using shared mem for temps instead of local mem. If done right, I can’t see a situation where it can become slower (since spilling registers to smem is always faster than spilling them to local mem), so check the generated assembly to make sure that local memory usage as well as the number of load instructions has decreased.
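
If you try the shared-mem temps again, this is the kind of layout I mean (untested sketch; THREADS_PER_BLOCK is whatever your block size is):

#define THREADS_PER_BLOCK 64              // assumed block size

__global__ void kernel_smem_temps(void)
{
    // per-thread temp K kept in smem, transposed ([word][thread]) so that
    // consecutive threads access different banks
    __shared__ unsigned int K_s[16][THREADS_PER_BLOCK];
#define K(i) K_s[(i)][threadIdx.x]
    K(0) = 0;                             // example access, like a private K[0]
    // ... whirlpool_trafo body using K(i) instead of a local array ...
#undef K
}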

And Good Luck!

Sergey.

Also, if you find texloads are faster, you could halve the number of texloads you do by using a 2D texture and doing 1 tex2D fetch instead of 2 tex1D fetches.
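
Roughly like this (untested sketch with the old texture-reference API; I am assuming your tables hold the usual hi/lo 32-bit halves of 64-bit entries, packed here into one uint2 texel per entry):

// 8 rows (one per table) x 256 columns, one uint2 per 64-bit table entry
texture<uint2, 2, cudaReadModeElementType> texC;

__device__ uint2 table_lookup(int table, int byte)
{
    // point filtering and unnormalized coordinates assumed when binding;
    // a single tex2D returns both 32-bit halves at once
    return tex2D(texC, (float)byte, (float)table);
}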

Sergey.

Thinking more about this problem, I think you should be able to do all your calculations on the chip, i.e. move your tables to smem (even with bank conflicts it should still be faster) and completely eliminate local memory usage. In this case you do not need high occupancy; you should be able to fit 8 warps onto 1 MP, and that would saturate the chip imho.
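
The footprint works out (untested sketch; assuming 16 KB of smem per MP on your cards):

__global__ void whirlpool_onchip(void)
{
    // 8 tables x 256 entries x 4 bytes = 8 KB of the 16 KB shared memory,
    // leaving ~8 KB for per-thread temps; the state itself stays in registers
    __shared__ unsigned int C_s[8][256];
    // ... load tables once per block, then hash without touching lmem ...
}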

Sergey.

I attached my shared-memory version. I hope I have not messed it up.

How do you inspect the generated assembly? I use the CUDA 3.1 toolkit.

Thanks!

A few things I can point out about your shared mem implementation:

You run 63 thread blocks with 48 threads per block, which is really a bad choice. First of all, you should always make the number of threads per block a multiple of the warp size (which is 32). Second, since your data will occupy almost all of the shared mem, you will not be able to run more than 1 block simultaneously on 1 MP, so you want to maximize the number of threads inside the thread block (which in your case is going to be limited by register usage). Judging by your register usage, you should be able to fit a 256-thread block (= 8 warps) if I’m not mistaken. The next step would be to eliminate local memory usage; I think the byte loads (aliased with 4-byte stores) are what is forcing the compiler to use local memory instead of registers.
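
So the launch would look something like this (sketch; whirlpool_kernel stands in for your kernel, and 3000 for your first command-line argument):

__global__ void whirlpool_kernel(int items) { /* your kernel body */ }

void launch(int items)                              // items: e.g. 3000
{
    int threads = 256;                              // 8 warps, a multiple of 32
    int blocks  = (items + threads - 1) / threads;  // round up so no item is dropped
    whirlpool_kernel<<<blocks, threads>>>(items);
}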

Also, the way you initialize the smem is not friendly: you are basically doing 2048 insanely slow serial memory loads from global memory with only 1 warp, which is comparable to the total number of instructions the kernel has to execute.

I use the -ptx flag to get the PTX output, and there is also a disassembler in the registered developer section, which shows you exactly what the hardware is executing.

Sergey.

I initialize the smem (C0 to C7) from constant memory (C0_c to C7_c). At least I hope so…

I mean you should utilize all threads to do the initialization. When dealing with global memory, you need to hide memory access delays as much as you can.
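
Something like this (untested sketch; C_c stands in for your C0_c…C7_c constant tables, flattened into one array for a simple strided loop):

__constant__ unsigned int C_c[8][256];      // stand-in for C0_c..C7_c

__global__ void kernel_coop_init(void)
{
    __shared__ unsigned int C_s[8][256];
    // every thread in the block helps copy the 2048 words,
    // instead of one warp doing it serially
    for (int i = threadIdx.x; i < 8 * 256; i += blockDim.x)
        C_s[i >> 8][i & 255] = C_c[i >> 8][i & 255];
    __syncthreads();                        // tables ready for all threads
    // ... hashing ...
}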

I don’t have time to look closely right now, but I have done some cryptography on CUDA too (DES, RC5 and XXTEA crackers).
I figured that, due to the pseudo-random access patterns, lookup tables are best placed in shared memory if they fit. They will never be coalesced if they are in gmem, and cmem is inefficient if threads read different addresses.
On a related note: with my microbenchmarks I was able to saturate both the MMU pipeline and gmem bandwidth with only ONE warp on a GeForce 9650M GT. So optimizing memory access patterns is all that matters.
You might also take a look at a bitslice implementation, which eliminates lookups completely. I did it with DES, and (comparing a mobile GeForce 9650M GT with a Core2 Duo) it is 3 times faster than bitslice on the CPU (but 17 times faster than the regular implementation on the CPU). Word of warning: it’s a LOT of (boring) work.
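
To give a taste of what bitslicing means (toy sketch, not Whirlpool): 32 independent instances are packed bit-per-instance into each 32-bit word, so every S-box lookup turns into a boolean circuit over whole words:

// 1-bit multiplexer s ? a : b, evaluated for 32 instances at once;
// pure logic: no table access, no divergent branches
__device__ unsigned int mux32(unsigned int s, unsigned int a, unsigned int b)
{
    return (s & a) | (~s & b);
}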

I might help you more later, when I have time to play with your code.

W.