How to optimize this code (Whirlpool hash)? Independent parallel hashing with Whirlpool is way too slow

I don’t have time to look closely right now, but I did some cryptography on CUDA too (DES, RC5 and XXTEA crackers).
I figured that, due to the pseudo-random access patterns, lookup tables are best placed in shared memory if they fit. They will never be coalesced if they are in gmem, and cmem is inefficient when threads read different addresses.
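Something like this minimal sketch (names and sizes are made up, not your code) is what I mean:

```cuda
// Sketch (assumed names and sizes): copy a 256-entry 64-bit lookup table
// from global memory into shared memory once per block, then serve the
// pseudo-random indexed reads from the shared copy.
__global__ void whirlpool_kernel_sketch(const unsigned long long *C0_gmem)
{
    __shared__ unsigned long long C0_smem[256];

    // Cooperative copy: each thread loads a slice of the table.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        C0_smem[i] = C0_gmem[i];
    __syncthreads();

    // ... later, table lookups hit shared memory instead of gmem/tex, e.g.
    // unsigned long long t = C0_smem[(unsigned char)(K0 >> 56)];
}
```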
On a related note: With my microbenchmarks I was able to saturate both MMU pipeline and gmem bandwidth with only ONE warp on a GeForce 9650M GT. So optimizing memory access patterns is all that matters.
You might also take a look at a bitslice implementation, which would eliminate the lookups completely. I did it for DES and (comparing a mobile GeForce 9650M GT with a Core2 Duo) it’s 3 times faster than bitslice on the CPU (but 17 times faster than the regular implementation on the CPU). Word of warning: it’s a LOT of (boring) work.
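To give a flavour of the bitslice idea, a toy sketch (not Whirlpool, everything here is made up for illustration):

```cuda
// Toy bitslice sketch (illustration only, not Whirlpool): words a, b, c
// each hold one bit position of 32 independent instances (bit t belongs
// to instance t), so one AND/XOR pair evaluates this "gate" for all 32
// instances with no table lookup and no memory traffic.  An S-box then
// becomes a boolean circuit built from such gates.
__device__ __forceinline__ unsigned int bitsliced_gate(unsigned int a,
                                                       unsigned int b,
                                                       unsigned int c)
{
    return (a & b) ^ c;   // 2 instructions, 32 instances at once
}
```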

I might help you more later, when I have time to play with your code.

W.

Ok, I had some time to look at your code. I can second pretty much everything sergeyn said (especially coalescing and using smem for lookups) plus:

  1. In whirlpool_compress you’re loading data[i] twice - that’s just a waste of gmem bandwidth (see the sketch after this list).
  2. In whirpool_trafo I’d try reordering the assignments: L[0]=… should be close to L[1]=… It’ll be easier for the compiler to reuse registers there.
    The only difference between your host and device functions is using textures. You don’t really use the CUDA architecture.
    IMO enabling coalescing and using smem for lookups would do the most good here.
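As a sketch of point 1 (the names and loop structure are guesses, not your actual source):

```cuda
// Sketch (assumed names): read each 64-bit input word from global memory
// exactly once, keep it in a register, and reuse the register everywhere
// the word is needed instead of issuing a second gmem load.
__device__ void load_block_sketch(const unsigned long long *data,
                                  const unsigned long long *K,
                                  unsigned long long *block,
                                  unsigned long long *state)
{
    for (int i = 0; i < 8; i++) {
        unsigned long long w = data[i];   // single gmem load per word
        block[i] = w;                     // first use
        state[i] = K[i] ^ w;              // second use reads the register
    }
}
```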

Sorry, I do not understand your suggestion.

Using tex2dfetch would allow me to do tex2Dfetch(texC, K[0], 0, 4)

instead of tex1Dfetch(texC0, K[0]) and tex1Dfetch(texC4, K[0])

but I do not think that’s an improvement.

Do you mean to pack texC0 and texC4 into a 2D array and do a 1Dfetch on that?
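Something like the following sketch is what I have in mind (the names are made up; the 64-bit entries would have to be stored as uint2, since textures have no 64-bit element type):

```cuda
// Sketch (assumed names): C0..C7 concatenated into one linear texture of
// uint2 (low/high halves of the 64-bit entries), bound once on the host
// with cudaBindTexture.  A single tex1Dfetch then serves every table.
texture<uint2, 1, cudaReadModeElementType> texC_all;

__device__ unsigned long long C_lookup(int table, int index)
{
    uint2 v = tex1Dfetch(texC_all, table * 256 + index);
    return ((unsigned long long)v.y << 32) | v.x;
}
```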

Hi wwa,

thanks for your help!

Yes, fixed (for ctx->state, not for data), but it is not a measurable improvement.

I fixed that (actually the code should use constant array indices), allowed the compiler to use more registers (by leaving out the maxrregcount argument), used only one load per dword in whirlpool_trafo, and did some manual and automatic unrolling. The new code (see attachment) even checks the results for correctness.
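For reference, the shape of that change looks roughly like this (a sketch with made-up names, not the literal attachment):

```cuda
// Sketch (assumed names, not the attached code): a fixed trip count plus
// #pragma unroll turns every K[i]/L[i] index into a compile-time constant,
// which lets the compiler keep the small arrays in registers instead of
// spilling them to local memory.  Only one of the eight table terms per
// output word is shown.
__device__ __forceinline__ void trafo_round_sketch(unsigned long long K[8],
                                                   unsigned long long L[8],
                                                   const unsigned long long *C0)
{
    #pragma unroll
    for (int i = 0; i < 8; i++)
        L[i] = C0[(unsigned char)(K[i] >> 56)];   // real code XORs in the C1..C7 terms too
}
```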

I use the architecture only for very coarse-grained parallelism. I do not have small kernels that are called a lot.

Do you think it would be a win if I only did the iteration on the host and only called whirlpool_trafo using CUDA?

smem lookups are done, coalescing is not. See the attached code.

Result: I get a factor of 6.1 on the GTS250 and 15.9 on the Tesla.

So the changes did not help much on the GTS250 but sped up the Tesla by a factor of 3.

This could point to a coalescing problem, couldn’t it?
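If I understand coalescing right, the input would have to be interleaved roughly like this (a sketch, names made up):

```cuda
// Sketch of a coalesced input layout (assumed names): interleave the
// messages so that word i of all messages is stored contiguously.
// Consecutive threads then read consecutive 64-bit words and the loads
// coalesce; per-message contiguous storage (msg[t * 8 + i]) would give a
// strided, uncoalesced pattern on GTS250-class hardware.
__global__ void hash_coalesced_sketch(const unsigned long long *msgs_interleaved,
                                      int numMsgs)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numMsgs) return;

    unsigned long long block[8];
    for (int i = 0; i < 8; i++)
        block[i] = msgs_interleaved[i * numMsgs + t];   // coalesced across the warp

    // ... run the compression function on block[] ...
}
```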

Yes, that will also work (but with a 2D texture you should save a bit on address calculation). And as I mentioned before: remove the local mem accesses and put your tables into smem to keep all your data on the chip; that’s what should give you maximum performance.
