copying device "intermediate" variables in SMEM

Hi all,

first of all, I apologise for any mistakes in my English.

I’m trying to optimize code that runs a brute force search for the character string (i.e. “the password”) corresponding to a given SHA1 message digest.

The code is organized as follows:

[codebox]int main(int argc, char* argv[])
{
    /* copy the "target" SHA1 msg digest and other parameters from host to device memory */

    while(1)
    {
        RunBruteForce_SHA1hash( grid, result );
    }
}

/******* device code ***********/

extern "C" RunBruteForce_SHA1hash( grid, result )
{
    SHA1_DoBrute<<< grid, 256 >>>( result );
}

__global__ void SHA1_DoBrute( result )
{
    /* brute force loop on all possible character combinations */
    for(i = 0; i < password_length; i++){
        .........
        Block[i] = char_to_process;

        /* sha is a data structure holding all variables and arrays needed by SHA1 */
        SHA1Input(&sha, &Block[i], 1);
    }

    SHA1_Result(&sha);

    /* compare SHA1 hash generated by the string Block[password_length] with given SHA1 hash */
    .......
}

__device__ SHA1Input()
{
}

__device__ otherSHA1functions()
{
}[/codebox]

Right now, my code has a low level of parallelism, since each thread generates a different “pwd” to check and runs the whole (device) SHA1 process to produce the corresponding hash. My question is not about how to increase the parallelism (which would need a parallel SHA1 algorithm), but “simply” about how to make my code faster (I have already unrolled every loop I could).

The variables needed by the SHA1 process are “intermediate”: they are created in device memory and destroyed without ever being mapped by the host or copied to the host. I copied some of these “intermediate” variables to shared memory, since I understood that it is the right place to hold variables if you want faster code.

For example, the array holding the message digest is an unsigned int msgDigest[5], so in SMEM it must become an unsigned int msgDigest[blockDim][5], with one row per thread in the block. This array is used in almost all the device SHA1 functions.

I define

[codebox]/* one row per thread; the size must be a compile-time constant, so 256 (the block size used in the launch) stands in for blockDim */
__shared__ unsigned msgDigest[256][5];

__global__ void SHA1_DoBrute( result ){

    /* msgDigest initialization per thread in the block */
    msgDigest[threadIdx.x][0] = 0x67452301;
    msgDigest[threadIdx.x][1] = 0xEFCDAB89;
    etc...

    for(i = 0; i < password_length; i++){
        .........
        SHA1Input(&sha, &Block[i], 1);
    }
}[/codebox]

I’m sorry if my question sounds silly, but do you think this makes sense? I ran this code, and the result was NO IMPROVEMENT at all.

Do you have any suggestions?

Thank you all in advance

…no suggestions…?? sigh sigh

I haven’t studied your code carefully or worked out exactly what you are trying to do, but my guess is that you might be having shared memory bank conflicts, which would explain why you see little or no performance improvement. Read the programming guide on how best to access shared memory.
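A rough way to test that guess, offered as a sketch and not as a diagnosis of your kernel: work out which bank each thread touches for a given per-thread stride, and compare it with a word-major layout. Everything named here is an assumption for illustration: the helper worstBankSharing, the kernel initDigestWordMajor, the THREADS_PER_BLOCK and DIGEST_WORDS constants, and the figure of 32 banks of 4-byte words (compute 1.x devices have 16 banks and service shared memory per half-warp, so pass 16 and 16 there).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // assumption: matches the <<< grid, 256 >>> launch
#define DIGEST_WORDS 5

// Hypothetical host-side helper: the largest number of threads in one group
// that land on the same bank when thread t touches word t * strideWords.
int worstBankSharing(int strideWords, int numBanks, int groupSize)
{
    int worst = 0;
    for (int bank = 0; bank < numBanks; ++bank) {
        int hits = 0;
        for (int t = 0; t < groupSize; ++t)
            if ((t * strideWords) % numBanks == bank) ++hits;
        if (hits > worst) worst = hits;
    }
    return worst;
}

// Word-major alternative: msgDigest[word][thread], so consecutive threads
// touch consecutive 32-bit words (stride 1).
__global__ void initDigestWordMajor(unsigned *out)
{
    __shared__ unsigned msgDigest[DIGEST_WORDS][THREADS_PER_BLOCK];
    const unsigned t = threadIdx.x;

    // Standard SHA1 initial hash values.
    msgDigest[0][t] = 0x67452301;
    msgDigest[1][t] = 0xEFCDAB89;
    msgDigest[2][t] = 0x98BADCFE;
    msgDigest[3][t] = 0x10325476;
    msgDigest[4][t] = 0xC3D2E1F0;

    // Copy each thread's words out so the host can inspect them.
    const unsigned gid   = blockIdx.x * blockDim.x + t;
    const unsigned total = gridDim.x * blockDim.x;
    for (int w = 0; w < DIGEST_WORDS; ++w)
        out[w * total + gid] = msgDigest[w][t];
}

int main()
{
    printf("stride 5 words: up to %d thread(s) per bank\n", worstBankSharing(5, 32, 32));
    printf("stride 4 words: up to %d thread(s) per bank\n", worstBankSharing(4, 32, 32));
    printf("stride 1 word : up to %d thread(s) per bank\n", worstBankSharing(1, 32, 32));

    const int blocks = 2;
    const int total  = blocks * THREADS_PER_BLOCK;
    unsigned *dOut = NULL;
    if (cudaMalloc((void **)&dOut, DIGEST_WORDS * total * sizeof(unsigned)) != cudaSuccess)
        return 1;

    initDigestWordMajor<<<blocks, THREADS_PER_BLOCK>>>(dOut);

    unsigned first = 0;
    cudaError_t err = cudaMemcpy(&first, dOut, sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("first digest word read back: 0x%08X\n", first);
    return 0;
}
```

Under those assumptions a stride of 5 words spreads the threads over distinct banks, because 5 shares no common factor with 16 or 32. So if the profiler agrees, the msgDigest[blockDim][5] layout on its own is probably not where the time goes, and the register and local memory usage is worth checking next.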

Also check how much local memory and how many registers you are using. Local memory slows your code down a lot, so make sure you minimize it. Also, if you are using too many registers, you will get lower occupancy, especially on older devices (compute capability < 1.2).
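Compiling with nvcc --ptxas-options=-v prints these numbers at build time. They can also be read at run time with cudaFuncGetAttributes, as in this minimal sketch. The kernel placeholderKernel is a hypothetical stand-in for SHA1_DoBrute, and its 80-word scratch array is only an assumption meant to resemble a per-thread SHA1 work buffer; whether the compiler puts it in registers or local memory depends on the device and compiler version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; replace with SHA1_DoBrute.
__global__ void placeholderKernel(unsigned *result)
{
    unsigned scratch[80];  // per-thread array; may be placed in local memory
    for (int i = 0; i < 80; ++i)
        scratch[i] = i * threadIdx.x;

    unsigned acc = 0;
    for (int i = 0; i < 80; ++i)
        acc ^= scratch[(i + threadIdx.x) % 80];  // thread-dependent indexing

    result[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main()
{
    // Query the compiled kernel's resource usage without launching it.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, placeholderKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread      : %d\n", attr.numRegs);
    printf("local memory per thread   : %lu bytes\n", (unsigned long)attr.localSizeBytes);
    printf("static shared mem / block : %lu bytes\n", (unsigned long)attr.sharedSizeBytes);
    printf("max threads per block     : %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

A non-zero local memory figure is the thing to look for first, since that is where arrays and spilled registers end up.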

If you don’t understand what I wrote above you need to study the programming guide thoroughly.

Also, what hardware are you using?
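If you are not sure, a short program like the one below reports what the runtime sees. It is only a sketch that prints the device properties most relevant here (compute capability, registers and shared memory per block), using cudaGetDeviceCount and cudaGetDeviceProperties.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No usable CUDA device found.\n");
        return 1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess)
            continue;  // skip devices that cannot be queried

        printf("device %d: %s\n", d, prop.name);
        printf("  compute capability   : %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors      : %d\n", prop.multiProcessorCount);
        printf("  registers per block  : %d\n", prop.regsPerBlock);
        printf("  shared mem per block : %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
        printf("  warp size            : %d\n", prop.warpSize);
    }
    return 0;
}
```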