Problems with a control variable

Hi Guys,

I have a problem and i need your help.

Is posible create a variable that is used as a flag or control variable of the execution of the all threads?

My problems is that implement a brute force attack of RC4 Algorithm, and divide the key space in same portions, and i need implement one variable to control the execution of all threads. The objective is that one thread find the key that decipher the file, the others threads stop and finish the global (Kernel) function.

Please help me!

Thanks!

P.S: sorry, my english is not good.

Yes, you can have a variable in global memory that is tested by each thread or block in order to return early once the key is found.

Another option is using the [font=“Courier New”]trap[/font] PTX instruction.

Thanks for the info tera, but how declare the variable (device , share , etc.) and how evaluate the variable? where declare the variable (in the kernel, off the kernel, etc.)? could initialize the variable?

I really stuck on this, and i need evaluate the variable correctly to finish all other threads when one thread found the key.

P.S: the PTX instruction is like this asm(“trap;”), i wrote my source code in C.

If your device supports concurrent copy and kernel execution (so that you can set the flag while the kernel is running), put the variable in device memory and declare it as [font=“Courier New”]volatile[/font]. When setting the flag from the host using cudaMemcpy(), you need to make sure to use a different stream so that the copy is not delayed until after the kernel finishes. On compute capability 1.x devices (particularly 1.0 and 1.1 with their strict coalescing rules), it may be worth copying the flag from device to shared memory before testing it in each thread to reduce global memory bandwidth and contention.
If your device does not support concurrent copy and execute, or if it is an integrated GPU without its own memory, put the flag into mapped host memory (and declare it as [font=“Courier New”]volatile[/font]).

tera, i have a Geforce GTX460 with a 2.1 capabilities.

The secret is declare the variable as volatile and declarate in the body of kernel function?

Declare it outside of any function, so you can easily manipulate it from the host with cudaMemcpyToSymbolAsync().

I try to explain that understand

  1. Declare the variable as volatile and initialize

  2. Copy the content of the variable from host to device with cudaMemcpyToSymbolAsync()

But, how use the variable to stop all threads?

This is my Kernel code:

__global__ void brute_force(unsigned char *texto_dev, int lt, unsigned char *clave_dev, unsigned char *descifrado_dev, unsigned char *tid_dev) {  

	

	unsigned char alfabeto[56] = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',

'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',' ',',','.','

global void brute_force(unsigned char *texto_dev, int lt, unsigned char *clave_dev, unsigned char *descifrado_dev, unsigned char *tid_dev) {

unsigned char alfabeto[56] = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',

‘A’,‘B’,‘C’,‘D’,‘E’,‘F’,‘G’,‘H’,‘I’,‘J’,‘K’,‘L’,‘M’,‘N’,‘O’,‘P’,‘Q’,‘R’,‘S’,‘T’,‘U’,‘V’,‘W’,‘X’,‘Y’,‘Z’,’ ‘,’,‘,’.‘,’\0’};

    int y, c1, c2, lc = 5, ii, gen_key = 0, valor;

int id = threadIdx.x;

int N_combinaciones = 10000;

int key_inicial = id * N_combinaciones;

int key_final = key_inicial + N_combinaciones;

/* KEY >> */

  for (ii = 0; ii < 5; ii++) {

       clave_dev[257 * id + ii] = 'X';

}

clave_dev[257 * id + 5] = ‘\0’;

/* KEY << */

for (gen_key = key_inicial; gen_key < key_final && !encontrado; gen_key++) { 

	

	valor = gen_key;  

	for (ii = 0; ii < 5; ii++) {  

		clave_dev[257 * id + 4 - ii] = '0' + (valor % 10);

		valor = valor / 10;

	}

		clave_dev[257 * id + lc] = '\0'; 

	  

	rc4_init(id, &clave_dev[257 * id], lc);  	

  		  	      

	for (y = 0; y < lt; y++) {  

	     descifrado_dev[1025 * id + y] = texto_dev[y] ^ rc4_output(id); 

	}



	for (c1 = 0; c1 < lt; c1++) {  

		for (c2 = 0; c2 < 56; c2++) {

			if (descifrado_dev[1025 * id + c1] == alfabeto[c2]) {

				break;

			}

		}

		if (c2 == 56) {  

			break;

		}

	}



	if (c1 == (lt-1)) { 

	    tid_dev[0] = (unsigned char) ('0' + id);

	    encontrado = 1;

	    break;

	    //gen_key = 100000;

	}

}

// tid_dev[0] = (unsigned char) (‘0’ + id);

}

'}; 

	

        int y, c1, c2, lc = 5, ii, gen_key = 0, valor;

	int id = threadIdx.x;

	int N_combinaciones = 10000;

	int key_inicial = id * N_combinaciones;

	int key_final = key_inicial + N_combinaciones;

/* KEY >> */	    

      for (ii = 0; ii < 5; ii++) {

           clave_dev[257 * id + ii] = 'X';

	}

clave_dev[257 * id + 5] = '

global void brute_force(unsigned char *texto_dev, int lt, unsigned char *clave_dev, unsigned char *descifrado_dev, unsigned char *tid_dev) {

unsigned char alfabeto[56] = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',

‘A’,‘B’,‘C’,‘D’,‘E’,‘F’,‘G’,‘H’,‘I’,‘J’,‘K’,‘L’,‘M’,‘N’,‘O’,‘P’,‘Q’,‘R’,‘S’,‘T’,‘U’,‘V’,‘W’,‘X’,‘Y’,‘Z’,’ ‘,’,‘,’.‘,’\0’};

    int y, c1, c2, lc = 5, ii, gen_key = 0, valor;

int id = threadIdx.x;

int N_combinaciones = 10000;

int key_inicial = id * N_combinaciones;

int key_final = key_inicial + N_combinaciones;

/* KEY >> */

  for (ii = 0; ii < 5; ii++) {

       clave_dev[257 * id + ii] = 'X';

}

clave_dev[257 * id + 5] = ‘\0’;

/* KEY << */

for (gen_key = key_inicial; gen_key < key_final && !encontrado; gen_key++) { 

	

	valor = gen_key;  

	for (ii = 0; ii < 5; ii++) {  

		clave_dev[257 * id + 4 - ii] = '0' + (valor % 10);

		valor = valor / 10;

	}

		clave_dev[257 * id + lc] = '\0'; 

	  

	rc4_init(id, &clave_dev[257 * id], lc);  	

  		  	      

	for (y = 0; y < lt; y++) {  

	     descifrado_dev[1025 * id + y] = texto_dev[y] ^ rc4_output(id); 

	}



	for (c1 = 0; c1 < lt; c1++) {  

		for (c2 = 0; c2 < 56; c2++) {

			if (descifrado_dev[1025 * id + c1] == alfabeto[c2]) {

				break;

			}

		}

		if (c2 == 56) {  

			break;

		}

	}



	if (c1 == (lt-1)) { 

	    tid_dev[0] = (unsigned char) ('0' + id);

	    encontrado = 1;

	    break;

	    //gen_key = 100000;

	}

}

// tid_dev[0] = (unsigned char) (‘0’ + id);

}

';

/* KEY << */

	for (gen_key = key_inicial; gen_key < key_final && !encontrado; gen_key++) { 

  		

  		valor = gen_key;  

		for (ii = 0; ii < 5; ii++) {  

			clave_dev[257 * id + 4 - ii] = '0' + (valor % 10);

			valor = valor / 10;

		}

    		clave_dev[257 * id + lc] = '

global void brute_force(unsigned char *texto_dev, int lt, unsigned char *clave_dev, unsigned char *descifrado_dev, unsigned char *tid_dev) {

unsigned char alfabeto[56] = {'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',

‘A’,‘B’,‘C’,‘D’,‘E’,‘F’,‘G’,‘H’,‘I’,‘J’,‘K’,‘L’,‘M’,‘N’,‘O’,‘P’,‘Q’,‘R’,‘S’,‘T’,‘U’,‘V’,‘W’,‘X’,‘Y’,‘Z’,’ ‘,’,‘,’.‘,’\0’};

    int y, c1, c2, lc = 5, ii, gen_key = 0, valor;

int id = threadIdx.x;

int N_combinaciones = 10000;

int key_inicial = id * N_combinaciones;

int key_final = key_inicial + N_combinaciones;

/* KEY >> */

  for (ii = 0; ii < 5; ii++) {

       clave_dev[257 * id + ii] = 'X';

}

clave_dev[257 * id + 5] = ‘\0’;

/* KEY << */

for (gen_key = key_inicial; gen_key < key_final && !encontrado; gen_key++) { 

	

	valor = gen_key;  

	for (ii = 0; ii < 5; ii++) {  

		clave_dev[257 * id + 4 - ii] = '0' + (valor % 10);

		valor = valor / 10;

	}

		clave_dev[257 * id + lc] = '\0'; 

	  

	rc4_init(id, &clave_dev[257 * id], lc);  	

  		  	      

	for (y = 0; y < lt; y++) {  

	     descifrado_dev[1025 * id + y] = texto_dev[y] ^ rc4_output(id); 

	}



	for (c1 = 0; c1 < lt; c1++) {  

		for (c2 = 0; c2 < 56; c2++) {

			if (descifrado_dev[1025 * id + c1] == alfabeto[c2]) {

				break;

			}

		}

		if (c2 == 56) {  

			break;

		}

	}



	if (c1 == (lt-1)) { 

	    tid_dev[0] = (unsigned char) ('0' + id);

	    encontrado = 1;

	    break;

	    //gen_key = 100000;

	}

}

// tid_dev[0] = (unsigned char) (‘0’ + id);

}

'; 

		  

  		rc4_init(id, &clave_dev[257 * id], lc);  	

  	  		  	      

  		for (y = 0; y < lt; y++) {  

		     descifrado_dev[1025 * id + y] = texto_dev[y] ^ rc4_output(id); 

		}

	

		for (c1 = 0; c1 < lt; c1++) {  

			for (c2 = 0; c2 < 56; c2++) {

				if (descifrado_dev[1025 * id + c1] == alfabeto[c2]) {

					break;

				}

			}

			if (c2 == 56) {  

				break;

			}

		}

	

		if (c1 == (lt-1)) { 

		    tid_dev[0] = (unsigned char) ('0' + id);

		    encontrado = 1;

		    break;

		    //gen_key = 100000;

		}

	}

// tid_dev[0] = (unsigned char) ('0' + id);

}

My variable encontrado (means “found” in english) is that determine if one thread found the key or not, and is my flag to stop the all others threads.

tera,

I find the solution with a good friend thats help me.

My solution is the classic:

  • Create an array with large 1 (with malloc and cudaMalloc)
  • Initializate the array with 0
  • Copy the array from host to device.
  • When the condition if (c1 == (lt-1)) is true, the value of array is 1.

Others changes:

  • The variable encontrado is changed for found[0] to use like a flag and the break; sentence, inside of the condition if (c1 == (lt-1)) is eliminated because is not necessary.
  • The condition to stop the for loop: gen_key < key_final && !encontrado is changed for gen_key < key_final && found[0] == 0.

And that’s all. All the threads stop in the same position, but in similar values. For example, if the key to decipher is 07500, the thread 0 found the key and others threads stop in 17500 (thread 1), 27500 (thread 2), etc. (my specific code, only probe keys compose of 5 numeric digits). True paralellism and syncronized.

My little question is: still consume seconds of time to found the key, but i think that’s too much to complete the task. tera, if the key is very small (only 5 numeric digits), is normally that take seconds to find the key?

Exist documentation to validate my results?

Exist documentation that expose the average time to process the similar task?

Thanks for all the help! And please, if you know any documentation to validate my results, shared with me.

Thanks tera.

I’m glad you got the solution working.

I have no experience with brute-force breaking of RC4. However I notice that nowhere in your kernel you use the block index, thus either you have only one block leaving most of the compute power of the GPU unused, or all your blocks are doing the same work, which is just as wasteful.

I understand, i explore my code to improve the performance with your tips.

Your help is very useful!

Thanks!!!