Generating 62^8 WiFi passwords in 0.5 s: I need a cloud GPU for testing

I created a simulation program in VS2022 CUDA, running solely in one __device__ function, which generates all 62^8 (about 218 billion) 8-character passwords (lower case, upper case, digits) in about 500 ms (without printing the last three passwords, rows 73-79). I then added PBKDF2 and HMAC-SHA1 functions to the loop to search for a password by a known PMKID.

For my NVIDIA GeForce GTX 960M I use:

# define blocks 4
# define threads 992
# define characters 8

When I run the program I get an error:
Kernel launch failed: too many resources requested for launch

Is there any chance to run it for free in the cloud on a GPU with 8-12 GB of memory?

Program for generating all 218 billion 8-character passwords (62^8):


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <iostream>
#include <chrono>

using namespace std;
using namespace std::chrono;

// NVIDIA GeForce GTX 960M (device 5.0) has 5 Multiprocessors * 32 threads * 128 CUDA Cores/MP = 20 480
//# define blocks 4
//# define threads 961
//# define characters 4
//# define blocks 4
//# define threads 961
//# define characters 5
//# define blocks 4
//# define threads 961
//# define characters 6
//# define blocks 4
//# define threads 992
//# define characters 7
# define blocks 4
# define threads 992
# define characters 8

cudaError_t cudaStatus;

// "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
__constant__ uint8_t charset[] = { 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f, 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7a, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39 };

__global__ void blackcat(void) {

	uint8_t password[characters];

	uint8_t counters[characters];
	uint64_t n = (pow(62, characters) / threads);	// Number of search cycles per thread

	for (int i = characters - 1; i >= 0; i--) {
		counters[i] = (n * threadIdx.x / (uint64_t)pow(62, characters - 1 - i) % 62);
	}

	while (n > 0) {

		bool flag = false;
		for (int i = characters - 1; i >= 0; i--) {
			password[i] = charset[counters[i]];
			if (i == characters - 1) {
				counters[i]++;
				if (counters[i] > 61) {
					counters[i] = (uint8_t)0;
					flag = true;
				}
			}
			else {
				if (flag) {
					counters[i]++;
					if (counters[i] > 61) {
						counters[i] = (uint8_t)0;
					}
					else {
						flag = false;
					}
				}
			}
		}

		// 960
		// Print last three generated passwords
		//if (threadIdx.x == threads - 1 && blockIdx.x == blocks - 1 && n < 4) {
		/*if (threadIdx.x == 0 && blockIdx.x == 0 && n > nn - 3 ) {
			printf("Thread[%d]",threadIdx.x);
			for (int i = 0; i < characters; i++) {
				printf(" %c", password[i]);
			}
			printf("\n");
		}*/

	/* Test whether we found the password,
	if yes, we print the password, terminate all threads and return prematurely from the function,
    it might be a good idea to print the run time occasionally so we know the program is still running */

		n--;
	}
}

int main() {

	auto start = high_resolution_clock::now();
	cudaStatus = cudaSetDevice(0);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?\n");
	}
	/*blackcat<<<1, threads>>>();*/
	blackcat<<<blocks, threads>>>();
	cudaStatus = cudaGetLastError();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
	}
	cudaStatus = cudaDeviceSynchronize();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching blackcat!\n", cudaStatus);
	}

	auto stop = high_resolution_clock::now();
	auto duration = duration_cast<microseconds>(stop - start);
	printf("\nTime us = %llx (HEX)\n", duration.count());

	return 0;
}

That probably means (in this case) that you have too many registers per thread to launch a block with 992 threads. You can probably fix this by limiting the registers per thread at compilation, so as to be able to run it on your GPU. See note below.

I believe you should be able to do that with Google Colab. You can find many write-ups on the web about how to get started using a GPU in Colab.

Note: Your kernel, as posted with things commented out, doesn’t change any global or visible state on the GPU, and so if you are building a release project, the actual kernel code will be empty, due to compiler optimization. If you build a debug project this may not happen (but I wouldn’t recommend evaluating code in any fashion based on a debug build). If you uncomment some of your in-kernel printf statements, then your kernel will no longer be “empty”.

Adding the option --resource-usage to the nvcc command will allow you to see how much memory and how many registers are being used.

Edited: To remove incorrect statement made without thinking.

With --resource-usage on the CUDA C/C++ command line, the full version of the program (with PBKDF2 & HMAC-SHA1) shows:

1>ptxas info : 31 bytes gmem, 80 bytes cmem[3]
1>ptxas info : Compiling entry function '_Z8blackcatv' for 'sm_50'
1>ptxas info : Function properties for _Z8blackcatv
1>    352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info : Used 86 registers, 320 bytes cmem[0], 208 bytes cmem[2]
1>ptxas info : Function properties for __internal_accurate_pow
1>    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

Using the Occupancy Calculator, you will need to reduce the register usage down to 64, in order to have one block of 992 threads resident.

You could pass -maxrregcount=64 during compilation. I think there is also a VS project setting for this: change "Max Used Register" from 0 to 64.

Is the 64-register figure a general CUDA limit, or a limit of my GTX 960M for 992 threads/block?

The Calculator allows you to change the Compute Capability across the different GPU generations. A check using Ampere 8.0 shows it still needs to be 64 registers.

pow() is an overloaded function and I have not checked how it is being translated here. But in general it is not a good idea to call a potentially complicated function like pow() when register pressure is high. Consider using a small table of integer powers of 62 instead.

Yep. Powers of 62 overflow a 32-bit integer after the 5th power, and overflow a 64-bit integer after the 10th power. If you divide by 1000, it overflows a 64-bit integer after the 12th power. So it's a pretty small table. Since characters is 8 and a constant, you don't need more than nine entries (62^0 through 62^8), and the usage to set n is effectively a compile-time constant.

The CUDA limit on registers per thread is 255. If you want to set a lower limit (for occupancy reasons, as is pointed out in this thread) then you can set that lower limit for the compiler in a variety of ways. The objective of setting it here, is to allow the kernel to launch, without hitting the “too many resources” error you posted in the original posting.
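One of those ways, shown here only as a sketch (not compiled against the poster's full kernel), is the per-kernel __launch_bounds__ qualifier, which caps register usage for just this kernel instead of the whole compilation unit:

```cuda
// Sketch: per-kernel alternative to the global -maxrregcount flag.
// __launch_bounds__(992) tells ptxas that blocks of up to 992 threads
// will be launched, so it must keep register usage low enough for that.
__global__ void __launch_bounds__(992) blackcat(void)
{
    // ... kernel body unchanged ...
}
```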

Note that I need to spread the ranges of generated passwords evenly over all 992 threads:

uint8_t counters[characters];
uint64_t n = (pow(62, characters) / threads);	// Number of search cycles per thread

for (int i = characters - 1; i >= 0; i--) {
	counters[i] = (n * threadIdx.x / (uint64_t)pow(62, characters - 1 - i) % 62);
}

I don’t think I can avoid using a powers-of-62 function.

log(62^8)/log(2) is 47.63 bits i.e. uint64_t is fine.

The proposal was to implement a powers-of-62 function via a lookup table:

const __constant__ uint64_t pow62_tab[9] = 
{
    1ull,
    62ull,
    3844ull,
    238328ull,
    14776336ull,
    916132832ull,
    56800235584ull,
    3521614606208ull,
    218340105584896ull
};

uint64_t n = pow62_tab [characters] / threads;

counters[i] = (n * threadIdx.x / pow62_tab[characters - 1 - i] % 62);

that can be made a compile-time constant

and your table doesn’t seem to handle the case of 0 index (returning 1 for the zero-th power).

Thanks for pointing that out. This is what happens when one writes code in the browser in between watching YouTube videos. I wonder whether the compiler can optimize pow62_tab [8] / 992 into a constant given that all data is available at compile time. I would hope it does, but have not tried it.

I’m doubtful it can do that when the “constant” is __constant__ memory. Just because you initialize __constant__ memory does not mean something else is not there when we get around to execution.

I’m fairly certain it can be done with ordinary C++ compile-time constants or constexpr.

Yeah, __constant__ is just a storage class specifier, so the table should be const __constant__ for the expression to get optimized into a constant:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define characters (8)
#define threads    (992)

const __constant__ uint64_t pow62_tab[9] = 
{
    1ull,
    62ull,
    3844ull,
    238328ull,
    14776336ull,
    916132832ull,
    56800235584ull,
    3521614606208ull,
    218340105584896ull
};

__global__ void kernel (uint64_t *res)
{
    *res = pow62_tab [characters] / threads;
}

int main (void)
{
    uint64_t n, *n_d = 0;
    cudaMalloc ((void**)&n_d, sizeof (n_d[0]));
    kernel<<<1,1>>>(n_d);
    cudaMemcpy (&n, n_d, sizeof n, cudaMemcpyDeviceToHost);
    printf ("n=%llu\n", n);
    return EXIT_SUCCESS;
}

Using CUDA 11.8 and compiling for sm_75:

        code for sm_75
                Function : _Z6kernelPy
        .headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;        /* 0x00000a0000017a02 */
                                                                 /* 0x000fc40000000f00 */
        /*0010*/                   MOV R2, 0x3f0966f8 ;          /* 0x3f0966f800027802 */
                                                                 /* 0x000fe20000000f00 */
        /*0020*/                   ULDC.64 UR4, c[0x0][0x160] ;  /* 0x0000580000047ab9 */
                                                                 /* 0x000fe20000000a00 */
        /*0030*/                   MOV R3, 0x33 ;                /* 0x0000003300037802 */
                                                                 /* 0x000fd00000000f00 */
        /*0040*/                   STG.E.64.SYS [UR4], R2 ;      /* 0x00000002ff007986 */
                                                                 /* 0x000fe2000c10eb04 */
        /*0050*/                   EXIT ;                        /* 0x000000000000794d */
                                                                 /* 0x000fea0003800000 */
        /*0060*/                   BRA 0x60;                     /* 0xfffffff000007947 */
                                                                 /* 0x000fc0000383ffff */
        /*0070*/                   NOP;                          /* 0x0000000000007918 */
                                                                 /* 0x000fc00000000000 */

An attempt to modify a const __constant__ object with cudaMemcpyToSymbol() at runtime appears to be silently ignored.


There seems to be some nuance:

$ cat t2205.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define characters (8)
#define threads    (992)

const __constant__ uint64_t pow62_tab[9] =
{
    1ull,
    62ull,
    3844ull,
    238328ull,
    14776336ull,
    916132832ull,
    56800235584ull,
    3521614606208ull,
    218340105584896ull
};

__global__ void kernel (uint64_t *res, unsigned i)
{
    *res = pow62_tab [i] / threads;
}

int main (void)
{
    uint64_t a[9] = {0};
    uint64_t n, *n_d = 0;
    cudaMemcpyToSymbol(pow62_tab, a, 9*sizeof(uint64_t));
    cudaMalloc ((void**)&n_d, sizeof (n_d[0]));
    kernel<<<1,1>>>(n_d, characters);
    cudaMemcpy (&n, n_d, sizeof n, cudaMemcpyDeviceToHost);
    printf ("n=%llu\n", n);
    return EXIT_SUCCESS;
}
$ nvcc -o t2205 t2205.cu
$ compute-sanitizer ./t2205
========= COMPUTE-SANITIZER
n=0
========= ERROR SUMMARY: 0 errors
$

In the case where the compile time constant is discovered by the compiler (your posted case), it seems evident that a cudaMemcpyToSymbol could not affect compiled code. But in the case where the compiler doesn’t discover a compile-time constant, it seems that the contents of const __constant__ do get modified.

Excellent point. This seems to be a somewhat under-specified part of the CUDA language, and it seems to me the current behavior is somewhat non-intuitive, with inconsistent behavior observed depending on whether a compiler optimization was applied or not. My test program below behaves differently depending on whether it is compiled with -G or compiler defaults. Output with -G:

characters=8, pow62_tab [characters]=218340105584896
n=220100912888
characters=8, pow62_tab [characters]=0
n=0

Output with compiler defaults:

characters=8, pow62_tab [characters]=218340105584896
n=220100912888
characters=8, pow62_tab [characters]=218340105584896
n=220100912888

Wouldn’t it be clearer and more straightforward if trying to apply cudaMemcpyToSymbol() to a const __constant__ object would result in an error?

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define characters (8)
#define threads    (992)

const __constant__ uint64_t pow62_tab[9] = 
{
    1ull,
    62ull,
    3844ull,
    238328ull,
    14776336ull,
    916132832ull,
    56800235584ull,
    3521614606208ull,
    218340105584896ull
};

uint64_t null_tab[9] = {0,0,0,0,0,0,0,0,0};

__global__ void kernel (uint64_t *res)
{
    printf ("characters=%d, pow62_tab [characters]=%llu\n", characters,
            pow62_tab [characters]);
    *res = pow62_tab [characters] / threads;
}

int main (void)
{
    uint64_t n, *n_d = 0;
    cudaMalloc ((void**)&n_d, sizeof (n_d[0]));
    kernel<<<1,1>>>(n_d);
    cudaMemcpy (&n, n_d, sizeof n, cudaMemcpyDeviceToHost);
    printf ("n=%llu\n", n);
    cudaDeviceSynchronize();
    cudaMemcpyToSymbol (pow62_tab, null_tab, 9*8, 0, cudaMemcpyHostToDevice);
    kernel<<<1,1>>>(n_d);
    cudaMemcpy (&n, n_d, sizeof n, cudaMemcpyDeviceToHost);
    printf ("n=%llu\n", n);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

I filed an RFE (4014862). I don’t know if anything will come of it.