Generating 62^8 WiFi passwords in 0.5 s: I need a cloud GPU for testing

I created a simulation program in VS2022 CUDA, running solely in one __device__ function, which generates all 62^8 (about 218 billion) 8-character passwords (lower case, upper case, digits) in about 500 ms (without printing the last three passwords, rows 73-79). I then added PBKDF2 and HMAC-SHA1 functions to the loop to search for a password by a known PMKID.

For my NVIDIA GeForce GTX 960M I use:

# define blocks 4
# define threads 992
# define characters 8

When I run the program I get an error:
Kernel launch failed: too many resources requested for launch

Is there any chance to run it for free in the cloud on a GPU with 8-12 GB of memory?

Program for generating all 218 billion 8-character passwords (62^8):


#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <iostream>
#include <chrono>

using namespace std;
using namespace std::chrono;

// NVIDIA GeForce GTX 960M (device 5.0) has 5 Multiprocessors * 32 threads * 128 CUDA Cores/MP = 20 480
//# define blocks 4
//# define threads 961
//# define characters 4
//# define blocks 4
//# define threads 961
//# define characters 5
//# define blocks 4
//# define threads 961
//# define characters 6
//# define blocks 4
//# define threads 992
//# define characters 7
# define blocks 4
# define threads 992
# define characters 8

cudaError_t cudaStatus;

// "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
__constant__ uint8_t charset[] = { 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f, 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7a, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39 };

__global__ void blackcat(void) {

	uint8_t password[characters];

	uint8_t counters[characters];
	uint64_t n = (pow(62, characters) / threads);	// Number of search cycles per thread

	for (int i = characters - 1; i >= 0; i--) {
		counters[i] = (n * threadIdx.x / (uint64_t)pow(62, characters - 1 - i) % 62);
	}

	while (n > 0) {

		bool flag = false;
		for (int i = characters - 1; i >= 0; i--) {
			password[i] = charset[counters[i]];
			if (i == characters - 1) {
				counters[i]++;
				if (counters[i] > 61) {
					counters[i] = (uint8_t)0;
					flag = true;
				}
			}
			else {
				if (flag) {
					counters[i]++;
					if (counters[i] > 61) {
						counters[i] = (uint8_t)0;
					}
					else {
						flag = false;
					}
				}
			}
		}

		// 960
		// Print last three generated passwords
		//if (threadIdx.x == threads - 1 && blockIdx.x == blocks - 1 && n < 4) {
		/*if (threadIdx.x == 0 && blockIdx.x == 0 && n > nn - 3 ) {
			printf("Thread[%d]",threadIdx.x);
			for (int i = 0; i < characters; i++) {
				printf(" %c", password[i]);
			}
			printf("\n");
		}*/

	/* Test whether we found the password,
	if yes, we print the password, terminate all threads and return prematurely from the function,
    it might be a good idea to print the run time occasionally so we know the program is still running */

		n--;
	}
}

int main() {

	auto start = high_resolution_clock::now();
	cudaStatus = cudaSetDevice(0);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?\n");
	}
	/*blackcat<<<1, threads>>>();*/
	blackcat<<<blocks, threads>>>();
	cudaStatus = cudaGetLastError();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
	}
	cudaStatus = cudaDeviceSynchronize();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching blackcat!\n", cudaStatus);
	}

	auto stop = high_resolution_clock::now();
	auto duration = duration_cast<microseconds>(stop - start);
	printf("\nTime us = %llx (HEX)\n", duration.count());

	return 0;
}

That probably means (in this case) that you have too many registers per thread to launch a block with 992 threads. You can probably fix this by limiting the registers per thread at compilation, so as to be able to run it on your GPU. See note below.

I believe you should be able to do that with Google Colab. You can find many write-ups on the web about how to get started using a GPU in Colab.

Note: Your kernel, as posted with things commented out, doesn’t change any global or visible state on the GPU, and so if you are building a release project, the actual kernel code will be empty, due to compiler optimization. If you build a debug project this may not happen (but I wouldn’t recommend evaluating code in any fashion based on a debug build). If you uncomment some of your in-kernel printf statements, then your kernel will no longer be “empty”.

Adding the option --resource-usage to the nvcc command will allow you to see how much memory and how many registers are being used.

Edited: To remove incorrect statement made without thinking.

With --resource-usage on the CUDA C/C++ command line, the full version of the program (with PBKDF2 & HMAC-SHA1) shows:

1>ptxas info : 31 bytes gmem, 80 bytes cmem[3]
1>ptxas info : Compiling entry function '_Z8blackcatv' for 'sm_50'
1>ptxas info : Function properties for _Z8blackcatv
1>    352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info : Used 86 registers, 320 bytes cmem[0], 208 bytes cmem[2]
1>ptxas info : Function properties for __internal_accurate_pow
1>    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

Using the Occupancy Calculator, you will need to reduce the register usage down to 64, in order to have one block of 992 threads resident.

You could pass -maxrregcount=64 during compilation. I think there is also a VS project setting for this: change "Max Used Register" from 0 to 64.

Is the 64-register figure a general CUDA limit, or a limit of my GTX 960M for 992 threads/block?

The Calculator allows you to change the Compute Capability across the different GPU generations. A check using Ampere 8.0 shows it still needs to be 64 registers.

pow() is an overloaded function and I have not checked how it is being translated here. But in general it is not a good idea to call a potentially complicated function like pow() when register pressure is high. Consider using a small table of integer powers of 62 instead.

Yep. Powers of 62 overflow a 32-bit integer after the 5th power, and overflow a 64-bit integer after the 10th power. If you divide by 1000, it overflows a 64-bit integer after the 12th power. So it's a pretty small table. Since characters is 8 and a constant, you don't need more than nine entries (62^0 through 62^8), and the usage to set n is effectively a compile-time constant.

The CUDA limit on registers per thread is 255. If you want to set a lower limit (for occupancy reasons, as is pointed out in this thread) then you can set that lower limit for the compiler in a variety of ways. The objective of setting it here, is to allow the kernel to launch, without hitting the “too many resources” error you posted in the original posting.
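One of those ways, shown here only as a sketch (not compiled against the poster's full kernel), is the per-kernel __launch_bounds__ qualifier, which caps register usage for just this kernel instead of the whole compilation unit:

```cuda
// Sketch: per-kernel alternative to the global -maxrregcount flag.
// __launch_bounds__(992) tells ptxas that blocks of up to 992 threads
// will be launched, so it must keep register usage low enough for that.
__global__ void __launch_bounds__(992) blackcat(void)
{
    // ... kernel body unchanged ...
}
```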

Note that I need to spread the ranges of generated passwords evenly over all 992 threads:

uint8_t counters[characters];
uint64_t n = (pow(62, characters) / threads);	// Number of search cycles per thread

for (int i = characters - 1; i >= 0; i--) {
	counters[i] = (n * threadIdx.x / (uint64_t)pow(62, characters - 1 - i) % 62);
}

I don’t think I can avoid using a powers-of-62 function.

log(62^8)/log(2) is 47.63 bits i.e. uint64_t is fine.

The proposal was to implement a powers-of-62 function via a lookup table:

const __constant__ uint64_t pow62_tab[9] = 
{
    1ull,
    62ull,
    3844ull,
    238328ull,
    14776336ull,
    916132832ull,
    56800235584ull,
    3521614606208ull,
    218340105584896ull
};

uint64_t n = pow62_tab [characters] / threads;

counters[i] = (n * threadIdx.x / pow62_tab[characters - 1 - i] % 62);

that can be made a compile-time constant

and your table doesn’t seem to handle the case of 0 index (returning 1 for the zero-th power).

Thanks for pointing that out. This is what happens when one writes code in the browser in between watching YouTube videos. I wonder whether the compiler can optimize pow62_tab [8] / 992 into a constant given that all data is available at compile time. I would hope it does, but have not tried it.

I’m doubtful it can do that when the “constant” is __constant__ memory. Just because you initialize __constant__ memory does not mean something else is not there when we get around to execution.

I’m fairly certain it can be done with ordinary C++ compile-time constants or constexpr.

Yeah, __constant__ is just a storage class specifier, so the table should be const __constant__ for the expression to get optimized into a constant:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define characters (8)
#define threads    (992)

const __constant__ uint64_t pow62_tab[9] = 
{
    1ull,
    62ull,
    3844ull,
    238328ull,
    14776336ull,
    916132832ull,
    56800235584ull,
    3521614606208ull,
    218340105584896ull
};

__global__ void kernel (uint64_t *res)
{
    *res = pow62_tab [characters] / threads;
}

int main (void)
{
    uint64_t n, *n_d = 0;
    cudaMalloc ((void**)&n_d, sizeof (n_d[0]));
    kernel<<<1,1>>>(n_d);
    cudaMemcpy (&n, n_d, sizeof n, cudaMemcpyDeviceToHost);
    printf ("n=%llu\n", n);
    return EXIT_SUCCESS;
}

Using CUDA 11.8 and compiling for sm_75:

        code for sm_75
                Function : _Z6kernelPy
        .headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;        /* 0x00000a0000017a02 */
                                                                 /* 0x000fc40000000f00 */
        /*0010*/                   MOV R2, 0x3f0966f8 ;          /* 0x3f0966f800027802 */
                                                                 /* 0x000fe20000000f00 */
        /*0020*/                   ULDC.64 UR4, c[0x0][0x160] ;  /* 0x0000580000047ab9 */
                                                                 /* 0x000fe20000000a00 */
        /*0030*/                   MOV R3, 0x33 ;                /* 0x0000003300037802 */
                                                                 /* 0x000fd00000000f00 */
        /*0040*/                   STG.E.64.SYS [UR4], R2 ;      /* 0x00000002ff007986 */
                                                                 /* 0x000fe2000c10eb04 */
        /*0050*/                   EXIT ;                        /* 0x000000000000794d */
                                                                 /* 0x000fea0003800000 */
        /*0060*/                   BRA 0x60;                     /* 0xfffffff000007947 */
                                                                 /* 0x000fc0000383ffff */
        /*0070*/                   NOP;                          /* 0x0000000000007918 */
                                                                 /* 0x000fc00000000000 */

An attempt to modify a const __constant__ object with cudaMemcpyToSymbol() at runtime appears to be silently ignored.


There seems to be some nuance:

$ cat t2205.cu
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define characters (8)
#define threads    (992)

const __constant__ uint64_t pow62_tab[9] =
{
    1ull,
    62ull,
    3844ull,
    238328ull,
    14776336ull,
    916132832ull,
    56800235584ull,
    3521614606208ull,
    218340105584896ull
};

__global__ void kernel (uint64_t *res, unsigned i)
{
    *res = pow62_tab [i] / threads;
}

int main (void)
{
    uint64_t a[9] = {0};
    uint64_t n, *n_d = 0;
    cudaMemcpyToSymbol(pow62_tab, a, 9*sizeof(uint64_t));
    cudaMalloc ((void**)&n_d, sizeof (n_d[0]));
    kernel<<<1,1>>>(n_d, characters);
    cudaMemcpy (&n, n_d, sizeof n, cudaMemcpyDeviceToHost);
    printf ("n=%llu\n", n);
    return EXIT_SUCCESS;
}
$ nvcc -o t2205 t2205.cu
$ compute-sanitizer ./t2205
========= COMPUTE-SANITIZER
n=0
========= ERROR SUMMARY: 0 errors
$

In the case where the compile time constant is discovered by the compiler (your posted case), it seems evident that a cudaMemcpyToSymbol could not affect compiled code. But in the case where the compiler doesn’t discover a compile-time constant, it seems that the contents of const __constant__ do get modified.

Excellent point. This seems to be a somewhat under-specified part of the CUDA language, and it seems to me the current behavior is somewhat non-intuitive, with inconsistent behavior observed depending on whether a compiler optimization was applied or not. My test program below behaves differently depending on whether it is compiled with -G or compiler defaults. Output with -G:

characters=8, pow62_tab [characters]=218340105584896
n=220100912888
characters=8, pow62_tab [characters]=0
n=0

Output with compiler defaults:

characters=8, pow62_tab [characters]=218340105584896
n=220100912888
characters=8, pow62_tab [characters]=218340105584896
n=220100912888

Wouldn’t it be clearer and more straightforward if trying to apply cudaMemcpyToSymbol() to a const __constant__ object would result in an error?

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define characters (8)
#define threads    (992)

const __constant__ uint64_t pow62_tab[9] = 
{
    1ull,
    62ull,
    3844ull,
    238328ull,
    14776336ull,
    916132832ull,
    56800235584ull,
    3521614606208ull,
    218340105584896ull
};

uint64_t null_tab[9] = {0,0,0,0,0,0,0,0,0};

__global__ void kernel (uint64_t *res)
{
    printf ("characters=%d, pow62_tab [characters]=%llu\n", characters,
            pow62_tab [characters]);
    *res = pow62_tab [characters] / threads;
}

int main (void)
{
    uint64_t n, *n_d = 0;
    cudaMalloc ((void**)&n_d, sizeof (n_d[0]));
    kernel<<<1,1>>>(n_d);
    cudaMemcpy (&n, n_d, sizeof n, cudaMemcpyDeviceToHost);
    printf ("n=%llu\n", n);
    cudaDeviceSynchronize();
    cudaMemcpyToSymbol (pow62_tab, null_tab, 9*8, 0, cudaMemcpyHostToDevice);
    kernel<<<1,1>>>(n_d);
    cudaMemcpy (&n, n_d, sizeof n, cudaMemcpyDeviceToHost);
    printf ("n=%llu\n", n);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

I filed an RFE (4014862). I don’t know if anything will come of it.