Brute force (as applied to computer algorithms) doesn’t mean exclusively password cracking. It means an exhaustive search of a space.

A CUDA kernel could be something like this:

```
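// si, sx, sy are the search-space extents (file-scope constants in the full listing below)
__global__ void k(){
  // one thread per (i, x, y) triple, one grid dimension per search variable
  int i = threadIdx.x+blockIdx.x*blockDim.x;
  int x = threadIdx.y+blockDim.y*blockIdx.y;
  int y = threadIdx.z+blockDim.z*blockIdx.z;
  if ((i < si) && (x < sx) && (y < sy))  // skip threads that fall outside the search space
    if (((size_t)powf((float)i,(float)x))%y == 12345)
      printf("Base of %d with power of %d, modulus %d\n",i, x, y);
}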
```

This is more or less a trivial, mechanical translation of your Python code into a possibly corresponding CUDA kernel. I’m not suggesting this is the best or proper realization. We could have a discussion around nearly every operation in the kernel.

I’m using the word trivial here to denote the idea that even a rudimentary knowledge of CUDA would allow one to make this sort of translation. I therefore don’t think it adds any value as a response to your question, which is why I didn’t mention it originally.

You can also do something quite similar in numba (yes, as you mention in your cross-posting on SO, numbapro is no longer supported, but numba is still supported). I’m not 100% certain you can use the exact realization of pow (modulus) that you have, but you could translate it to something equivalent that is supported, as I have done in the CUDA C kernel.
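One way to get an exact integer result for a modular power, rather than going through `powf` and a cast as the kernel above does, is square-and-multiply. The following is only a minimal sketch of that swapped-in technique, not something from the original code: `powmod` is a hypothetical helper name, and it assumes a nonzero modulus no larger than 2^32 so that the intermediate products fit in 64 bits.

```cuda
// Hypothetical helper (not part of the original post): computes
// (base^exp) % mod using integer square-and-multiply.
// Assumption: 0 < mod <= 2^32, so every product below fits in 64 bits.
__host__ __device__ unsigned long long powmod(unsigned long long base,
                                              unsigned long long exp,
                                              unsigned long long mod)
{
    unsigned long long result = 1 % mod;  // yields 0 when mod == 1
    base %= mod;
    while (exp > 0) {
        if (exp & 1ULL)                   // current exponent bit is set
            result = (result * base) % mod;
        base = (base * base) % mod;       // square for the next bit
        exp >>= 1;
    }
    return result;
}
```

Inside a kernel like the one above, the test would then read something like `if (y > 0 && powmod(i, x, y) == 12345)`, which also sidesteps the `y == 0` case.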

The problem with these sorts of general, open-ended questions is that it appears that you are either asking someone to write your code for you, or else you are asking someone to teach you CUDA. It’s unclear which, and furthermore, typically in my experience, nobody wants to do either one of those things for you on a forum like this one or like SO. If you know nothing at all about CUDA, the above kernel code isn’t going to help you. As I or someone else begins to explain each and every nuance of it, we might as well be teaching you CUDA. And there are plenty of online resources that allow you to do that yourself.

Again, a trivial, mechanical translation of your code is entirely possible. It’s so trivial and mechanical, however, that one wonders what would actually be useful or constructive in this context.

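In CUDA, translating a serial code that is a set of nested loops, where the work in the loop body is independent from one iteration to the next, is a trivial refactoring process: each loop variable becomes one dimension of the thread index, and the loop bounds become a range check.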
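As a sketch of that refactoring, here is a serial triple loop next to a kernel that covers the same index space. Everything named here (`hit`, `count_serial`, `count_kernel`, `count_gpu`) is hypothetical and chosen only for illustration; the loop body is a stand-in test, not the one from your code, and error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Stand-in loop body (an assumption for illustration): any test that
// depends only on the index triple (i, x, y).
__host__ __device__ inline bool hit(int i, int x, int y)
{
    return (i + 2 * x + 3 * y) % 7 == 0;
}

// Serial form: three nested loops over the search space.
unsigned long long count_serial(int ni, int nx, int ny)
{
    unsigned long long n = 0;
    for (int i = 0; i < ni; ++i)
        for (int x = 0; x < nx; ++x)
            for (int y = 0; y < ny; ++y)
                if (hit(i, x, y)) ++n;
    return n;
}

// Parallel form: each loop variable becomes one dimension of the global
// thread index, and the loop bounds become a range check.
__global__ void count_kernel(int ni, int nx, int ny, unsigned long long *n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int x = threadIdx.y + blockIdx.y * blockDim.y;
    int y = threadIdx.z + blockIdx.z * blockDim.z;
    if (i < ni && x < nx && y < ny && hit(i, x, y))
        atomicAdd(n, 1ULL);  // shared counter replaces the serial ++n
}

// Host wrapper: sizes the grid to cover the space, launches, reads back.
unsigned long long count_gpu(int ni, int nx, int ny)
{
    unsigned long long h_n = 0, *d_n = nullptr;
    cudaMalloc(&d_n, sizeof(h_n));
    cudaMemset(d_n, 0, sizeof(h_n));
    dim3 block(8, 8, 8);
    dim3 grid((ni + block.x - 1) / block.x,
              (nx + block.y - 1) / block.y,
              (ny + block.z - 1) / block.z);
    count_kernel<<<grid, block>>>(ni, nx, ny, d_n);
    cudaMemcpy(&h_n, d_n, sizeof(h_n), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_n);
    return h_n;
}
```

On a machine with a working GPU, `count_gpu(20, 30, 40)` should return the same value as `count_serial(20, 30, 40)`. Note that the y and z grid dimensions are limited to 65535 blocks, so very large extents in those dimensions would need a different mapping.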

For completeness, here is a complete worked code. I’m not going to try to “optimize” the pow operation or even determine “correctness”, because it’s evident that that is not really what you intend to do anyway.

```
$ cat t1406.cu
#include <stdio.h>
#include <math.h>
const int si = 10000;
const int sx = 10000;
const int sy = 10000;
__global__ void k(){
  int i = threadIdx.x+blockIdx.x*blockDim.x;
  int x = threadIdx.y+blockDim.y*blockIdx.y;
  int y = threadIdx.z+blockDim.z*blockIdx.z;
  if ((i < si) && (x < sx) && (y < sy))
    if (((size_t)powf((float)i,(float)x))%y == 12345)
      printf("Base of %d with power of %d, modulus %d\n",i, x, y);
}
int main(){
  dim3 block(8,8,8);
  dim3 grid((si+block.x-1)/block.x, (sx+block.y-1)/block.y, (sy + block.z -1)/block.z);
  k<<<grid,block>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t1406 t1406.cu
$ time ./t1406
real 0m19.512s
user 0m14.988s
sys 0m4.510s
$
```

I think we can agree there should be no printout from this, since anything modulo a number up to 10000 cannot ever equal 12345.