From conventional Python nested loops to CUDA to run on the GPU

Hi everyone, first of all, I need to stress this: my question may make it seem like I am asking about some sort of brute-force attack or hacking algorithm using the GPU. If I wanted to do that, I could have just downloaded a hacking tool and been done with it, so I am not. The whole point of my question is that the base, power, and modulus variables will be collected as part of a study on a certain problem. I need to sample these variables for ANALYSIS.

I need to test every base from 1 to 10000, with powers from 1 to 10000 and moduli from 1 to 10000, and print a statement whenever the result matches my expected value, which is 12345 (as an example). So the process involves about 1 trillion (10**12) iterations.

Conventionally, 1 trillion iterations will take too long. That is why I am planning to use the GPU to speed things up. Is there any way to turn the conventional code below into CUDA to run on the GPU?

for i in range(1, 10001):
    for x in range(1, 10001):
        for y in range(1, 10001):
            if pow(i, x, y) == 12345:
                print("Base of", i, "with power of", x, "modulus", y)

Many thanks!!!

Yes, a brute-force implementation like the one you have now for this embarrassingly parallel problem would be trivial in CUDA.

Too bad, my purpose is not password cracking or hacking. If it were, I could simply download a tool to do that.

The concept of the solution is similar to a brute-force attack, but this is not about cracking. The base, power, and modulus variables will be collected as part of a study on a certain problem. I need to sample these variables for analysis.

I need a practical way to write this code so I can get these variables and move on to the analysis. Please, I would welcome any constructive suggestion.

Brute force (as applied to computer algorithms) doesn’t mean exclusively password cracking. It means an exhaustive search of a space.

A CUDA kernel could be something like this:

__global__ void k(){
  // map the three loop variables onto a 3D grid of threads
  int i = threadIdx.x+blockIdx.x*blockDim.x;
  int x = threadIdx.y+blockDim.y*blockIdx.y;
  int y = threadIdx.z+blockDim.z*blockIdx.z;

  // si, sx, sy are compile-time constants (see the full code below);
  // the y > 0 test also guards against modulo by zero
  if ((i < si) && (x < sx) && (y < sy) && (y > 0))
    if (((size_t)powf((float)i,(float)x))%y == 12345)
      printf("Base of %d with power of %d, modulus %d\n",i, x, y);
}

This is more or less a trivial, mechanical translation of your python code into a possibly corresponding CUDA kernel. I’m not suggesting this is the best or proper realization. We could have a discussion around nearly every operation in the kernel.

I’m using the word trivial here to denote the idea that even a rudimentary knowledge of CUDA would allow one to make this sort of translation. I therefore don’t think it adds much value as a response to your question, which is why I didn’t mention it originally.

You can also do something quite similar in numba (yes, as you mention in your cross-posting on SO, numbapro is no longer supported, but numba is still supported). I’m not 100% certain you can use the exact realization of pow (modulus) that you have, but you could translate it to something equivalent that is supported, as I have done in the CUDA C kernel.
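For illustration, a numba version of that translation could look something like the sketch below. This is untested; it mirrors the same placeholder float-pow arithmetic as the CUDA C kernel above (with all its overflow limitations), and the launch dimensions are just one possible choice:

import math
from numba import cuda

si = sx = sy = 10000   # numba freezes these globals into the kernel

@cuda.jit
def k(target):
    i, x, y = cuda.grid(3)            # global 3D thread indices
    # bounds check; y > 0 guards against modulo by zero
    if i < si and x < sx and y < sy and y > 0:
        # placeholder arithmetic, as in the C kernel: float pow,
        # truncated to an integer, then modulo (overflows for large i, x)
        if int(math.pow(float(i), float(x))) % y == target:
            print("Base of", i, "with power of", x, "modulus", y)

threads = (8, 8, 8)
blocks = ((si + 7) // 8, (sx + 7) // 8, (sy + 7) // 8)
k[blocks, threads](12345)
cuda.synchronize()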

The problem with these sorts of general, open-ended questions is that it appears you are either asking someone to write your code for you, or else asking someone to teach you CUDA. It’s unclear which, and furthermore, in my experience, typically nobody wants to do either of those things for you on a forum like this one or like SO. If you know nothing at all about CUDA, the above kernel code isn’t going to help you. As I or someone else begins to explain each and every nuance of it, we might as well be teaching you CUDA. And there are plenty of online resources that allow you to do that yourself.

Again, a trivial, mechanical translation of your code is entirely possible. It’s so trivial and mechanical, however, that one wonders what would actually be useful or constructive in this context.

In CUDA, translating a serial code made of nested loops whose loop-body operations are independent is a trivial refactoring process.

For completeness, here is a complete worked code. I’m not going to try to “optimize” the pow operation or even determine “correctness”, because it’s evident that that is not really what you intend to do anyway.

$ cat t1406.cu
#include <stdio.h>
#include <math.h>

const int si = 10000;
const int sx = 10000;
const int sy = 10000;

__global__ void k(){
  // map the three loop variables onto a 3D grid of threads
  int i = threadIdx.x+blockIdx.x*blockDim.x;
  int x = threadIdx.y+blockDim.y*blockIdx.y;
  int y = threadIdx.z+blockDim.z*blockIdx.z;

  // bounds check; y > 0 also guards against modulo by zero
  if ((i < si) && (x < sx) && (y < sy) && (y > 0))
    if (((size_t)powf((float)i,(float)x))%y == 12345)
      printf("Base of %d with power of %d, modulus %d\n",i, x, y);
}

int main(){

  dim3 block(8,8,8);
  dim3 grid((si+block.x-1)/block.x, (sx+block.y-1)/block.y, (sy + block.z -1)/block.z);
  k<<<grid,block>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t1406 t1406.cu
$ time ./t1406

real    0m19.512s
user    0m14.988s
sys     0m4.510s
$

I think we can agree there should be no printout from this, since anything modulo a number up to 10000 cannot ever equal 12345.

I really appreciate your reply. At least I now have a clearer insight into where to dive a little deeper.

I have gone through some YouTube videos about CUDA, some of them an hour long. However, this doesn’t mean I get the complete picture of how things work (even though I conceptually understand some of it already), especially when it comes to serving my purpose in code. I have also gone through tutorials on various kernel codes in Python, but when I edited the code, it didn’t work the way I thought it would. Maybe that’s because I lack the understanding to connect the concepts with the code to make it work. As I mentioned on SO, kernel code looks very unconventional to me.

Many tutorials and code samples from different sources don’t work (because some of them are deprecated, i.e., not up to date), surprisingly including the tutorials on the NVIDIA site itself, https://developer.nvidia.com/how-to-cuda-python, which still teach numbapro even though it is no longer supported. If the sources out there were reliable, I wouldn’t be running here and there seeking ways to rewrite my code to work on the GPU.

Although your code is in C, at least I have a little insight into how my code would look in a similar fashion. Thumbs up!!

It seems you want to perform modular exponentiation. While fast methods for that exist, I am afraid they will not be quite as quick and convenient as txbob’s example code, which uses regular floating-point exponentiation. In modular exponentiation we are interested in the lowest-order bits of the mathematically correct result, while floating-point exponentiation delivers the highest-order bits of that.
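For reference, the classical fast method is binary square-and-multiply, which is essentially what Python’s built-in three-argument pow computes. A minimal pure-Python sketch:

def modpow(base, exp, mod):
    # binary (square-and-multiply) modular exponentiation: every
    # intermediate value stays reduced mod 'mod', so only the
    # low-order bits are ever computed; for exp >= 1 this matches
    # Python's built-in pow(base, exp, mod)
    result = 1
    base %= mod
    while exp > 0:
        if exp & 1:                    # current exponent bit set?
            result = result * base % mod
        base = base * base % mod       # square for the next bit
        exp >>= 1
    return result

assert modpow(7, 1234, 9973) == pow(7, 1234, 9973)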

My rule of thumb is that search spaces up to about 2**48 can be covered through brute-force computations, but even that will typically take a number of days on high-end hardware. Your search space seems to be 10**15 ~= 2**50, so you better make sure you have multiple high-end GPUs at your disposal, or your computation might stretch into weeks.
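Back-of-envelope, with a purely assumed throughput of 10**9 candidate evaluations per second on one GPU (the real rate depends entirely on the hardware and the per-candidate cost):

space = 10**15        # candidates in the search space
rate = 10**9          # assumed evaluations per second on one GPU
seconds = space / rate
print(seconds / 86400, "days")   # ~11.6 days on a single GPU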

Yes, I am hoping a Tesla K80 will shorten the computation time. The print statement can be eliminated, and the i, x, y values could instead be stored in an array that I can examine for analysis. By the way, do you have any better suggestions I could look into, especially when it comes to Python?

If you remove the print statement from the kernel and store results in an array, the previous code I showed could easily be implemented in numba (cuda.jit kernel). In fact it could probably be implemented in a numba vectorize method as well.
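An untested sketch of what that could look like follows; the modpow device function (the integer square-and-multiply njuffa described), the capacity of the output array, and all of the names here are my own choices, not anything numba mandates:

import numpy as np
from numba import cuda

@cuda.jit(device=True)
def modpow(b, e, m):
    # binary modular exponentiation, same result as pow(b, e, m)
    r = 1
    b %= m
    while e > 0:
        if e & 1:
            r = r * b % m
        b = b * b % m
        e >>= 1
    return r

@cuda.jit
def k(n, target, out, count):
    i, x, y = cuda.grid(3)
    i += 1                       # shift to the 1-based loop ranges
    x += 1
    y += 1
    if i <= n and x <= n and y <= n:
        if modpow(i, x, y) == target:
            idx = cuda.atomic.add(count, 0, 1)   # reserve a result slot
            if idx < out.shape[0]:
                out[idx, 0] = i
                out[idx, 1] = x
                out[idx, 2] = y

n = 10000
out = cuda.device_array((1024, 3), dtype=np.int32)  # guessed capacity
count = cuda.to_device(np.zeros(1, dtype=np.int32))
threads = (8, 8, 8)
blocks = tuple((n + t - 1) // t for t in threads)
k[blocks, threads](n, 12345, out, count)
cuda.synchronize()
found = int(count.copy_to_host()[0])
print(out.copy_to_host()[:min(found, out.shape[0])])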

I’m not addressing any of the valid points that njuffa raised about the actual arithmetic. As njuffa already pointed out, I’m not claiming correctness of anything I wrote, since what you wrote is clearly a placeholder of some sort.

Thanks, txbob, for your suggestion. From the very beginning, when I first started on numba, I tried the @vectorize method, but it kept returning errors. That sent me searching for other alternatives to achieve what I want.

I’m still struggling to rewrite your code in Python. By the way, the speed of your code in C really impressed me… 19 seconds!