This is a brute-force method to find Armstrong numbers. There are (at least) two levels of parallelism in this code:
1. The checks over the brute-force search domain can run in parallel because they are completely independent: testing whether a given integer i is an Armstrong number does not depend on the test for any j != i. This level of parallelism corresponds to the first for-loop in your program. For a beginner in CUDA programming, I would suggest tackling this first, because the remainder of the programming task is quite simple.
2. The computation of whether or not a given number is an Armstrong number also has inherent parallelism. Basically we must:
- take each digit and raise it to the power of the length of the number
- sum the results of raising the digits to a power
- check if the result is equal to the original number
The first two bullets above can be parallelized, albeit in two different ways. The first bullet is independent parallelism, not unlike item 1: the calculation for each digit can be performed independently, in parallel. The second bullet is a different, dependent type of parallelism called a reduction: summing a set of numbers to produce a single result. Rather than trying to explain a reduction here, I suggest searching for Mark Harris’ work on parallel reductions.
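Before parallelizing anything inside a single number, the three bullets above can be written as a plain serial check. The following is only a sketch, not your original code: the name isArmstrong is hypothetical, and it uses integer multiplication rather than pow() to sidestep floating-point rounding. The macro guard is an assumption about your workflow that lets the same function be compiled by an ordinary C++ compiler, so it can be tested on the CPU first.

```cuda
// Under nvcc these qualifiers are CUDA keywords. The guard defines them away
// so a plain C++ compiler can build the same function for CPU-side testing.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: returns true if n equals the sum of its decimal
// digits, each raised to the power of the number of digits.
__host__ __device__ inline bool isArmstrong(unsigned int n)
{
    // Count the decimal digits (0 is treated as a one-digit number).
    int numDigits = 0;
    unsigned int t = n;
    do {
        ++numDigits;
        t /= 10;
    } while (t != 0);

    // Bullets 1 and 2: raise each digit to numDigits and accumulate the sum.
    // A 64-bit accumulator is wide enough for any 32-bit input.
    unsigned long long sum = 0;
    t = n;
    do {
        unsigned long long digit = t % 10;
        unsigned long long power = 1;
        for (int k = 0; k < numDigits; ++k) {
            power *= digit;
        }
        sum += power;
        t /= 10;
    } while (t != 0);

    // Bullet 3: compare against the original number.
    return sum == n;
}
```

For example, isArmstrong(153) is true because 1^3 + 5^3 + 3^3 = 153, while isArmstrong(154) is false.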
Therefore a simple approach to handling just the parallelism in item 1 would be to create a CUDA kernel that operates on a grid (number of threads) that is the size of the domain to be searched. Each thread will check a single element in that domain. This is a linear (1D) domain so a 1D kernel could work well here. The actual kernel code would be almost exactly the remainder of the calculation code you have shown.
You could start with the CUDA vectorAdd sample code. Replace the kernel code in vectorAdd with your calculation code (i.e. with essentially everything inside the first for-loop). Make the “vector length” equal to the size of your search domain. The output vector would just be a true/false array indicating whether each number is an Armstrong number or not. Each thread would set its output position to true or false based on the result of the if(j==check) condition in your (kernel) code. You need not pass any input data to the kernel: the “input” can be computed from the thread index variable created in the vectorAdd sample code.
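A rough sketch of what that adaptation might look like is below. It is modeled on the general shape of the vectorAdd sample rather than copied from it, and the names armstrongKernel, findArmstrong, first and count are all hypothetical. The flags are stored as unsigned char (0 or 1) so the host and device agree on the element size, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per candidate. Thread i tests the integer first + i and writes
// 1 or 0 to flags[i]. The "input" is derived from the thread index alone.
__global__ void armstrongKernel(unsigned char *flags, unsigned int first,
                                unsigned int count)
{
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= count) return;  // the grid is rounded up, so guard the tail

    unsigned int n = first + i;  // caller must ensure this does not overflow

    // Count decimal digits, then sum digit^numDigits with integer arithmetic.
    int numDigits = 0;
    unsigned int t = n;
    do { ++numDigits; t /= 10; } while (t != 0);

    unsigned long long sum = 0;
    t = n;
    do {
        unsigned long long digit = t % 10;
        unsigned long long power = 1;
        for (int k = 0; k < numDigits; ++k) power *= digit;
        sum += power;
        t /= 10;
    } while (t != 0);

    flags[i] = (sum == n) ? 1 : 0;
}

// Host wrapper (hypothetical): searches [first, first + count) and returns
// the Armstrong numbers found. Returns an empty vector on a CUDA error.
std::vector<unsigned int> findArmstrong(unsigned int first, unsigned int count)
{
    std::vector<unsigned int> found;
    if (count == 0) return found;

    unsigned char *dFlags = nullptr;
    if (cudaMalloc(&dFlags, count * sizeof(unsigned char)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return found;
    }

    // Same launch arithmetic as a 1D vector kernel: round the grid size up.
    const unsigned int threadsPerBlock = 256;
    const unsigned int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;
    armstrongKernel<<<blocks, threadsPerBlock>>>(dFlags, first, count);

    // cudaMemcpy waits for the kernel and also reports errors raised by it.
    std::vector<unsigned char> hFlags(count);
    cudaError_t err = cudaMemcpy(hFlags.data(), dFlags,
                                 count * sizeof(unsigned char),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dFlags);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return found;
    }

    for (unsigned int i = 0; i < count; ++i) {
        if (hFlags[i]) found.push_back(first + i);
    }
    return found;
}
```

Calling findArmstrong(100, 900) from main() searches the three-digit numbers 100 through 999, where the Armstrong numbers are 153, 370, 371 and 407. The flag array is scanned on the host here purely for simplicity; compacting the results on the GPU is a separate exercise.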
I would suggest, for learning purposes, trying to tackle the item 1 parallelism, and get that code working first, before trying to proceed with item 2.