Slow performance with nested for loops? Very slow compilation and execution with nested for loops

Hi,
I have a kernel with 9 nested for loops (i.e. the main for loop contains 8 more for loops within it). In 7 of these loops, the number of iterations is only determined at runtime. Issues:

  1. The code takes a lot of time to compile.
  2. Even without any computations in the loop bodies, for an input with a total of about 10^7 iterations (across all the for loops), the code takes about 80 seconds. This seems like a lot to me, since when I ran the CPU version earlier it did not take nearly this long.

The number of registers per thread is 124, and I used the CUDA occupancy calculator to choose 128 threads per block, which should give me 13% occupancy.

Could anyone let me know how to improve the performance of this piece of code, or is this the maximum I can attain? Also, any tips on how to reduce the compilation time?

I could reduce the number of nested for loops by having a single for loop and recomputing the index values inside each iteration, as sketched below. However, computing the indices would itself be computationally intensive. Is this recommended?
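To illustrate what I mean, here is a rough sketch of the flattening (the kernel name and the bounds n1/n2/n3 are just placeholders, not my actual code):

[codebox]
// Sketch only: collapse three runtime-sized loops into one flat loop and
// recover the individual indices by division/modulo.
__global__ void flat_loop_sketch(int n1, int n2, int n3)
{
    long long total = (long long)n1 * n2 * n3;
    for (long long idx = 0; idx < total; idx++) {
        int i = (int)( idx / ((long long)n2 * n3));   // outermost index
        int j = (int)((idx / n3) % n2);               // middle index
        int k = (int)( idx % n3);                     // innermost index
        // ... loop body would use i, j, k here ...
        (void)i; (void)j; (void)k;
    }
}
[/codebox]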

Thanks,
akj

This is because the compiler aggressively unrolls the loops. You can try to prevent unrolling some of the loops with [font=“Courier New”]#pragma unroll 1[/font] immediately before the [font=“Courier New”]for[/font] statement. This will definitely speed up compilation, and hopefully also execution by preventing the instruction cache from overflowing.
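For example (with generic loop variables, since I haven't seen your kernel):

[codebox]
// Repeat the pragma immediately before each loop whose unrolling you want to suppress.
__global__ void example(int n, int m)
{
    #pragma unroll 1
    for (int i = 0; i < n; i++) {
        #pragma unroll 1
        for (int j = 0; j < m; j++) {
            // ... loop body ...
        }
    }
}
[/codebox]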

Thank you for the reply. I added the pragma unroll directive, and as you had expected, it has reduced the compilation time. However, the execution time has increased significantly. In fact it has already been 10 minutes and the code is still executing.

I do know the size of the loop at compile time for two of the nine nested for loops. Is there an optimization I can apply given this?

Thank you,
akj

Try to prevent unrolling more of the loops - it might be that the instruction cache still overflows.

Hi Tera. Thank you for the response. Do you mean that I should prevent unrolling of as many loops as possible? If so, that is what I have done: all my loops now have the #pragma unroll 1 directive before them.

In the meantime my code is still executing, and it has already been about 40 minutes :(

-akj

Does it work right?

Yes, that's what I meant - at least for testing purposes. If it had turned out to work, it would have required a bit of experimentation to find out how many loops to unroll.

Can you have a look at the sizes of the ptx file without (and possibly with, if that doesn’t take too long) unrolling?

I’m a bit confused though - haven’t you stated that you removed all code from inside the loops? If so, the compiler should easily detect that and completely optimize the loops away. So I wonder if something else is going wrong here.

Thank you for the replies. I had actually left a few increment statements inside the loops; now those have been removed as well.

With this new code, when I checked the ptx file, I found that "without" the pragma unroll 1 directive there was simply no code left within the function.

"With" the pragma unroll 1 directive, the ptx file is larger (it appears to contain all the for loops and their index calculations, written out once).

The execution time for both cases is 74 s (timed using cudatimer on the host). I do not understand why it should take so long to execute.

If anyone would like me to attach the ptx files for the two cases, I can do that.

Any ideas on what might be going wrong or any other tests that I should do ?

No, don’t remove them! We do not want the compiler to fully optimize the loops away, because then we cannot learn anything about the problem.

Yes, that’s as expected.

Interesting - so the pragma also seems to instruct the compiler not to optimize the loop away (maybe [font=“Courier New”]#pragma unroll 0[/font] would do that? At least if you take it literally… :) )

So does the ptx file have a massive size then, corresponding to the 10[sup]7[/sup] loop iterations?

Ah, so the execution time problem clearly comes from something else then, if even an empty kernel takes that long. How many times do you launch that kernel?

As the problem seems unrelated, I don’t see a point at the moment.

I'd look at the code that calls the kernel. Is the kernel called in a loop? If not, does a second call within the same program again take 74 s? (Note that the driver still has to compile the ptx to device code on the first invocation - so 74 s would seem reasonable for the massive unrolled ptx file, but not for an empty kernel.)
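Something along these lines (just a sketch - the kernel name, launch configuration and arguments are placeholders for yours):

[codebox]
// Time two consecutive launches of the same kernel. The first launch includes
// any driver-side JIT compilation of the ptx; the second one should not.
unsigned int t1 = 0, t2 = 0;
cutCreateTimer(&t1);
cutCreateTimer(&t2);

cutStartTimer(t1);
my_kernel<<<grid, block>>>(args);   // placeholder for your kernel call
cudaThreadSynchronize();            // wait for completion before stopping the timer
cutStopTimer(t1);

cutStartTimer(t2);
my_kernel<<<grid, block>>>(args);
cudaThreadSynchronize();
cutStopTimer(t2);

printf("first launch : %f ms\n", cutGetTimerValue(t1));
printf("second launch: %f ms\n", cutGetTimerValue(t2));
[/codebox]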

Thanks a lot for the reply, Tera, and I am very sorry about the delayed response. I had gotten stuck with another task.

After adding some instructions back into the loops, the program takes a whole lot of time to execute (840 s). Also, with that code added, the ptx files for the unrolled and rolled cases are not very different.

No. In fact, there are only about 200 lines of code in the ptx file, even with some code inside the for loops. I really suspect that there is another issue apart from the unrolling one.

Just once. In fact it is the only kernel that I am executing.

The kernel is not called in a loop. When called again, the kernel “again” takes 74 s.

I too suspect that something else is going wrong here. The issue is that I am not sure how to figure out what, since even with a single kernel invocation and nothing inside the loops, the program takes so long to execute. And with just minimal code inside the for loops, the program takes 840 s. All this with the ptx file not really being very large. I feel really lost…

Yes, I also believe that loop unrolling isn’t the problem here. Can you post your kernel code?

My code consists of CPU code scattered across various files. Therefore, I am trying to shorten it as much as possible, and will soon post the kernel code. Thanks…

After my kernel there was a cudaThreadSynchronize call, and I was measuring the time after that. When I instead measured the time before cudaThreadSynchronize was called, it was very small… However, the program still takes a lot of time after that point to execute completely…

I am posting my code below:

[codebox]
#include <cuda.h>
#include <stdio.h>
#include <math.h>
#include <cutil.h>
#include <cutil_inline_runtime.h>

#define complex_double double2
#define ANG_RES 100

__global__ void test_dev(complex_double *diag_terms)
{
    int theta_count, phi_count;
    int face_calc;
    int face = 1;
    int l, m, lp, mp, cnt, cntp;
    int r_ind;
    int tid;
    int i, j, k;

    double sph_x, sph_y, sph_z;
    double face_x, face_y, face_z;
    double dist_self;
    double del_phi, del_theta;
    double theta_self, phi_self;

    float ii_self, jj_self, kk_self;
    float cube_x, cube_y, cube_z;

    tid = blockIdx.x * blockDim.x + threadIdx.x;

    /* TIND_TO_VOX_XYZ(tid, i, j, k, info_stat.nX, info_stat.nY, info_stat.nZ);
       i = i + info_stat.bounX;
       j = j + info_stat.bounY;
       k = k + info_stat.bounZ;
       r_ind = i * (info_stat.nY + 2*info_stat.bounY) * (info_stat.nZ + 2*info_stat.bounZ)
             + j * (info_stat.nZ + 2*info_stat.bounZ) + k;
    */

    for (ii_self = -4.5; ii_self <= 4.5; ii_self++) {
        for (jj_self = -4.5; jj_self <= 4.5; jj_self++) {
            for (kk_self = -4.5; kk_self <= 4.5; kk_self++) {
                for (theta_count = 0; theta_count < ANG_RES; theta_count++) {
                    theta_self = theta_count * M_PI / ANG_RES;
                    for (phi_count = 0; phi_count < ANG_RES; phi_count++) {
                        phi_self = phi_count * 2.0 * M_PI / ANG_RES;
                        sph_x = cos(phi_self) * sin(theta_self);
                        sph_y = sin(phi_self) * sin(theta_self);
                        sph_z = cos(theta_self);
                    }
                }
            }
        }
    }
}

int main(int argc, char **argv)
{
    cudaSetDevice(3);

    int size = 125;
    complex_double *diag_terms_dev;

    cutilSafeCall(cudaMalloc(&diag_terms_dev, sizeof(complex_double) * size));
    cutilSafeCall(cudaMemset(diag_terms_dev, 0, sizeof(complex_double) * size));

    unsigned int timer = 0;
    cutCreateTimer(&timer);
    cutResetTimer(timer);
    cutStartTimer(timer);

    test_dev<<<1, 128>>>(diag_terms_dev);

    cutStopTimer(timer);   // stopped before synchronizing, so this does not include the kernel time
    printf("Time taken on the kernel is %f ms \n", cutGetTimerValue(timer));

    cudaThreadSynchronize(); // This part takes a lot of time

    cudaFree(diag_terms_dev);

    return 0;
}
[/codebox]

I think it is still the kernel that is taking very long… Any input on why this is the case and how I could reduce this time?

Looks like I might have just figured out what the issue with this particular code was… I was compiling the code with the -g -G flags (for debug purposes). When I removed these flags, the kernel executed very fast…
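For reference, the two compile lines I was comparing look roughly like this (file names are placeholders, and the real lines also include the SDK include/library paths for cutil):

[codebox]
# debug build: -G disables device-code optimization, which is what made the kernel so slow
nvcc -g -G -o test_debug test.cu

# release build: no -g -G
nvcc -O3 -o test test.cu
[/codebox]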

Thank you all for all the help, especially Tera.

I might come back to this thread, since I feel I may still have issues due to these nested for loops and with how to best optimize my program. Should I start a new thread in that case, or just continue with this one?

There are a lot of things to improve in this code.
Moreover, you are wrong: those options did not help.

Hi Lev, I would be glad to know the various things I could improve in this code, as I am also looking to improve it.

Also, the options did help me. Can you send me your compile and linking flags?

In fact, you only need to compute 200 angles, so why not compute the 200 cos and sin values first, using all the threads in the thread block, and then store them in shared memory?

You can then think about how to simplify the index computation.

This would save the cost of cos and sin, because cos and sin are time-consuming.
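Something like this at the start of your kernel (just a sketch, reusing ANG_RES from your posted code; the table names are mine):

[codebox]
// Precompute the ANG_RES sine/cosine values once per block and keep them in
// shared memory, instead of calling sin/cos inside the innermost loops.
__shared__ double s_sin_theta[ANG_RES], s_cos_theta[ANG_RES];
__shared__ double s_sin_phi[ANG_RES],   s_cos_phi[ANG_RES];

// all threads of the block cooperate to fill the tables
for (int a = threadIdx.x; a < ANG_RES; a += blockDim.x) {
    double theta = a * M_PI / ANG_RES;
    double phi   = a * 2.0 * M_PI / ANG_RES;
    s_sin_theta[a] = sin(theta);  s_cos_theta[a] = cos(theta);
    s_sin_phi[a]   = sin(phi);    s_cos_phi[a]   = cos(phi);
}
__syncthreads();

// then, inside the nested loops, replace the sin/cos calls with lookups:
//   sph_x = s_cos_phi[phi_count] * s_sin_theta[theta_count];
//   sph_y = s_sin_phi[phi_count] * s_sin_theta[theta_count];
//   sph_z = s_cos_theta[theta_count];
[/codebox]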

Thank you for the suggestion, LSChien, but I am not sure I understood it correctly.

I will write a different kernel and post it here for review soon.

A simpler change that would realize part of that speedup would be to just move the loops over theta and phi, together with the sine and cosine computations, outside of the other loops.
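I.e. roughly like this (a sketch, keeping your variable names):

[codebox]
// With theta/phi outermost, each sine/cosine pair is evaluated only
// ANG_RES*ANG_RES times instead of once per (ii,jj,kk) combination as well.
for (theta_count = 0; theta_count < ANG_RES; theta_count++) {
    theta_self = theta_count * M_PI / ANG_RES;
    for (phi_count = 0; phi_count < ANG_RES; phi_count++) {
        phi_self = phi_count * 2.0 * M_PI / ANG_RES;
        sph_x = cos(phi_self) * sin(theta_self);
        sph_y = sin(phi_self) * sin(theta_self);
        sph_z = cos(theta_self);

        for (ii_self = -4.5; ii_self <= 4.5; ii_self++)
            for (jj_self = -4.5; jj_self <= 4.5; jj_self++)
                for (kk_self = -4.5; kk_self <= 4.5; kk_self++) {
                    // ... rest of the loop body ...
                }
    }
}
[/codebox]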

There are so many things. For example, mixing double precision and single precision - depending on the compiler options, it may matter a lot. Using the high-precision sin and cos functions instead of the approximations. Using separate sin and cos calls instead of sincos. But none of that is as important as your grid and block size.

test_dev<<<1,128>>>(diag_terms_dev); - you are simply using a small part of the GPU. By the way, in release mode, since you do not write anything to memory, the compiler just removes all the code; that is why it is so fast.
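For example (just a sketch; a real version would also pick the grid size to match your problem):

[codebox]
// sincos() gives both values in one call instead of separate sin() and cos():
__global__ void sincos_sketch(double theta_self, double phi_self)
{
    double s_theta, c_theta, s_phi, c_phi;
    sincos(theta_self, &s_theta, &c_theta);
    sincos(phi_self,   &s_phi,   &c_phi);

    double sph_x = c_phi * s_theta;
    double sph_y = s_phi * s_theta;
    double sph_z = c_theta;
    (void)sph_x; (void)sph_y; (void)sph_z;   // silence unused-variable warnings
}

// and launch enough blocks to actually fill the GPU, e.g.
//   test_dev<<<num_blocks, 128>>>(diag_terms_dev);   // with num_blocks much larger than 1
[/codebox]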