Limits on vector width for large loop

This is just a general question, but one I figure I should know the answer to. I have a very large piece of code that I’m running on GPUs, and I’ve been experimenting a bit with the scheduling of the big outer loop. I’m fairly certain the best I can do is “parallel, vector(32)” on that outer loop. Being logical and all, after seeing the success of 32 I then tried “parallel, vector(64)”, and saw in the compiler output that the schedule was reduced to just “parallel”. So I’m wondering: what determines this “endpoint” of the vector width? I assume it has to do with the physical limits of the accelerator card (available registers would be my guess), but I’d rather hear from those in the know.
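For context, the schedule in question looks roughly like this (a simplified sketch; the loop bounds and array names are placeholders, not the real code). My understanding is that “vector(32)” maps iterations onto CUDA thread blocks of 32 threads, which is exactly one warp on these cards:

```fortran
!$acc region
!$acc do parallel, vector(32)
      do i = 1, n
         ! placeholder body; the real loop is far larger
         a(i) = b(i) + c(i)
      end do
!$acc end region
```

Since a warp is 32 threads, 32 is a natural width; a larger width multiplies the per-block resource needs (registers, shared memory), which is one plausible reason a compiler or runtime would refuse it.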

Also, as an aside, whatever I do inside this loop never seems to matter and/or is never referred to by the compiler. That is, I can see this:

    593, Loop is parallelizable
    602, Loop is parallelizable
    656, Loop is parallelizable

on compiling. If I go into the code and add “!$acc do parallel” on those inner loops, I then see:

    594, Loop is parallelizable
    604, Loop is parallelizable
    658, Loop is parallelizable

but no explicit output saying it actually did parallelize them. Should I then assume that it simply can’t, and that while the loops might be parallelizable, they’re still run sequentially?


Hi Matt. I’m a little puzzled, but I’ll try to explain what might be happening.
I’m guessing that the inner loops are not tightly nested in the big outer loop. The way the compiler works now is to follow the CUDA / OpenCL kernel model pretty closely, so only a tightly nested loop nest can be parallelized. If you have an outer loop with one or more inner loops, those inner loops can’t be parallelized; each parallel loop or parallel loop nest has to be turned into a single kernel. We’re looking at ways to extend the model, but there are serious limits on what the hardware can support.
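To illustrate what I mean by “tightly nested” (with made-up loops, not your code):

```fortran
!$acc region
! This nest IS tightly nested: nothing sits between the do statements,
! so both loops can be combined into one kernel.
!$acc do parallel
      do j = 1, m
!$acc do vector(32)
         do i = 1, n
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do

! This nest is NOT tightly nested: the assignment between the two
! do statements breaks the nest, so only the outer loop can become
! the kernel and the inner loop runs sequentially inside it.
      do j = 1, m
         s(j) = 0.0            ! intervening work between the loops
         do i = 1, n
            s(j) = s(j) + a(i,j)
         end do
      end do
!$acc end region
```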
Your primary question is about the vector(32) or vector(64). That’s quite a bit more puzzling. I’ve tried to reproduce it with examples here, and was unable to do so. It shouldn’t work that way, so there must be something wrong in the logic of the compiler. Your example program would really help here, but we’ll keep trying to find the problem.
Thanks for the feedback.

Dr Wolfe,

Thanks for the reply. I’m going to send a sample tarball that demonstrates this problem, with a note asking that it be forwarded on to you (I’m not too sure I can make this code public yet). When you see the code, you’ll see that I fused the original code into one big loop. I’m wondering if, for best use on accelerators, that was the wrong move? The loop fusion did lower the memory needs by a lot (it removed a dimension that can be of order 1000 or 10000 at times), but I can’t really re-loop and fuse the second dimension due to some unavoidable loop dependencies. (Well, unavoidable as far as I can tell; I can usually only spot the simple, obvious ones that can be changed.)
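Roughly, the fusion looked like this (a simplified, hypothetical sketch; the arrays and names are made up, not from the real code):

```fortran
! Before: a temporary carries the extra k dimension (order 1000-10000),
! written by one loop nest and consumed by another.
      do k = 1, nk
         do i = 1, n
            tmp(i,k) = f(i,k)
         end do
      end do
      do k = 1, nk
         do i = 1, n
            out(i) = out(i) + tmp(i,k)
         end do
      end do

! After fusing over k, the temporary's k dimension disappears
! and everything happens inside one pass over i.
      do i = 1, n
         do k = 1, nk
            out(i) = out(i) + f(i,k)
         end do
      end do
```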

It is also possible that the schedule I asked for (parallel, vector(32)) kills the math, as I don’t seem to get very accurate results compared to the original. But “do parallel” alone leads to the same answers, and this code, at least, should be embarrassingly parallel across the outer loop.

As for the vector(32) and vector(64), this example should demonstrate it. At this point, I yield to your expertise!

Ah! Dr Wolfe, I might have an answer to your confusion about the vector width business. My previous attempts had all used PGI 10.1. However, with the snow abating here in DC, I was able to get 10.2 installed this morning. Upon doing so, I am now able to specify other widths for the vector statement, and those are used by the compiler. (Even to the point of idiocy on my part: a call to cuLaunchGrid returned error 701: Launch out of resources.)

This problem might be related to an issue I had off-forum with Mat, wherein “!$acc do kernel” did not work for me previously (in 10.1, I think). That was fixed in the development build at the time, and perhaps that fix resolved this issue as well.