This is a general question, but one I figure I should know the answer to. I have a very large piece of code running on GPUs, and I’ve been experimenting with the schedule of the big outer loop. I’m fairly certain that the best I can do is “parallel, vector(32)” on the outer loop. After seeing the success of 32, the logical next step was to try “parallel, vector(64)”, but the compiler output showed that this was reduced to just “parallel”. What determines this upper limit on vector width? I assume it comes from the physical limits of the accelerator card (available registers would be my guess), but I’d rather hear it from those in the know.
Also, as an aside, whatever I do inside this loop never seems to matter, or at least the compiler never reports it. On compiling, I see this:
593, Loop is parallelizable
602, Loop is parallelizable
656, Loop is parallelizable
If I go into the code and add “!$acc do parallel” on those inner loops, I then see:
594, Loop is parallelizable
604, Loop is parallelizable
658, Loop is parallelizable
but there is no explicit output saying the compiler actually parallelized them. Should I then assume that it just can’t, and that although the loops are parallelizable, they are still run sequentially?
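To make the question concrete, here is a minimal sketch of the kind of loop structure described above. It is not the real code, which is much larger: the program name, the arrays a and b, the sizes n and m, and the loop body are all placeholders invented for illustration. Only the two directives, “parallel, vector(32)” on the outer loop and “!$acc do parallel” on an inner loop, come from the description, and the enclosing “!$acc region” is assumed.

```fortran
program sched_sketch
  implicit none
  ! Placeholder sizes and arrays; none of these names come from the real code.
  integer, parameter :: n = 1024, m = 64
  real :: a(m, n), b(m, n)
  integer :: i, j

  b = 1.0

  !$acc region
  ! Outer loop schedule discussed above; vector(64) was reported
  ! by the compiler as plain "parallel".
  !$acc do parallel, vector(32)
  do i = 1, n
     ! Inner loop: reported as "parallelizable" with or without this directive.
     !$acc do parallel
     do j = 1, m
        a(j, i) = 2.0 * b(j, i)
     end do
  end do
  !$acc end region

  print *, sum(a)
end program sched_sketch
```

A compiler that does not recognize the directives treats them as ordinary comments, so the sketch also builds and runs as plain sequential Fortran.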