“loop is parallelizable” indicates the compiler has done its dependency analysis and determined that the loop could be parallelized. It doesn’t mean it will be parallelized, just that it can be.
“Accelerator kernel generated” means the compiler created device code; the next few lines of the feedback messages indicate how it was scheduled.
Also, if I declare “routine seq” ahead of a function and leave all the function code unmodified, I guess all the loops inside the function would not be parallelized by the compiler automatically. Is that correct?
Correct. If you want to expose parallelism inside a device routine, use the “routine vector” or “routine worker” directives instead of “seq”, then use the “loop” directive to indicate which loops to parallelize.
Note that “vector” routines can only be called from a “gang” or “worker” loop, and “worker” routines only from a “gang” loop. Also, when using a “vector” or “worker” routine, call it from within a “parallel” compute region and use the “num_workers” or “vector_length” clauses. The compiler can’t tell at compile time how many workers or vector lanes a loop in a routine will use, since that depends on where the routine is called from. Hence, it defaults to 32 on an NVIDIA target if the number of workers and the vector length are not set.