I have a piece of F90 code that looks something like this (this is representative code only):
do i = 1, n1
   do j = 1, n2
      r(j,i) = r(j,i) + m(i)*s(j,i)
   end do
end do
The loop bound n2 is generally 5, and n1 is generally around 10^6. When I compile without OpenMP directives (v11.7), I get a compiler message that looks like this:
55, Generated vector sse code for the loop
Generated 2 prefetch instructions for the loop
Residual loop unrolled 1 times (completely unrolled)
So the compiler is vectorizing the outermost loop. However, when I add OpenMP directives around the outermost loop, I get the following message:
52, Parallel region activated
55, Parallel loop activated with static block schedule
57, Loop not vectorized: loop count too small
Loop unrolled 5 times (completely unrolled)
These messages lead me to believe that the compiler will vectorize OR parallelize the loop, but not both. In other words, it does not parallelize the loop and then vectorize what gets executed on each thread. If this is true, it would mean that I am losing the benefits of vectorization when running under OpenMP.
Can anyone confirm this and, if it is true, suggest a workaround?
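For anyone who wants to reproduce this, here is a minimal self-contained sketch of the OpenMP variant described above. It is not the real code: the program name, the initial values, the exact directive placement, and the final check are assumptions made only so the example compiles and runs on its own. The array shapes and loop bounds follow the sizes quoted in the post.

```fortran
program omp_loop_sketch
  implicit none
  ! Sizes quoted in the post: n2 is about 5, n1 is about 10^6.
  integer, parameter :: n1 = 1000000, n2 = 5
  real, allocatable :: r(:,:), s(:,:), m(:)
  integer :: i, j

  allocate(r(n2,n1), s(n2,n1), m(n1))

  ! Arbitrary initial values (assumption), chosen so the result is easy to check.
  r = 1.0
  s = 2.0
  m = 3.0

  ! OpenMP directive around the outermost loop (assumed placement).
  ! Each thread gets a block of i values; j is private to each thread.
  !$omp parallel do private(j)
  do i = 1, n1
     do j = 1, n2
        r(j,i) = r(j,i) + m(i)*s(j,i)
     end do
  end do
  !$omp end parallel do

  ! Every element should now be 1 + 3*2 = 7.
  if (all(r == 7.0)) then
     print '(a)', 'ok'
  else
     print '(a)', 'mismatch'
  end if
end program omp_loop_sketch
```

OpenMP has to be enabled at compile time for the directive to take effect (for example -fopenmp with gfortran; the flag is compiler-specific). Without it the directive lines are treated as comments and the loop nest runs serially, which gives the two cases compared above.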