Hi,

I have noticed that with PGI 14.2.0 and earlier, when the async(i) clause is used on an OpenACC parallel region, no more than 8 kernels appear to execute concurrently. I am running on an NVIDIA K20X, which allows 32 concurrent streams on the GPU. Is this a known restriction? The output below is from a simple test in which each kernel runs on a single GPU thread (num_gangs(1) vector_length(1)); it shows the run time stepping up in groups of 8 streams. For each test, n kernels were launched with async(i), with i ranging over [0,n).

Total time for 1 kernels: 1311.234000 ms

Total time for 2 kernels: 1065.492000 ms

Total time for 3 kernels: 1065.252000 ms

Total time for 4 kernels: 1065.318000 ms

Total time for 5 kernels: 1065.248000 ms

Total time for 6 kernels: 1065.305000 ms

Total time for 7 kernels: 1065.297000 ms

Total time for 8 kernels: 2128.664000 ms

Total time for 9 kernels: 2128.663000 ms

Total time for 10 kernels: 2128.679000 ms

Total time for 11 kernels: 2280.262000 ms

Total time for 12 kernels: 2187.881000 ms

Total time for 13 kernels: 2128.892000 ms

Total time for 14 kernels: 2128.909000 ms

Total time for 15 kernels: 2128.735000 ms

Total time for 16 kernels: 3192.994000 ms

Total time for 17 kernels: 3192.979000 ms

Total time for 18 kernels: 3193.067000 ms

Total time for 19 kernels: 3193.120000 ms

Total time for 20 kernels: 3193.024000 ms

Total time for 21 kernels: 3193.026000 ms

Total time for 22 kernels: 3193.026000 ms

Total time for 23 kernels: 3193.033000 ms

Total time for 24 kernels: 4257.317000 ms

Total time for 25 kernels: 4257.311000 ms

Total time for 26 kernels: 4257.326000 ms

Total time for 27 kernels: 4257.344000 ms

Total time for 28 kernels: 4257.247000 ms

Total time for 29 kernels: 4257.539000 ms

Total time for 30 kernels: 4257.382000 ms

Total time for 31 kernels: 4257.342000 ms

Total time for 32 kernels: 5321.640000 ms

A similar test written in CUDA produces the following expected results:

Total time for 1 kernels: 1218.388000 ms

Total time for 2 kernels: 1000.558000 ms

Total time for 3 kernels: 1000.881000 ms

Total time for 4 kernels: 1000.591000 ms

Total time for 5 kernels: 1000.598000 ms

Total time for 6 kernels: 1000.622000 ms

Total time for 7 kernels: 1000.632000 ms

Total time for 8 kernels: 1000.659000 ms

Total time for 9 kernels: 1000.666000 ms

Total time for 10 kernels: 1000.689000 ms

Total time for 11 kernels: 1000.818000 ms

Total time for 12 kernels: 1000.659000 ms

Total time for 13 kernels: 1000.703000 ms

Total time for 14 kernels: 1000.706000 ms

Total time for 15 kernels: 1000.716000 ms

Total time for 16 kernels: 1000.721000 ms

Total time for 17 kernels: 1000.746000 ms

Total time for 18 kernels: 1000.756000 ms

Total time for 19 kernels: 1001.056000 ms

Total time for 20 kernels: 1000.761000 ms

Total time for 21 kernels: 1000.799000 ms

Total time for 22 kernels: 1000.805000 ms

Total time for 23 kernels: 1000.836000 ms

Total time for 24 kernels: 1000.840000 ms

Total time for 25 kernels: 1000.846000 ms

Total time for 26 kernels: 1000.860000 ms

Total time for 27 kernels: 1001.327000 ms

Total time for 28 kernels: 1001.451000 ms

Total time for 29 kernels: 1001.165000 ms

Total time for 30 kernels: 1000.928000 ms

Total time for 31 kernels: 1000.937000 ms

Total time for 32 kernels: 1000.967000 ms

Total time for 33 kernels: 2001.787000 ms
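
To make the stepping explicit, here is a small sketch of the pattern in the two tables. It is only my own fit to the timings above, not anything taken from the PGI or CUDA runtimes, and the function names are made up for illustration. The OpenACC totals come out at roughly (n / 8 + 1) times about 1064 ms, while the CUDA totals stay at one unit of about 1000 ms up to 32 kernels and double at 33. The n = 1 rows are a bit higher in both tables, presumably from start-up overhead.

```c
/* Hypothetical helpers: an empirical fit to the timings listed above,
 * not a description of how either runtime actually schedules kernels. */

/* OpenACC table: the total steps up each time n reaches a multiple of 8
 * (1 unit for n = 1..7, 2 units for n = 8..15, ..., 5 units at n = 32). */
int openacc_time_units(int n)
{
    return n / 8 + 1;
}

/* CUDA table: 1 unit for n = 1..32, 2 units at n = 33. */
int cuda_time_units(int n)
{
    return (n + 31) / 32;
}
```

For example, openacc_time_units(16) gives 3, matching the roughly 3193 ms measured for 16 kernels, while cuda_time_units(16) gives 1.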

I can provide the reproducer code if desired. The OpenACC code is compiled with the following flags: -acc -ta=nvidia,kepler
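
In the meantime, here is a minimal sketch of the kind of test described above. It is not the actual reproducer: the function name run_async_kernels, the busy loop, and the iters parameter are assumptions chosen for illustration. Each kernel runs on one gang with vector length 1 and is placed on its own async queue i.

```c
/* Illustrative sketch only; names and the busy loop are assumptions.
 * Launches n single-thread kernels, kernel i on async queue i, then waits
 * for all of them. out must have room for n doubles. */
void run_async_kernels(int n, long iters, double *out)
{
    #pragma acc data copyout(out[0:n])
    {
        for (int i = 0; i < n; ++i) {
            /* One gang, one vector lane: each kernel is a single GPU thread. */
            #pragma acc parallel async(i) num_gangs(1) vector_length(1) present(out)
            {
                double x = 0.0;
                /* Serial recurrence to keep the thread busy;
                 * x converges toward 2.0 as iters grows. */
                for (long k = 0; k < iters; ++k)
                    x = 0.5 * x + 1.0;
                out[i] = x + i;
            }
        }
        /* Block until every async queue has drained. */
        #pragma acc wait
    }
}
```

Calling this for n = 1 to 32 with a large iters value and a wall-clock timer around each call should give per-n totals in the same form as the tables above. Without OpenACC enabled the pragmas are ignored and the loop simply runs serially on the host.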

Thanks for any help,

Adam Simpson