cudaLaunchKernel returned status 1: invalid argument

hi, if i have the following simple acc region

!$acc parallel num_gangs(5)
 print*,"abc"
!$acc end parallel

it compiles and runs as expected. but, if i do the following:

!$acc parallel num_gangs(5) num_workers(3)
 print*,"abc"
!$acc end parallel

it compiles without any warnings or errors. but if i run i get

line X: cudaLaunchKernel returned status 1: invalid argument

where X is the line number of

!$acc parallel num_gangs(5) num_workers(3)

can you help me find why i get this cudaLaunchKernel error?

Not sure where this is coming from, but you can work around it by also setting vector_length (see below). Workers and vectors won’t be used here, though, since you have no parallel loops. Instead, “gang-redundant mode” applies (i.e. each gang executes the print redundantly), so using “num_workers” here isn’t really valid.

% cat test.f90
program foo
!$acc parallel num_gangs(4) num_workers(3) vector_length(1)
 print*,"abc"
!$acc end parallel
end program foo
% pgfortran -ta=tesla test.f90 -Minfo=accel; a.out
foo:
      2, Generating Tesla code
 abc
 abc
 abc
 abc

-Mat

Thanks. Actually, I think “num_gangs” together with “num_workers” should be valid, of course, if I am not missing anything. I made up this example based on a similar one (Figure 15.5) in “Programming Massively Parallel Processors: A Hands-on Approach” by D.B.Kirk and W.W.Hwu, which is as follows:

#pragma acc parallel copyout(a) num_gangs(1024) num_workers(32)
{
 a=23;
}

Am I missing anything in here?

So, just to clarify for myself: “parallel” without a loop only has “num_gangs” copies executed, independent of the “num_workers” and “vector_length” sizes, whereas “parallel loop num_gangs(a) num_workers(b) vector_length(c)” would divide the total number of loop iterations into “a*b*c” independent chunks. Am I right?

So, just to clarify for myself: “parallel” without a loop only has “num_gangs” copies executed, independent of the “num_workers” and “vector_length” sizes, whereas “parallel loop num_gangs(a) num_workers(b) vector_length(c)” would divide the total number of loop iterations into “a*b*c” independent chunks.

Correct. A parallel region without a work-shared loop (i.e. without a loop directive) runs in “gang-redundant mode”: each gang executes the same executable statements. So it’s not that using “num_workers” here is invalid, it’s just not applicable in this case.
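One way to see gang-redundant mode for yourself is to count how often the region body runs. This is only a rough sketch in C (so it also builds with OpenACC disabled); the function name count_region_executions is made up for illustration, and vector_length(1) is there only to mirror the workaround shown earlier in the thread.

```c
/* Hypothetical helper, not part of any library: counts how many times the
   body of a loop-free parallel region is executed. In gang-redundant mode
   the body runs once per gang; num_workers does not add further copies. */
int count_region_executions(int gangs, int workers)
{
    int count = 0;

    /* No loop directive inside, so every gang runs the block redundantly. */
    #pragma acc parallel num_gangs(gangs) num_workers(workers) vector_length(1) copy(count)
    {
        /* Atomic update so concurrent gangs do not lose increments. */
        #pragma acc atomic update
        count++;
    }

    return count;
}
```

Built without OpenACC the pragmas are ignored and the function returns 1. On a device I would expect it to return the number of gangs actually launched, whatever num_workers is, though the runtime is free to use fewer gangs than requested.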

If you did have one or more loops with a gang, worker, and/or vector clause, then, yes, the loop iterations would be divided across the total number of gangs, workers, and vector lanes. Best practice, though, is not to set the number of gangs, workers, or vector lanes explicitly, except when tuning a specific algorithm, and instead to let the compiler and runtime choose these values based on the target device. Different target devices may need different values, so hard-coding these sizes will reduce performance portability.
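For comparison, here is a minimal sketch of a work-shared loop that leaves the sizes to the compiler and runtime. The saxpy routine and its argument names are just an illustration, not something from the original question; the loop carries gang, worker, and vector clauses but no num_gangs, num_workers, or vector_length.

```c
/* Illustrative routine (assumed for this example): y[i] = a * x[i] + y[i].
   The iterations are shared across gangs, workers, and vector lanes; how
   many of each is left to the compiler and runtime. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop gang worker vector copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Each iteration is independent, so the result should be the same however the iterations get distributed, and with OpenACC disabled it is simply a serial loop.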

-Mat

thank you!