cudaLaunchKernel returned status 1: invalid argument

mr0202 · March 24, 2020, 5:36am

Hi, if I have the following simple acc region:

!$acc parallel num_gangs(5)
 print*,"abc"
!$acc end parallel

it compiles and runs as expected. But if I do the following:

!$acc parallel num_gangs(5) num_workers(3)
 print*,"abc"
!$acc end parallel

it compiles without any warnings or errors. But when I run it, I get:

line X: cudaLaunchKernel returned status 1: invalid argument

where X is the line number of

!$acc parallel num_gangs(5) num_workers(3)

Can you help me find out why I get this cudaLaunchKernel error?

MatColgrove · March 24, 2020, 2:29pm

Not sure where this is coming from, but you can work around it if vector_length is also set (see below). Note that workers and vectors won't be used here since you have no parallel loops. Instead, “gang-redundant mode” applies (i.e. each gang will execute the print redundantly), so using “num_workers” here isn’t really valid.

% cat test.f90
program foo
!$acc parallel num_gangs(4) num_workers(3) vector_length(1)
 print*,"abc"
!$acc end parallel
end program foo
% pgfortran -ta=tesla test.f90 -Minfo=accel; a.out
foo:
      2, Generating Tesla code
 abc
 abc
 abc
 abc
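
For anyone who wants to see gang-redundant mode more directly, here is a rough sketch in C with OpenACC pragmas. The function name count_region_executions is made up for illustration and is not from the thread. It only sets num_gangs, so it does not exercise the num_workers case that triggers the launch error. Assuming the region is offloaded, the returned count should match the number of gangs actually launched, which an implementation may choose to be lower than requested. If the pragmas are ignored (OpenACC not enabled), the body simply runs once.

```c
#include <assert.h>

/* Count how many times the body of a loop-less parallel region runs.
   In gang-redundant mode every gang executes the body, so each gang
   adds 1 to the shared counter. */
int count_region_executions(int requested_gangs)
{
    assert(requested_gangs > 0);
    int count = 0;

    /* copy(count) keeps one shared counter; without it the scalar
       would default to a private copy per gang. */
    #pragma acc parallel num_gangs(requested_gangs) copy(count)
    {
        #pragma acc atomic update
        count += 1;
    }
    return count;
}
```

Calling count_region_executions(4) mirrors the four "abc" lines printed above: one execution per gang, regardless of any worker or vector settings.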

-Mat

mr0202 · March 24, 2020, 5:30pm

Thanks. Actually, I think “num_gangs” together with “num_workers” should be valid, of course, if I am not missing anything. I made up this example based on a similar one (Figure 15.5) in “Programming Massively Parallel Processors: A Hands-on Approach” by D.B.Kirk and W.W.Hwu, which is as follows:

#pragma acc parallel copyout(a) num_gangs(1024) num_workers(32)
{
 a=23;
}

Am I missing anything in here?

So, just to clarify for myself, “parallel” without loop, only has “num_gangs” copies executed, independent of “num_workers” and “vector_length” sizes, whereas “parallel loop num_gangs(a) num_workers(b) vector_length(c)” would divide the total number of loop executions into “a*b*c” independent chunks. Am I right?

MatColgrove · March 25, 2020, 2:12pm

So, just to clarify for myself, “parallel” without loop, only has “num_gangs” copies executed, independent of “num_workers” and “vector_length” sizes, whereas “parallel loop num_gangs(a) num_workers(b) vector_length(c)” would divide the total number of loop executions into “a*b*c” independent chunks.

Correct. A parallel region without a work-shared loop (i.e. without a loop directive) is run in “gang-redundant mode”, where each gang executes the same executable statements. So it’s not that using “num_workers” here is invalid; it’s just not applicable in this case.

If you did have one or more loops with a gang, worker, and/or vector clause, then, yes, it would divide the loop across the total number of gangs, workers, and vectors. Though best practice is to not explicitly set the number of gangs, workers, or vectors except when tuning a specific algorithm, and instead let the compiler and runtime choose these values based on the target device. Different target devices may need different values, so setting these sizes will reduce performance portability.
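
As a rough sketch of the two styles in C with OpenACC pragmas (the function names saxpy_default and saxpy_tuned and the sizes 4, 2, and 32 are illustrative assumptions, not recommendations): the first version leaves the launch configuration to the compiler and runtime, while the second pins it explicitly. Both should produce the same result; only the way iterations are distributed differs.

```c
#include <assert.h>

/* y[i] = a * x[i] + y[i]; gangs, workers, and vector length are
   left for the compiler and runtime to pick for the target device. */
void saxpy_default(int n, float a, const float *x, float *y)
{
    assert(n > 0);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Same loop with explicit sizes: the iterations are shared across
   up to 4 gangs * 2 workers * 32 vector lanes. Hard-coded values
   like these are tuning choices and may not suit another device. */
void saxpy_tuned(int n, float a, const float *x, float *y)
{
    assert(n > 0);
    #pragma acc parallel loop gang worker vector \
        num_gangs(4) num_workers(2) vector_length(32) \
        copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

If OpenACC is not enabled, the pragmas are ignored and both functions run as ordinary serial loops.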

-Mat

mr0202 · March 25, 2020, 5:26pm

thank you!
