Large grids return cuStreamSynchronize error 700


I am running the following code using OpenACC.
The OpenACC loop runs fine on the GPU for grids up to 50x50. For larger grids, say 60x60, it returns the following error:

call to cuEventElapsedTime returned error 700: Illegal address during kernel execution

I also see a library reported as “not found” in the output.

I have tried the same code for a 60x60 grid on a single CPU thread and it runs fine, so it’s not an error in the code itself.

My system details: Ubuntu 14.04, Tesla K40c, pgcc 15.5
About the code :
Build: cd Tools/ ; make all -f makefile_gpu
Run: cd …/Fitzhugh_Generic/ ; python2.7

The size of the grid can be specified by changing the second-to-last line of the script as: tor.sweep_allInC(60);

The OpenACC code starts in the function sweeptraces() in Tools/fitzhugh.c, which is called from Fitzhugh_Generic/ via Tools/

Any help is greatly appreciated.


Hi Krishna,

This is pretty cool. I haven’t tried calling PGI compiled OpenACC from python before. Glad to hear that it works (at least for the 50x50 case). However, I’ll need to get Python 2.7 and Scipy installed on my system before I can recreate the error.

Typically when I see this type of error it’s due to an array size growing over 2GB. Granted, 60x60 doesn’t sound too big, but try adding “-Mlarge_arrays” to your compile flags to see if it fixes the problem.

The library reported as “not found” is a CUDA profiling library brought in when you use the “-ta=tesla:time” option. Set the LD_LIBRARY_PATH environment variable to include “$PGI/linux86-64/2015/cuda/6.5/lib64/” to have it loaded, or remove the “time” option. Note that we’re encouraging folks not to use “time” and instead set the environment variable “PGI_ACC_TIME=1” to enable profiling information. That way, you don’t need to recompile to turn it off.

  • Mat

Hi Mat,

Thanks for the response.
Yes, the array size could be larger than 2GB, since a lot of computation is performed for each point in the 60x60 grid. I have tried the -Mlarge_arrays flag, but it still doesn’t resolve the issue.

I have added a test case within the C file itself, so you should be able to run the OpenACC C code directly without Python.
From:
get Tools/fitzhugh.c and Fitzhugh_Generic/isa.txt

In fitzhugh.c, specify the location of the downloaded isa.txt file in the main function in this line:
isa = fopen(" … ", "r");

You can also change the size of the grid in this line under main():
sweeptraces(4, initial_states_array, p, coupling_strengths, trajectory_targets_phasedifferences, 3600, (double)3/1000, 21275, 30);

The number ‘3600’ (the 6th parameter) specifies the 60x60 grid (60 × 60 = 3600).
The test case should run fine for any value of this parameter up to 3600. It runs fine for 2500, but for 3600 it gives the 700 error.

Thanks so much again for all the help.


While trying to debug this, I have come across the following:

#include <stdio.h>
int main(){
double *initial_states_array;int n;
printf("Enter size of grid\n");
scanf("%d", &n);
initial_states_array = (double*)malloc(8*n*n*sizeof(double));
printf("ISA last %lf \n", initial_states_array[8*n*n-1]);
}

This code, when compiled with gcc, runs fine for any input value, say 50. But with pgcc the same code gives a segmentation fault. What am I doing wrong here?

Hi Krishna,

But with pgcc the same gives a segmentation fault. What am I doing wrong here?

I think this warning is your issue:

% pgcc -fast testBig.c -V15.5
PGC-W-0155-Pointer value created from a nonlong integral type  (testBig.c: 7)
PGC/x86-64 Linux 15.5-0: compilation completed with warnings

We don’t include stdlib.h by default, so malloc gets an implicit prototype returning int and the 64-bit pointer it returns is truncated. Including stdlib.h fixes the issue.

% cat testBig.c
#include <stdlib.h>
#include <stdio.h>
 int main(){
 double *initial_states_array;int n;
 printf("Enter size of grid\n");
 scanf("%d", &n);
 initial_states_array = (double*)malloc(8*n*n*sizeof(double));
 printf("ISA last %lf \n", initial_states_array[8*n*n-1]);
 }

% pgcc -fast testBig.c -V15.5
% a.out
Enter size of grid
ISA last 20.000000

Note that this appears to be part of the problem for the larger case, but there’s more going on. I’m investigating.

  • Mat

Hi Mat,

Yes, including stdlib fixes the segmentation fault issue for the short code.

I have also now updated the file fitzhugh.c with the test case, so it can be run by itself without needing any other file.
You can enter the size of the grid ‘n’ as, say, 50 or 60.
Each point in the nxn grid now gets the same input conditions, so essentially the same computation is performed at all nxn points.
As before, it runs fine for 50x50 but not for 60x60.


The problem with the main code is the “output” array. It has 85,100 elements of type “double”. Since “output” is privatized, every thread gets its own copy, which the compiler implements by allocating one large array. For an input size of 60, 3,712 threads are created, so the private “output” copies total 85,100 × 3,712 × 8 bytes, or ~2.3GB.

The compiler should be accounting for the large private array when “-Mlarge_arrays” is used, but isn’t for this case. Hence, I’ve added TPR#21720 and sent it to our engineers for further investigation.

The workaround is to either limit the number of threads by adding “num_gangs(16) vector_length(128)” clauses to your loop schedule, or manually privatize “output” by adding a second dimension to the array (sized to “noOfInitialStates”) and then copying it to the device.

Hope this helps,

Hi Mat,

Great! Manual privatization solves the issue! Finally everything is working now.
Thank you so much for all the help. Really appreciate it.