Large grids return cuStreamSynchronize error 700

krishna_pusuluri · June 9, 2015, 10:11pm

Hi,

I am running the following code using OpenACC.
The OpenACC loop on the GPU runs fine for grids up to 50x50. For larger grids, say 60x60, it returns the following error:

call to cuEventElapsedTime returned error 700: Illegal address during kernel execution

I also see this in output: libcupti.so not found

I have tried the same code for a 60x60 grid on a single CPU thread and it runs fine, so it doesn’t look like an error in the code itself.

My system details: Ubuntu 14.04, Tesla K40c, pgcc V15.5
About the code :
https://bitbucket.org/pusuluri_krishna/genericmotifs/src
Build : cd Tools/ ; make all -f makefile_gpu
Run : cd …/Fitzhugh_Generic/ ; python2.7 torus_4.py

Size of the grid can be specified by changing the second last line of torus_4.py as : tor.sweep_allInC(60);

OpenAcc code starts from the function sweeptraces() under Tools/fitzhugh.c which is called by Fitzhugh_Generic/torus_4.py via Tools/fitzhugh.py

Any help is greatly appreciated.

Thanks,
Krishna.

MatColgrove · June 11, 2015, 12:40am

Hi Krishna,

This is pretty cool. I haven’t tried calling PGI compiled OpenACC from python before. Glad to hear that it works (at least for the 50x50 case). However, I’ll need to get Python 2.7 and Scipy installed on my system before I can recreate the error.

Typically when I see this type of error, it’s due to the array size growing over 2GB. Granted, 60x60 doesn’t sound too big, but try adding “-Mlarge_arrays” to your compile flags to see if it fixes the problem.

“libcupti.so” is a CUDA profiling library that gets brought in when you use the “-ta=tesla:time” option. Set the LD_LIBRARY_PATH environment variable to include “$PGI/linux86-64/2015/cuda/6.5/lib64/” to have it loaded, or remove the “time” option. Note that we’re encouraging folks to not use “time” and instead use the environment variable “PGI_ACC_TIME=1” to enable profiling information. That way, you don’t need to recompile to turn it off.

Mat

krishna_pusuluri · June 11, 2015, 4:11am

Hi Mat,

Thanks for the response.
Yes, the array size could be larger than 2 GB, since a lot of computations are performed for each point in the 60x60 grid. I have tried the -Mlarge_arrays flag but it still doesn’t resolve the issue.

I have added a test case within the C file itself, so you should be able to run the OpenACC C code directly without Python:
From https://bitbucket.org/pusuluri_krishna/genericmotifs/src :
get Tools/fitzhugh.c and Fitzhugh_Generic/isa.txt

In fitzhugh.c specify the location of the downloaded isa.txt file in the main function in this line :
isa = fopen(" … ","r");

Also you can change the size of the grid in this line under main() :
sweeptraces(4, initial_states_array, p, coupling_strengths,trajectory_targets_phasedifferences,3600,
(double)3/1000, 21275, 30);

The number ‘3600’ (6th parameter) here specifies the 60x60 grid.
This test case should run fine for all values of this parameter up to 3600. It runs fine for 2500, but for 3600 it gives error 700.

Thanks so much again for all the help.

-Krishna.

krishna_pusuluri · June 11, 2015, 5:53am

While trying to debug this, I have come across the following:

#include <stdio.h>
int main(){
double *initial_states_array;int n;
printf("Enter size of grid\n");
scanf("%d",&n);
initial_states_array = (double*)malloc(8*n*n*sizeof(double));
initial_states_array[8*n*n-1]=20;
printf("ISA last %lf \n", initial_states_array[8*n*n-1]);
}

This code, when compiled with gcc, runs fine for any input value, say 50. But with pgcc the same code gives a segmentation fault. What am I doing wrong here?

MatColgrove · June 11, 2015, 3:09pm

Hi Krishna,

But with pgcc the same gives a segmentation fault. What am I doing wrong here?

I think this warning is your issue:

% pgcc -fast testBig.c -V15.5
PGC-W-0155-Pointer value created from a nonlong integral type  (testBig.c: 7)
PGC/x86-64 Linux 15.5-0: compilation completed with warnings

We don’t include stdlib.h by default so malloc’s prototype gets implicitly defined. Including stdlib.h fixes the issue.

% cat testBig.c
#include <stdlib.h>
#include <stdio.h>
 int main(){
 double *initial_states_array;int n;
 printf("Enter size of grid\n");
 scanf("%d",&n);
 initial_states_array = (double*)malloc(8*n*n*sizeof(double));
 initial_states_array[8*n*n-1]=20;
 printf("ISA last %lf \n", initial_states_array[8*n*n-1]);
 }


% pgcc -fast testBig.c -V15.5
% a.out
Enter size of grid
60
ISA last 20.000000

Note that this appears to be part of the problem for the larger case, but there’s more going on. I’m investigating.
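To make the warning above a little more concrete: when malloc has no prototype in scope, older C rules treat it as returning int, so on a 64-bit system the returned pointer may be cut down to 32 bits before the cast. The sketch below only simulates that truncation on a plain integer; the helper name truncate_like_implicit_int and the sample addresses are made up for illustration, and it does not reproduce exactly what pgcc generates.

```c
#include <stdint.h>

/* Hypothetical helper: mimics a 64-bit pointer value passing through a
   32-bit int, as can happen with an implicitly declared malloc(). */
uint64_t truncate_like_implicit_int(uint64_t address)
{
    /* Keep only the low 32 bits, as a 32-bit int return value would.
       (Conversion of out-of-range values to int32_t is implementation-
       defined; common compilers wrap around.) */
    int32_t as_int = (int32_t)(uint32_t)address;

    /* Widening the int back to pointer size sign-extends it. */
    return (uint64_t)(int64_t)as_int;
}
```

Under these assumptions, an address such as 0x00007f3a12345678 comes back as 0x12345678, which no longer points at the allocated block, while addresses below 2^31 pass through unchanged.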

Mat

krishna_pusuluri · June 11, 2015, 5:53pm

Hi Mat,

Yes, including stdlib fixes the segmentation fault issue for the short code.

I have also now updated the file fitzhugh.c with the test case so it can be run by itself without need for any other file.
https://bitbucket.org/pusuluri_krishna/genericmotifs/src/
Tools/fitzhugh.c
You can enter the size of the grid ‘n’ as, say, 50 or 60.
For each point in the nxn grid, now it basically gives the same input conditions and essentially performs the same computations for all nxn points.
As before, runs fine for 50x50 but not 60x60.

Thanks,
Krishna.

MatColgrove · June 11, 2015, 6:49pm

The problem with the main code is the “output” array. It has 85,100 elements of type “double”. Since “output” is privatized, every thread gets its own copy, which the compiler provides by allocating one large array. For an input size of 60, 3712 threads are created. Hence the “output” array is 85,100*3,712*8 bytes, or ~2.3GB.
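The size estimate above can be checked with a few lines of C. This is only a sketch of the arithmetic; the function names are made up here, and the 2GB threshold is taken as 2^31 bytes, which is an assumption about where the limit sits.

```c
#include <stdint.h>

/* Total bytes needed when every thread gets its own private copy. */
uint64_t private_array_bytes(uint64_t elements, uint64_t threads,
                             uint64_t elem_size)
{
    return elements * threads * elem_size;
}

/* Returns 1 if the total is at or above 2^31 bytes (assumed 2GB limit). */
int exceeds_2gb(uint64_t bytes)
{
    return bytes >= ((uint64_t)1 << 31);
}
```

With 85,100 elements, 3712 threads and 8-byte doubles, private_array_bytes returns 2,527,129,600 bytes, which is over the 2^31-byte mark and works out to roughly 2.35 when divided by 2^30.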

The compiler should be accounting for the large private array when “-Mlarge_arrays” is used, but isn’t for this case. Hence, I’ve added TPR#21720 and sent it to our engineers for further investigation.

The workaround is to either limit the number of threads by adding the “num_gangs(16) vector_length(128)” clauses to your loop schedule, or manually privatize “output” by adding a second dimension to the array (sized to “noOfInitialStates”) and then copying it to the device.
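A minimal sketch of the manual privatization idea, assuming a made-up per-state computation in place of the real one in sweeptraces(). The function name sweep_manual_private, the parameter outputLen, and the loop body are hypothetical; only “output”, “noOfInitialStates”, and the clause names come from this thread. The scratch array is placed on the device with a create clause here, which is one possible choice rather than the one used in the actual code.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the sweep: each initial state i gets its own
   row of "output" instead of relying on a private() clause. */
int sweep_manual_private(int noOfInitialStates, int outputLen, double *results)
{
    size_t total = (size_t)noOfInitialStates * (size_t)outputLen;
    double *output = malloc(total * sizeof *output);
    if (output == NULL)
        return -1;

    /* Alternative from the thread: keep "output" private and cap the
       thread count with num_gangs(16) vector_length(128) on this loop. */
    #pragma acc parallel loop create(output[0:total]) copyout(results[0:noOfInitialStates])
    for (int i = 0; i < noOfInitialStates; i++) {
        double sum = 0.0;
        /* Run the inner loop sequentially within each thread. */
        #pragma acc loop seq
        for (int j = 0; j < outputLen; j++) {
            /* Second dimension: row i belongs to initial state i only. */
            size_t idx = (size_t)i * (size_t)outputLen + (size_t)j;
            output[idx] = (double)i + 0.5 * (double)j;  /* placeholder work */
            sum += output[idx];
        }
        results[i] = sum;
    }

    free(output);
    return 0;
}
```

Without an OpenACC compiler the pragmas are ignored and the loop simply runs on the host. For the other option, 16 gangs of 128 threads is 2048 threads, which by the same arithmetic as above would put the private copies at about 1.4GB.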

Hope this helps,
Mat

krishna_pusuluri · June 11, 2015, 9:18pm

Hi Mat,

Great! Manual privatization solves the issue! Finally everything is working now.
Thank you so much for all the help. Really appreciate it.

Regards,
Krishna.
