New compiler gives error

Hi,

I have been using the PGI Fortran compiler, version 10.6, to run my linear solver on Tesla GPUs. Recently, I installed the trial version of the latest compiler, 13.2, and now the very same code gives "not enough memory" errors on problems that ran fine with 10.6.
The total memory used by the code when run on the CPU alone is about 300 MB, so I fail to see why the Tesla C2050 GPU would run out of memory.

0: ALLOCATE: 2299968 bytes requested; not enough memory: 30(unknown error)

0: ALLOCATE: 1823360 bytes requested; not enough memory: 4(unspecified launch failure)

I perform some operations on the GPU before these errors occur; the declaration that triggers them is:

      real,device,dimension(ni,nj,nk,nb)::d_diff,d_phi

If I make the arrays allocatable, only the "not enough memory" part of the error goes away; the code still exits at the allocation statement.
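For reference, a minimal sketch of the allocatable variant with an explicit status check (the array names and shape follow the declaration above; the stat-based error handling is my assumption, not code from the original program):

```fortran
! Allocatable device arrays with an explicit status check, so an
! allocation failure can be reported instead of aborting the run.
real, device, allocatable, dimension(:,:,:,:) :: d_diff, d_phi
integer :: istat

allocate(d_diff(ni,nj,nk,nb), d_phi(ni,nj,nk,nb), stat=istat)
if (istat /= 0) then
   write(*,*) 'device allocation failed, stat = ', istat
   stop
end if
```

Note that a failure reported at an allocate statement can also be a deferred error from an earlier kernel launch, so the stat value may reflect a problem upstream of the allocation itself.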

Is there something wrong that I am doing here, or is it an issue with the latest compiler?

Thanks,
Amit

Hi Amit,

The newer PGI CUDA Fortran versions use newer CUDA toolkits (4.2, 5.0), which also need newer CUDA driver versions installed. My best guess is that you simply need to update your driver.

What is the output from the 'pgaccelinfo' utility? It will tell us which driver version you have installed.

  • Mat

Hi Mat, the output from the pgaccelinfo utility is as follows:

CUDA Driver Version: 5000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012

CUDA Device Number: 0
Device Name: Tesla C2050
Device Revision Number: 2.0
Global Memory Size: 2817982464
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1500 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 5746011 microseconds
Current free memory: 2755256320
Upload time (4MB): 1342 microseconds ( 805 ms pinned)
Download time: 1150 microseconds ( 933 ms pinned)
Upload bandwidth: 3125 MB/sec (5210 MB/sec pinned)
Download bandwidth: 3647 MB/sec (4495 MB/sec pinned)
PGI Compiler Option: -ta=nvidia,cc20

CUDA Device Number: 1
Device Name: Tesla C2050
Device Revision Number: 2.0
Global Memory Size: 2817982464
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1500 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 5746011 microseconds
Current free memory: 2755256320
Upload time (4MB): 1503 microseconds ( 804 ms pinned)
Download time: 1211 microseconds ( 933 ms pinned)
Upload bandwidth: 2790 MB/sec (5216 MB/sec pinned)
Download bandwidth: 3463 MB/sec (4495 MB/sec pinned)
PGI Compiler Option: -ta=nvidia,cc20

Nope, I’m wrong. That’s a current driver.

What happens if you run a debug build in emulation mode (-Mcuda=emu)? Do you see the same problems?

If not, then I’ll need to see a reproducing example to determine what’s wrong. If it’s too big to post, please send a note to PGI Customer Service (trs@pgroup.com) and ask them to forward the example to me.

Thanks,
Mat

Hi Mat,

I get hundreds of such warnings in emulation mode, and the run finally fails.
The code does get past the point where it failed on the GPU, though.

Warning: Number of emulated threads (14) is less than available cpus (24)
Warning: Number of emulated threads (14) is less than available cpus (24)
Error: _mp_task_yield/_mp_task_sync does not work in this case
a region with one thread
a nested task
an immediate task

Thanks,
Amit

Warning: Number of emulated threads (14) is less than available cpus (24)

This means that the number of threads you're launching is very small (14), while the default is to use all cores. The solution is to set the environment variable OMP_NUM_THREADS=14 to limit the number of threads spawned.
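A minimal shell sketch of this, assuming a bash-like shell (the mpirun line is commented out because the executable name is only illustrative):

```shell
# Limit OpenMP to 14 threads so emulation mode stops warning that the
# number of emulated threads is less than the available CPUs.
export OMP_NUM_THREADS=14
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# Then rerun the -Mcuda=emu build, e.g.:
#   mpirun -np 1 ./a.out
```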

That raises the question: why does your program use so few threads?

  • Mat

I use OMP_NUM_THREADS=1 by default. I changed it to OMP_NUM_THREADS=14 and tried running the code in emulation mode, but I still hit the same issue: hundreds of warnings, and then the code errors out with the message "Error: _mp_task_yield/_mp_task_sync does not work in this case".

This is a small test case that I am using to make sure the new compiler runs without any issues, hence the small number of threads.

Let me repeat: this case runs without any problem with PGI version 10.6 on the GPU. This is a large Fortran code of roughly 200K lines.

~Amit

Let me repeat: this case runs without any problem with PGI version 10.6 on the GPU. This is a large Fortran code of roughly 200K lines.

I understand, but unfortunately, without a reproducing example, issues like these are very difficult to diagnose. I simply don't have enough information and can only make guesses.

  • Mat

I think I can send you the tarball with the executable and the test case.

~Amit

That would be great. If it’s too big to email, please FTP. See: https://www.pgroup.com/support/ftp_access.php

-Mat

Mat

I have uploaded the file (amitamritkar.tar) to the FTP site.

Thanks

OK, thanks. I've asked our webmaster to grab it for me, but it won't be until Monday that I'll be able to look at it.

  • Mat

Hi Amit,

I ran your executable on three different systems. The two Kepler systems, a GTX 690 and a K20, ran successfully:

 entering krylov in xmom
 entering krylov in xmom
 entering krylov in xmom
 entering krylov in xmom
 entering exchange_var in x-momemtum
 after momentum,u,v,w
 entering pressure
 exiting pressure
Warning: ieee_inexact is signaling
FORTRAN STOP

Though I see the failure on a Fermi M2090:

% mpirun -np 1 ./gpu.x
 entering alloc_fieldvar
 entering gpu alloc_
 entering gpu_data_transfer
 exiting gpu_data_transfer
 entering momentum
 entering diff_coeff
 come out of diff_coeff
 entering xmom
 enering diffuse in xmom
0: ALLOCATE: 1823360 bytes requested; not enough memory: 4(unspecified launch failure)

So at least we know it's not a driver issue; it has to do with the compute capability of the binary.

What flags did you use to compile the code?
What happens if you compile using CUDA 5.0 (i.e., -Mcuda=5.0)? CUDA 4.1 (i.e., -Mcuda=4.1)?

I’ll ask around to see if anyone else has ideas.

  • Mat

Hi Mat,

What flags did you use to compile the code?
What happens if you compile using CUDA 5.0 (i.e., -Mcuda=5.0)? CUDA 4.1 (i.e., -Mcuda=4.1)?

My compiler flags were: -fast -Mr8 -Mcuda -Minfo=all -ta=nvidia,cc20
When I used the flag -Mcuda=5.0, there was no error and the code ran correctly.

So I just have to explicitly specify -Mcuda=5.0 for the Tesla cards.
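For anyone searching later, a sketch of the full working compile line (the source and output file names are hypothetical; the flags are the ones listed above plus the explicit CUDA version):

```
pgfortran -fast -Mr8 -Mcuda=5.0 -Minfo=all -ta=nvidia,cc20 -o gpu.x solver.f90
```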

Thanks for your help.
~Amit