New compiler gives error

Hi,

I have been using the PGI Fortran compiler, version 10.6, to run my linear solver on Tesla GPUs. Recently, I installed the trial version of the latest compiler, 13.2, and now the very same code gives "not enough memory" errors on problems that ran fine with 10.6.
The total memory used by the code when run on the CPU alone is about 300 MB, so I fail to see why the Tesla C2050 GPU would run out of memory.

0: ALLOCATE: 2299968 bytes requested; not enough memory: 30(unknown error)

0: ALLOCATE: 1823360 bytes requested; not enough memory: 4(unspecified launch failure)

I perform some operations on the GPU before these errors occur; the declaration that triggers them is:

      real,device,dimension(ni,nj,nk,nb)::d_diff,d_phi

If I make the arrays allocatable, only the "not enough memory" part of the error goes away; the code still exits at the allocation statement.
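For reference, a minimal sketch of the allocatable variant with an explicit status check (the array names and shape follow the declaration above; the stat-based error handling is my assumption, not code from the original program):

```fortran
! Allocatable device arrays with an explicit status check, so an
! allocation failure can be reported instead of aborting the run.
real, device, allocatable, dimension(:,:,:,:) :: d_diff, d_phi
integer :: istat

allocate(d_diff(ni,nj,nk,nb), d_phi(ni,nj,nk,nb), stat=istat)
if (istat /= 0) then
   write(*,*) 'device allocation failed, stat = ', istat
   stop
end if
```

Note that a failure reported at an allocate statement can also be a deferred error from an earlier kernel launch, so the stat value may reflect a problem upstream of the allocation itself.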

Is there something wrong that I am doing here, or is it an issue with the latest compiler?

Thanks,
Amit

Hi Amit,

The newer PGI CUDA Fortran versions use newer CUDA toolkits (4.2, 5.0), which also need newer CUDA driver versions installed. My best guess is that you simply need to update your driver.

What is the output from the 'pgaccelinfo' utility? It will tell us which driver version you have installed.

  • Mat

Hi Mat, the output from the pgaccelinfo utility is as follows:

CUDA Driver Version: 5000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012

CUDA Device Number: 0
Device Name: Tesla C2050
Device Revision Number: 2.0
Global Memory Size: 2817982464
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1500 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 5746011 microseconds
Current free memory: 2755256320
Upload time (4MB): 1342 microseconds ( 805 ms pinned)
Download time: 1150 microseconds ( 933 ms pinned)
Upload bandwidth: 3125 MB/sec (5210 MB/sec pinned)
Download bandwidth: 3647 MB/sec (4495 MB/sec pinned)
PGI Compiler Option: -ta=nvidia,cc20

CUDA Device Number: 1
Device Name: Tesla C2050
Device Revision Number: 2.0
Global Memory Size: 2817982464
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1500 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 5746011 microseconds
Current free memory: 2755256320
Upload time (4MB): 1503 microseconds ( 804 ms pinned)
Download time: 1211 microseconds ( 933 ms pinned)
Upload bandwidth: 2790 MB/sec (5216 MB/sec pinned)
Download bandwidth: 3463 MB/sec (4495 MB/sec pinned)
PGI Compiler Option: -ta=nvidia,cc20

Nope, I’m wrong. That’s a current driver.

What happens if you run a debug build in emulation mode (-Mcuda=emu)? Do you see the same problems?

If not, then I’ll need to see a reproducing example to determine what’s wrong. If it’s too big to post, please send a note to PGI Customer Service (trs@pgroup.com) and ask them to forward the example to me.

Thanks,
Mat

Hi Mat,

I get hundreds of such warnings in emulation mode, and the run finally fails.
The code does get past the point where it failed on the GPU, though.

Warning: Number of emulated threads (14) is less than available cpus (24)
Warning: Number of emulated threads (14) is less than available cpus (24)
Error: _mp_task_yield/_mp_task_sync does not work in this case
a region with one thread
a nested task
an immediate task

Thanks,
Amit

Warning: Number of emulated threads (14) is less than available cpus (24)

This means that the number of threads you're launching is very small (14), while the default is to use all cores. The solution is to set the environment variable OMP_NUM_THREADS=14 to limit the number of threads spawned.
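A minimal shell sketch of this, assuming a bash-like shell (the mpirun line is commented out because the executable name is only illustrative):

```shell
# Limit OpenMP to 14 threads so emulation mode stops warning that the
# number of emulated threads is less than the available CPUs.
export OMP_NUM_THREADS=14
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# Then rerun the -Mcuda=emu build, e.g.:
#   mpirun -np 1 ./a.out
```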

That raises the question: why does your program use so few threads?

  • Mat

I use OMP_NUM_THREADS=1 by default. I changed it to OMP_NUM_THREADS=14 and tried running the code in emulation mode, but I still hit the same issue: hundreds of warnings, and then the code errors out with the message "Error: _mp_task_yield/_mp_task_sync does not work in this case".

This is a small test case that I am using to make sure the new compiler runs without any issues, hence the small number of threads.

Let me repeat: this case runs without any problem with PGI version 10.6 on the GPU. This is a large Fortran code of roughly 200K lines.

~Amit

Let me repeat: this case runs without any problem with PGI version 10.6 on the GPU. This is a large Fortran code of roughly 200K lines.

I understand, but unfortunately, without a reproducing example, issues like these are very difficult to diagnose. I simply don't have enough information and can only make guesses.

  • Mat

I think I can send you the tarball with the executable and the test case.

~Amit

That would be great. If it’s too big to email, please FTP. See: https://www.pgroup.com/support/ftp_access.php

-Mat

Mat

I have uploaded the file (amitamritkar.tar) to the FTP site.

Thanks

OK, thanks. I've asked our webmaster to grab it for me, but it won't be until Monday that I'll be able to look at it.

  • Mat

Hi Amit,

I ran your executable on three different systems. The two Kepler systems, a GTX 690 and a K20, ran successfully:

 entering krylov in xmom
 entering krylov in xmom
 entering krylov in xmom
 entering krylov in xmom
 entering exchange_var in x-momemtum
 after momentum,u,v,w
 entering pressure
 exiting pressure
Warning: ieee_inexact is signaling
FORTRAN STOP

Though I see the failure on a Fermi M2090:

% mpirun -np 1 ./gpu.x
 entering alloc_fieldvar
 entering gpu alloc_
 entering gpu_data_transfer
 exiting gpu_data_transfer
 entering momentum
 entering diff_coeff
 come out of diff_coeff
 entering xmom
 enering diffuse in xmom
0: ALLOCATE: 1823360 bytes requested; not enough memory: 4(unspecified launch failure)

So at least we know it's not a driver issue; it has to do with the compute capability of the binary.

What flags did you use to compile the code?
What happens if you compile using CUDA 5.0 (i.e., -Mcuda=5.0)? CUDA 4.1 (i.e., -Mcuda=4.1)?

I’ll ask around to see if anyone else has ideas.

  • Mat

Hi Mat,

What flags did you use to compile the code?
What happens if you compile using CUDA 5.0 (i.e., -Mcuda=5.0)? CUDA 4.1 (i.e., -Mcuda=4.1)?

My compiler flags were: -fast -Mr8 -Mcuda -Minfo=all -ta=nvidia,cc20
When I used the flag -Mcuda=5.0, there was no error and the code ran correctly.

So I just have to explicitly specify -Mcuda=5.0 for the Tesla cards.
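For anyone searching later, a sketch of the full working compile line (the source and output file names are hypothetical; the flags are the ones listed above plus the explicit CUDA version):

```
pgfortran -fast -Mr8 -Mcuda=5.0 -Minfo=all -ta=nvidia,cc20 -o gpu.x solver.f90
```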

Thanks for your help.
~Amit