out of memory

Hello,

I’m using PGI Accelerator with Fortran. Whenever I run the executable, I see that the program consumes system RAM. In some cases it exhausts all available memory, the program stops, and sometimes a message saying ‘out of memory’ is displayed. I thought the accelerated program was supposed to use the GPU’s memory rather than all of the system RAM. Any idea what is causing this?

Program info: The program works with several matrices over several iterations. Once results are received from the GPU, one of the matrices is updated with a new value and then all the matrices are sent back to the GPU for a new calculation. This process repeats many times.

Thank you for any help
BL

Hi BL,

My best guess is that you have a memory leak somewhere. What I’d do is compile without the Accelerator directives enabled, then run the program under Valgrind (http://www.valgrind.org). You could also have an uninitialized variable that is being used as the size of an array (either allocatable, automatic, or an implicit compiler-generated temporary array). Valgrind can help with this as well.
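For example, something like the following (a sketch, assuming the PGI command-line tools and a Linux machine with Valgrind installed; the exact flags are worth double-checking for your compiler version):

```shell
# Rebuild without targeting the accelerator, so the !$acc directives are
# ignored, then look for leaks and uninitialized reads under Valgrind.
pgfortran -g test.f90 -o test        # note: no -ta=nvidia
valgrind --leak-check=full --track-origins=yes ./test
```

The `--track-origins=yes` option is what helps track an uninitialized value back to where it was (not) set.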

  • Mat

Mat,

Thanks for the reply. I do not have a machine to use Valgrind on. Here is a small sample of what my code structure is like:

PROGRAM test
implicit none

real, allocatable :: A(:), B(:), C(:)   ! arrays

integer i, j, k

!--------------------------------------------------
allocate(A(10*10*10))
allocate(B(10*10*10))
allocate(C(10*10*10))
!--------------------------------------------------

A = 0.0; B = 1.0; C = 2.0
do k = 1, 5000
    write(*,*) 'step', k
!$acc region
    do i = 1, 1000
        do j = 1, 1000
            A(i) = A(i) + B(j) + C(j)
        enddo
    enddo
!$acc end region
    A = A/1000.0
enddo
write(*,*) 'press Enter key'
read(*,*)
stop
end

When I compile it, I get the following:

C:\Desktop>pgfortran test.f90 -ta=nvidia,time -Minfo
test:
     14, Memory zero idiom, loop replaced by call to __c_mzero4
     17, Generating copy(a(1:1000))
         Generating copyin(b(1:1000))
         Generating copyin(c(1:1000))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     18, Loop is parallelizable
     19, Complex loop carried dependence of 'a' prevents parallelization
         Loop carried dependence of 'a' prevents parallelization
         Loop carried backward dependence of 'a' prevents vectorization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         18, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             Using register for 'a'
         19, !$acc do seq(256)
             Cached references to size [256] block of 'b'
             Cached references to size [256] block of 'c'
             CC 1.0 : 9 registers; 2100 shared, 12 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 9 registers; 2100 shared, 12 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 23 registers; 2060 shared, 56 constant, 0 local memory bytes; 83% occupancy

The code I have posted is meaningless; the structure, however, is not. Basically I have a loop that calls the kernel several times. Whenever I run this on the CPU, the memory usage stays constant. Running on the GPU, however, Task Manager shows that the program uses an increasing amount of memory. If the number of calls to the kernel is sufficiently high, the program stops execution. Could the cause of this still be a memory leak? Thank you for the help.

Regards
BL

Hi BL,

I was unable to recreate your issue. On Windows the code ran without taskmgr showing any additional memory usage. On Linux, Valgrind showed no memory problems. Hence, it is unclear why you are getting this error.

What is the output from the ‘pgaccelinfo’ command? What compiler version are you using? What version of Windows? Also, please post the exact error you are getting.

Thanks,
Mat

Mat,

Thank you for verifying the code in Valgrind and on your machine. The system I use runs Windows Server 2008 with SP2. I am using PGI Workstation with Command Shells 11.1. Is this also the compiler version? If not, how can I check it? The pgaccelinfo command returns the following:

CUDA Driver Version:           3020

Device Number:                 0
Device Name:                   GeForce GTX 285
Device Revision Number:        1.3
Global Memory Size:            1046151168
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             256B
Clock Rate:                    1476 MHz
Current free memory:           1007550464
Upload time (4MB):                7 microseconds (   1 ms pinned)
Download time:                    3 microseconds (   2 ms pinned)
Upload bandwidth:              599186 MB/sec (4194304 MB/sec pinned)
Download bandwidth:            1398101 MB/sec (2097152 MB/sec pinned)

I could not get the error message to appear; usually the execution stops completely before it can finish. I have also tried running the executable on a GeForce 9600 GT card under Windows XP 64-bit SP2.
Thank you again for your help.

Regards
BL

Mat?

I have a program nearly identical to the one posted above (actually, I copied the code from your video tutorial). When m = 3000 it works; when m = 4000 it does not (and the monitor goes black for a few seconds before the program crashes). The output of GPU-Z shows the GPU load going to ~80–90% for a second or so, then back to 0%. I have underclocked the GPU as low as it can go. Similarly, I have a large serial code I am trying to parallelize and I get the same message as in the example below. That is actually what I am trying to do, but I thought it best to try to recreate the problem with a simpler piece of code.

!Test Program for OMP, Acc and Profiling
!A.Black 26/6/11

      program TestProg
        implicit none

        real(kind=4), allocatable :: a(:,:), b(:,:), c(:,:)
        integer :: i, j, k
        integer :: m            ! array dimension (was REAL; must be an integer)
        m = 4000

        allocate(a(m,m), b(m,m), c(m,m))
        b = 1.0; c = 1.0        ! initialize inputs so the kernel reads defined data
        write(*,*) m

!$acc region
        do j = 1, m
            do i = 1, m
                a(i,j) = 0.0
            enddo
            do k = 1, m
                do i = 1, m
                    a(i,j) = a(i,j) + b(i,k)*c(k,j)
                enddo
            enddo
        enddo
!$acc end region
        write(*,*) 'done'

      end program TestProg

The error I get is:
“call to cuMemAlloc returned error 2: Out of Memory
CUDA driver version: 4000”

The pgaccelinfo output is:
C:\Program Files\PGI\win64\11.3\bin>pgaccelinfo
CUDA Driver Version: 4000

Device Number: 0
Device Name: GeForce GTX 275
Device Revision Number: 1.3
Global Memory Size: 879034368
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 2147483647B
Texture Alignment: 256B
Clock Rate: 1404 MHz
Current free memory: 788955136

I updated the driver today, with no effect.
I am also using PGI Visual Fortran in the Windows environment (not the command line).
It’s a 64-bit application and the OS is Windows 7 64-bit.

Do you have any suggestions as to what may be going wrong? Whatever I do I get the out-of-memory error, and from using GPU-Z and back-of-the-envelope calculations of the array sizes I don’t think this should be the case.
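For what it’s worth, my back-of-the-envelope calculation looks like this (just a sketch; m and the free-memory figure are taken from the code and pgaccelinfo output above):

```fortran
! Rough device-memory estimate for the three m-by-m single-precision arrays.
program array_sizes
    implicit none
    integer(kind=8), parameter :: m = 4000
    integer(kind=8) :: total_bytes
    total_bytes = 3 * m * m * 4                       ! 3 arrays, 4 bytes/element
    write(*,*) 'arrays need (bytes):', total_bytes    ! 192,000,000 (~183 MB)
    write(*,*) 'device free (bytes):', 788955136_8    ! from pgaccelinfo above
end program array_sizes
```

So the three arrays alone should fit comfortably in the ~789 MB the card reports free.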

THANKS
Al

Hi Al,

I’m still not sure what’s wrong here. I’ve tried to recreate the problem here but it seems specific to these GTX devices running on Windows. I only have one GTX system but it runs Linux and has no problems running the code.

Are you able to run an equivalent CUDA C program? Do you have a monitor attached to the device?

  • Mat

Hi Mat,

Thanks for looking at this.
I’m only using the GTX 275, so yes, I’m using it for day-to-day Windows graphics as well, typically <100 MB used. I’m fairly sure I got further with this when I was using XP, when looking at the same problem about a year ago; at least I can’t remember this being an issue.

I do have another graphics card which I can install tomorrow for non-CUDA use… I see in the manual there is a way to specify which device is used, but I can’t work out how to set this in the Visual Studio environment. Is this possible?

For info, I asked about the same problem on the NVIDIA forum. Someone stated the brief black screen was almost certainly due to an out-of-bounds memory access, and also suggested using a dedicated card for graphics.
They also said the memory usage may be related to over-threading, but since I got further with this the last time I tried it, I don’t think this is my problem.

Thanks
Al
I’m on GMT so it’s bedtime for me

I see in the manual there is a way to specify which device is used, but I can’t work out how to set this in the Visual Studio environment. Is this possible?

Yes. See Chapter 13 of the PVF User’s Guide (PGI Documentation Archive for Versions Prior to 17.7).

You can also set the device number by calling the runtime routine “acc_set_device_num()”. Basic usage is covered in Chapter 10 of the PVF User’s Guide, and in more detail in the PGI Accelerator Model Reference Guide (http://www.pgroup.com/lit/whitepapers/pgi_accel_prog_model_1.3.pdf).
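A minimal sketch of the call (assuming the accel_lib module and the acc_device_nvidia constant from the PGI Accelerator model; check the exact names against your compiler version):

```fortran
program pick_device
    use accel_lib                        ! PGI accelerator runtime module
    implicit none
    ! Select a device before the first accelerator region runs.  The
    ! numbering convention used here is an assumption -- check it against
    ! the "Device Number" lines that pgaccelinfo prints.
    call acc_set_device_num(0, acc_device_nvidia)
    ! ... !$acc regions after this point should run on that device
end program pick_device
```

If I remember right there is also an ACC_DEVICE_NUM environment variable that selects the device without a code change, but please verify that in the User’s Guide.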

  • Mat

I’m not sure this is going to work.
I’ve installed a second (different) graphics card in the machine, with the intent to use it for all day-to-day graphics and the GTX 275 only for CUDA.

The 275 is in the first slot; a fairly standard (new) ~$100 ATI card is in the second slot. I changed the BIOS to use the PCIe x8 slot (as opposed to x16) as the initial graphics device. The machine is now definitely using the ATI card.

In GPU-Z the 275 is detected, but the sensors are blank and the CUDA/PhysX etc. tickboxes do not show CUDA enabled. Also, pgaccelinfo.exe says no CUDA devices were detected. I reinstalled the 275 drivers, with no effect.

Any idea what I’m doing wrong?
It’s probably exceptionally early in the morning where you are, so I will keep trying and post if I get a solution.

thanks
Al

Hi Al,

I sent a note to one of our IT guys for help. I guess I should get up to speed on hardware issues, but I thought it would be faster to just ask IT.

  • Mat