CUDA memory issues

I’m running into the following error when trying to compile my CUDA Fortran code and I’m looking for some advice.

[leiderml@ebwilson-mpi ~]$ pgfortran -Mcuda ibe-25Cuda.f
ptxas error : Entry function ‘case8’ uses too much local data (0xbdec bytes, 0x4000 max)
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (ibe-25Cuda.f: 611)
PGF90/x86-64 Linux 11.5-0: compilation aborted

So basically it looks like I’m WAY over the local memory limit. The module which I’m sending to the GPU has around 500 lines of code between all the functions, so I’m thinking the arrays are what’s causing the problem. Although I think this GPU should have enough memory to handle this.

My GPU is:

Device Name: Tesla C2050
Device Revision Number: 2.0
Global Memory Size: 2817982464
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152

The declaration for case 8 is this: (and I’ll have some explanation afterwards on how I’m using it)

      attributes(global) subroutine case8(m1n1SumAry,nsize,iStart,
     &      factrf,fact,m,n,p,q,s,a,b,c,d,ip2,jp2,kp2,lp2)
      double precision, dimension(:) :: m1n1SumAry
      double precision, dimension(0:500) :: factrf
      double precision, dimension(0:170) :: fact
      integer, value :: nsize,m,n,p,q,s,ip2,
     &                  jp2,kp2,lp2,iStart

      integer m1,n1,p1,m1mn1,p1min
      double precision p1term,p1sum,n1term,n1sum,threej,a,b,c,d

Now factrf and fact are arrays of constant doubles which I’m sending to the GPU. I could compute them on the GPU, but they’re still going to use the same space unless I tremendously slow down the code and recompute every factorial in that array every time it is needed.

m1n1SumAry is the variable-size array, sized to whatever number of threads I pass in, that returns the values. Perhaps hard-coding the size would help with memory constraints?



So basically my questions come down to:
-Is 500 lines of code between all the functions in my module too much? Or how much can I really fit in a module between code and variables?

-Is the number of variables a problem? Or is it just the size of all the arrays combined with the size of the code? Because after the compiler strips out all the comments and shrinks everything down to machine code I’d think this would fit on the GPU no problem.

*edited, but nobody else has commented yet

Ok I sort of figured out how to share arrays, but there aren’t many good examples in the guide so here’s what I’m trying:

      module cudaCase8
      contains
c     double precision, dimension(0:500) :: factrf
c     double precision, dimension(0:170) :: fact
c     attributes(shared) :: factrf, fact

      attributes(global) subroutine case8(m1n1SumAry,nsize,iStart,
     &      factrf,fact,m,n,p,q,s,a,b,c,d,ip2,jp2,kp2,lp2)
      double precision, dimension(:) :: m1n1SumAry
      double precision, dimension(0:500) :: factrf
      double precision, dimension(0:170) :: fact
      attributes(shared) :: factrf, fact

If I declare the shared arrays at the module level (by uncommenting the declarations before the subroutine and commenting out the ones inside the subroutine), I get a severe incorrect-sequence error on the actual subroutine header where I pass the arrays in. If I instead try to declare them shared inside the subroutine (like above), it tells me that the shared attribute is ignored because it’s a dummy argument.

PGF90-S-0070-Incorrect sequence of statements (ibe-25Cuda.f: 8)
0 inform, 0 warnings, 1 severes, 0 fatal for case8
[leiderml@ebwilson-mpi ~]$ pgfortran -Mcuda ibe-25Cuda.f
PGF90-S-0070-Incorrect sequence of statements (ibe-25Cuda.f: 7)
0 inform, 0 warnings, 1 severes, 0 fatal for case8

and in the second case the error is:

PGF90-W-0526-SHARED attribute ignored on dummy argument factrf (ibe-25Cuda.f: 12)
PGF90-W-0526-SHARED attribute ignored on dummy argument fact (ibe-25Cuda.f: 12)

Could someone write or direct me towards a short bit of example code showing how to declare a shared array and populate it with an array passed into the GPU?

Hi vibrantcascade,


Ok I just figured out how to share arrays so I’ll be doing that. But I’m still wondering about overall memory usage and approximate code size constraints.

Sounds good. Another way to work around this limit is to move your local arrays to global memory (i.e. make them module variables) and then add an extra dimension or dimensions to the arrays to privatize them (i.e. each thread has its own column).

Secondly, you can reduce the number of threads in your thread block until they all fit.
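For instance, here’s a rough sketch of the first idea (illustrative names only, not your actual code): the work array lives in device memory at module scope and gets one column per thread, and the launch uses a small thread block.

module privatize_demo
   double precision, device, allocatable :: work(:,:)  ! global (device) memory
contains
   attributes(global) subroutine kern(n)
      integer, value :: n
      integer :: tid, k
      tid = (blockIdx%x-1)*blockDim%x + threadIdx%x
      do k = 1, n
         work(k,tid) = dble(k)      ! each thread touches only its own column
      enddo
   end subroutine kern
end module privatize_demo

On the host you’d allocate(work(n,ntotal)) before launching, and a launch like call kern<<<ntotal/32, 32>>>(n) keeps the thread block small as well.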

-Is 500 lines of code between all the functions in my module too much? Or how much can I really fit in a module between code and variables?

The number of lines of code doesn’t matter from a resource constraint point of view (I’ve seen a 3600 line kernel before). What does matter is how much memory and registers each kernel uses. So if the 500 lines reuse the same variables, then this can be a big gain. Though, if you use a lot of variables, this will reduce the number of threads that can be used in a block.

Are those constant arrays being fully copied every time I pass them to another function on the GPU?

No. Constant arrays would be passed by reference.

If they are being copied, is there a way to make them global to the whole module like you do with common blocks in normal Fortran so I can save space? Or would I be better off recomputing every factorial I need on the fly?

Constant arrays (i.e. those declared with the constant attribute) can only be declared as a module variable or within host code and therefore already global. Did you mean local device arrays instead of constant?

- Mat

Hi Mat,

When I say constant arrays I mean the values are constant. However, I’m generating the values on the CPU before I send the array off to be used on the device. (So yes, I mean local device arrays. But I’m populating those arrays with an array from the program calling the device.)

When you say:

Constant arrays (i.e. those declared with the constant attribute) can only be declared as a module variable or within host code and therefore already global. Did you mean local device arrays instead of constant?

The “within host code and therefore already global” part, are you saying the device can already see program level variables and arrays within the host code made global by something like the use of “common”?

I’m thinking that you mean something like what I have below in pseudocode, which is essentially what I’m trying to do. Only I’m now getting the error “PGF90-S-0520-Host MODULE data cannot be used in a DEVICE or GLOBAL subprogram - facts1” every time I try to use a module level variable like facts1.

In this case, would the array I’m passing into the device for facts1 in the Start subroutine automatically populate the facts1 module level array? (the dimensions are identical of course) Do I need to do something special to use that module level array in other functions being called into? Because that error message makes it sound like it’s impossible.


module cudaModule
double precision, dimension(0:500) :: facts1
attributes(shared) :: facts1
contains

attributes(global) subroutine Start(SumAry,facts1,a,b,c,d)
double precision, dimension(:) :: SumAry
integer, value :: a,b,c,d

…code…

end

double precision attributes(device) function func1(b)
integer b,i
do i = 0, 500
b = b + facts1(i) * i
enddo
func1 = b
return
end
end module cudaModule



Thanks for the help!
Morgan

Hi Morgan,

In your first example, the “facts1” was local, hence every thread had its own copy of the array. Given the size, this wastes a lot of memory. In the second example, a “shared” array means that it is shared across all threads within the same thread block. It is not global.

In the final example, you are closer but need to change “shared” to “constant”. A constant array is visible to all threads and is stored in a separate, and fast access, memory. However, constant memory is read-only from the device, while it’s read/write from the host. So, slightly modifying your example:

module cudaModule
double precision, dimension(0:500) :: facts1
attributes(constant) :: facts1
contains

attributes(global) subroutine Start(SumAry,a,b,c,d)  ! no need to pass facts1
double precision, dimension(:) :: SumAry
integer, value :: a,b,c,d

....code.....

end

subroutine init_facts1()
   facts1 = 1.1  ! Init facts1 from the host
end subroutine init_facts1

double precision attributes(device) function func1(a,b)
integer a,b,i
do i = 0, 500
b = facts1(i) * a   ! facts1 is read-only
enddo
func1 = b
return
end

end module cudaModule



Could someone write or direct me towards a short bit of example code showing how to declare a shared array and populate it with an array passed into the GPU?

Take a look at the “sgemm.cuf” example file that ships with the compilers. It will be located in the “etc/samples” directory under your PGI compiler installation tree. (For example, on my system it’s in /opt/pgi/linux86-64/11.10/etc/samples.)
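And as a quick illustration of the pattern you asked about (a minimal sketch, not taken from sgemm.cuf; it assumes the subroutine sits inside your module’s contains section):

attributes(global) subroutine fill_shared(devArr, n)
   double precision, dimension(*) :: devArr    ! device (global) memory passed in from the host
   integer, value :: n
   double precision, shared :: sArr(0:500)     ! one copy per thread block
   integer :: k
   ! the threads cooperate: each copies a strided slice of devArr into shared memory
   do k = threadIdx%x - 1, min(n,501) - 1, blockDim%x
      sArr(k) = devArr(k+1)
   enddo
   call syncthreads()                          ! wait until the whole block has finished the copy
   ! every thread in the block can now read sArr
end subroutine fill_shared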

Hope this helps,
Mat

Thanks Mat! I see the memory usage shrinking already. (still a ways to go though)

If I wanted a and b to be shared because they’ll always be the same within the threadblock would I essentially do this then?

Although now that I think about it, there’s no way this would work at the block level as the device wouldn’t know what block to assign the values to when called from the host. Do I just pass them in with the initial call and then I don’t have to declare the shared a and b inside of subroutine Start?

module cudaModule
double precision, dimension(0:500) :: facts1
attributes(constant) :: facts1
integer, value :: a,b
attributes(shared) :: a,b
contains

attributes(global) subroutine Start(SumAry,c,d) ! no need to pass facts1, a, b
double precision, dimension(:) :: SumAry
integer, value :: c,d

…code…

end

subroutine init_facts1()
double precision, dimension(0:500) :: factsHost
common /facts/ factsHost
facts1 = factsHost ! Init facts1 from the host
end subroutine init_facts1

subroutine init_ints(aHost,bHost)
integer aHost,bHost
a = aHost
b = bHost
end subroutine init_ints

double precision attributes(device) function func1(a,b)
integer a,b,i
do i = 0, 500
b = facts1(i) * a ! facts1 is read-only
enddo
func1 = b
return
end

end module cudaModule

Thanks!
Morgan

So this rather than the last example?

module cudaModule
double precision, dimension(0:500) :: facts1
attributes(constant) :: facts1
integer, value :: a,b
attributes(shared) :: a,b
contains

attributes(global) subroutine Start(SumAry,a,b,c,d) ! no need to pass facts1
double precision, dimension(:) :: SumAry
integer, value :: c,d

…code…

end

subroutine init_facts1()
double precision, dimension(0:500) :: factsHost
common /facts/ factsHost
facts1 = factsHost ! Init facts1 from the host
end subroutine init_facts1

double precision attributes(device) function func1()
integer a,b,i
do i = 0, 500
b = facts1(i) * a ! facts1 is read-only
enddo
func1 = b
return
end

end module cudaModule

Hi Morgan,

The shared attribute can only be applied to local variables. Module variables can only be “device” (i.e stored in device memory, accessible by all threads, read and writable) or “constant” (i.e. stored in constant memory, accessible by all threads, read-only on the device).
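In other words, roughly (an illustrative summary only, not your code):

module scope_demo
   double precision, device   :: dglob(0:500)   ! device memory: global, read/write on the device
   double precision, constant :: cglob(0:500)   ! constant memory: global, read-only on the device
contains
   attributes(global) subroutine kern(x)
      double precision, dimension(:) :: x
      double precision, shared :: stile(0:31)   ! shared: local to a kernel, one copy per thread block
      integer :: t
      t = threadIdx%x                           ! this sketch assumes a single 32-thread block
      stile(t-1) = cglob(t-1) + dglob(t-1)
      call syncthreads()
      x(t) = stile(t-1)
   end subroutine kern
end module scope_demo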

If I wanted a and b to be shared because they’ll always be the same within the threadblock would I essentially do this then?

I guess I’m not clear what you’re trying to do here or why you want a and b to be shared.

Since the value of b does change, having it shared could cause coherency problems if it’s not properly guarded. Having it as a device variable would be even worse. It seems to me that passing b by value is your best option.

Since a doesn’t change, putting it in constant memory would be best.

Can you post a more complete example of what you’re trying to do? That might help me give you better advice.

- Mat

Hi Mat,

Yeah, I suppose my example is bad in that case. My actual code is around 5,000 lines of host code which I’ve updated from Fortran 77 standards to get it working with OpenMP, while teaching myself Fortran along the way. (Unfortunately, they didn’t teach Fortran anymore when I was getting my comp sci degree.)

Now I’m converting the slowest section into CUDA as a test. In my actual code I have 10 variables which are basically constant for all of the threads in a block. There are about 5 device functions in the device code which use the variables, so if I pass them explicitly it’s eating up (10 variables x 5 functions) or 50 variables’ worth of space per thread, from what I can tell. Now multiply that by either the 32 threads I’m using in a thread block, or the 448 overall threads that I’m making in total, and you suddenly have thousands of copies of the same variables which don’t change between the threads in a block. Which is probably a good portion of my memory problems.

(Perhaps I’m a little off on how this works, but from what I’ve read and what you said, it appears that there’s memory set aside for every variable passed in every function on the device, per thread in a thread block. So if I keep my thread blocks down to the minimum size of 32 and just make more blocks, it should help. Which would make sense, as that’s how most compilers operate.)

If I made the host code that can call into the GPU multi-threaded so I had 2 different host threads spawning thread blocks, I would get thread blocks with different values for those 10 variables. My arrays that I asked about earlier would be constant for all cases even between thread blocks, but those 10 variables wouldn’t be.

So I guess it comes down to this.

-If I’m running 2 threads in my host code (OpenMP), and both happen to be spawning CUDA thread blocks at the same time, where all the threads within a block have the same values for those 10 variables but the 2nd host thread is spawning blocks for the same call with different values for those 10 variables, and I make those variables constant, will I have a problem? Or can the GPU tell that the constant variables came from a different host thread and handle that seamlessly?

It seemed like, because shared variables are shared within a thread block, I might want to use those in this case to reduce memory overhead, as I’m still way over.

Thanks!
Morgan

(And sorry for the long posts. There’s a lot that the couple of examples you guys have don’t cover, and things are kind of vague in the programmers guide at times.)

Here’s basically what I have now in pseudocode, only tremendously simplified with a few layers removed.

module cudaModule
double precision, dimension(0:500) :: facts1
attributes(constant) :: facts1

contains

attributes(global) subroutine Start(SumAry,a,b,c,d,e,f,g,h,i,j) ! no need to pass facts1
double precision, dimension(:) :: SumAry
double precision, value :: a,b,c,d,e,f,g,h,i,j
integer :: i1

i1 = (blockIdx%x-1)*blockDim%x + threadIdx%x
SumAry(i1)=func1(a,b,c,d,e,f,g,h,i,j,other vars)

end

subroutine init_facts1()
double precision, dimension(0:500) :: factsHost
common /facts/ factsHost
facts1 = factsHost ! Init facts1 from the host
end subroutine init_facts1

double precision attributes(device) function func1(a,b,c,d,e,f,g,h,i,j)
double precision tot,a,b,c,d,e,f,g,h,i,j
integer k
do k = 0, 500
tot=func2(a,b,c,d,e,f,g,h,i,j,other vars)
enddo
…calculations…
func1 = tot
return
end

double precision attributes(device) function func2(a,b,c,d,e,f,g,h,i,j)
double precision tot,a,b,c,d,e,f,g,h,i,j,tot1,tot2,tot3,…
…code which uses facts1…
tot1=func3(a,b,c,d,e,f,g,h,i,j)
tot2=func3(a,b,c,d,g,h,i,j,f,e)
tot3=func3(a,b,c,d,h,i,e,j,f,g)
…more permutations…
func2 = tot1 + tot2 + tot3 + …
end

double precision attributes(device) function func3(a,b,c,d,e,f,g,h,i,j)
double precision a,b,c,d,e,f,g,h,i,j
…calculations and variables… uses facts1
end

end module cudaModule

So basically in this example the values of a,b,c,d,e,f,g,h,i,j are constant in "Start, func1, and func2", and in func3 they have to be passed by value due to the different permutations. In my actual code there are more layers of passing currently. But basically I want to save the memory overhead of having memory set aside for every variable in "Start, func1, and func2" times the number of threads. Perhaps I should be setting them to constant and then doing a call on the host to a module subroutine to set them all before I generate the thread blocks?
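Something like this sketch of what I’m imagining, maybe (names are just for illustration)?

module cudaModule
double precision, dimension(0:500) :: facts1
attributes(constant) :: facts1
double precision :: a,b,c,d,e,f,g,h,i,j          ! the 10 block-invariant values
attributes(constant) :: a,b,c,d,e,f,g,h,i,j
contains

subroutine set_consts(aH,bH,cH,dH,eH,fH,gH,hH,iH,jH)  ! host code, called before the kernel launch
double precision aH,bH,cH,dH,eH,fH,gH,hH,iH,jH
a = aH; b = bH; c = cH; d = dH; e = eH
f = fH; g = gH; h = hH; i = iH; j = jH
end subroutine set_consts

attributes(global) subroutine Start(SumAry)
double precision, dimension(:) :: SumAry
integer :: i1
i1 = (blockIdx%x-1)*blockDim%x + threadIdx%x
SumAry(i1) = (a+b+c+d+e)*facts1(0) + (f+g+h+i+j)      ! every thread reads the same constants
end subroutine Start

end module cudaModule

Then on the host I’d call set_consts(...) and then call Start<<<nblocks,32>>>(SumAry_d) for each new set of values.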

Hi Morgan,

You might be interested in a few of our PGinsider articles. In particular, Michael Wolfe’s article Understanding the CUDA Data Parallel Threading Model and mine on Multi-GPU programming with CUDA Fortran.

What you’re currently trying to optimize is the “occupancy” of your program. As you know, each streaming multiprocessor has a finite amount of shared memory and registers. These memories are divided up amongst the active threads from a block or blocks. The more memory each thread consumes, the fewer threads can be running at any given time. The percentage of active threads versus the maximum potential number of threads is the occupancy.

To calculate the occupancy, use the information provided by the “-Mcuda=ptxinfo” flag as inputs to NVIDIA’s Occupancy Calculator spreadsheet (http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls). This will help answer the question “given that I use N registers, how many threads should I use in my thread block?”.
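As a rough worked example (round numbers only): a CC 2.0 multiprocessor has 32768 registers and can hold at most 1536 threads (48 warps). A kernel that needed, say, 40 registers per thread would be register-limited to about 32768/40 ≈ 819 threads, or 25 whole warps, so its best-case occupancy is roughly 25/48 ≈ 52% no matter how you size the thread block.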

Note that higher occupancy does not guarantee higher performance. So while it’s a worthwhile effort to try and reduce the number of registers used, it may or may not be beneficial. You can also try using the flag “-Mcuda=maxregcount:n” to limit the number of registers allowed per thread.


It sounds like you have both OpenMP threads trying to use the same GPU. While a single host thread/process can have multiple contexts (i.e. use multiple GPUs), a single GPU cannot have multiple contexts (i.e. multiple host threads can’t share a single GPU). While it may “work” sometimes, it is not supported by NVIDIA and may arbitrarily fail.

Newer cards are capable of running multiple kernels, but this is for asynchronous kernels on separate streams. What happens is that the second kernel doesn’t start running till the first starts finishing. As the first stops using a multi-processor, then the next one can begin using it. The two kernels never share a multi-processor and hence you don’t need to worry about local variables contending for registers.
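If you do end up using streams from a single host thread, it would look roughly like this (a sketch only; the kernel and sizes are made up, and you should check the stream declaration against the CUDA Fortran reference for your compiler version):

module stream_demo
contains
   attributes(global) subroutine kern(x, n)
      double precision, dimension(*) :: x
      integer, value :: n
      integer :: i
      i = (blockIdx%x-1)*blockDim%x + threadIdx%x
      if (i <= n) x(i) = x(i) + 1.0d0
   end subroutine kern
end module stream_demo

program two_streams
   use cudafor
   use stream_demo
   integer, parameter :: n = 1024
   double precision, device :: x1_d(n), x2_d(n)
   integer(kind=cuda_stream_kind) :: s1, s2
   integer :: istat
   x1_d = 0.0d0
   x2_d = 0.0d0
   istat = cudaStreamCreate(s1)
   istat = cudaStreamCreate(s2)
   ! the two kernels may overlap on the device, but never share a multiprocessor
   call kern<<<n/32, 32, 0, s1>>>(x1_d, n)
   call kern<<<n/32, 32, 0, s2>>>(x2_d, n)
   istat = cudaDeviceSynchronize()   ! wait for both streams to finish
end program two_streams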

Hope this helps,
Mat

Thanks Mat!

That actually answered a lot of my questions and got me going again for the time being!

Morgan

Ok, a question on the occupancy calculator.

After compiling with -Mcuda=ptxinfo I received:

123 registers
48896 + 0 bytes lmem
24 + 16 bytes smem
5456 bytes cmem[0]
32 bytes cmem[1]

(I’m running a tesla C2050 which is compute 2.0)

The occupancy calculator only seems to care about registers, block size, and shared memory size.

For shared memory I added up 48896 + 24 + 16 = 48936 bytes
Then added that into the calculator with the 123 registers and 32 threads per block.

According to the calculator I should just barely be able to run 1 warp per multi-processor due to shared memory constraints. Yet I keep getting the error:

ptxas error : Entry function ‘case8’ uses too much local data (0xbf00 bytes, 0x4000 max)
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (ibe-25CudaC.f: 643)

(line 643 is the last line of the module containing the CUDA functions)

Do I need to add in the cmem or something? Technically my code should be able to run with up to 256 threads per block if the calculator is correct. Or is there something else I’m overlooking?


Thanks,
Morgan

Hi Morgan,

My guess is that the issue occurs when creating the compute capability 1.3 version of your code, which only has 16k bytes of local memory per thread. By default, the compiler targets both CC 1.3 and CC 2.0. Instead, try targeting just CC 2.0. If using an 11.x compiler, I’d also use CUDA 4.0, i.e. “-Mcuda=cc20,4.0”.
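For example, with the file from your last post that would just be: pgfortran -Mcuda=cc20,4.0 ibe-25CudaC.f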

Hope this helps,
Mat