large input value sets

vibrantcascade · February 7, 2013, 2:44am

I’m currently running CUDA Fortran, and in the worst case my code needs to generate 6 integers as inputs for every CUDA thread.

Currently I have 6 arrays of integers with 2048 integers per array. Before each kernel launch I call a setup subroutine in the kernel module to set 6 constant arrays of 2048 integers each equal to the input arrays. (I believe this loads them into high speed read only texture memory if I remember correctly.) I then launch the kernel with a 2048-element array of doubles to receive the results. Then I generate the next set of 2048 input values and repeat.

The GPU only takes about 2 to 8 seconds to complete the 2048 threads, so it is constantly waiting on I/O and wasting a lot of time. I’d like to pass, say, 10,000+ threads at a time to get better performance, as these calculations run for weeks overall. However, 6 arrays of 2048 integers each appear to use up all of the 48 KB or so of read-only memory, and I get insufficient memory errors if I increase the number of values in the 6 input arrays much past this.

So is there a way to use the 3+ gigs of main videocard memory to load up more values or stream in groups of new values when the old values finish to save myself the 200ms or so of IO time I’m wasting every few seconds and let the GPU churn away longer?

I have access to both a Fermi C2050 GPU and a GTX 680 (GK104), if it matters.

I’m assuming that come July, when I get my hands on a GK110, which can have kernels that call kernels, I’ll be able to fix this by simply making the main GPU calling loop into another kernel. But I’m wondering if I can do anything for older GPUs.

Thanks!
Morgan

MatColgrove · February 7, 2013, 4:17pm

> (I believe this loads them into high speed read only texture memory if I remember correctly.)

Close, it’s actually put in constant memory, not texture.

> So is there a way to use the 3+ gigs of main videocard memory to load up more values or stream in groups of new values when the old values finish to save myself the 200ms or so of IO time I’m wasting every few seconds and let the GPU churn away longer?

Sure, there’s nothing in CUDA Fortran that limits you in terms of memory size. You may need to add the flag “-Mlarge_arrays” if an individual array is over 2GB. The limiting factor will be your card’s memory. To see the limits of your cards, use the utility “pgaccelinfo”.

Note that while the constant memory size could vary between devices, it’s typically only 64 KB. So you’d need to move your constant arrays over to global memory.

Mat
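
[Editor's sketch] As a programmatic complement to running "pgaccelinfo", the same limits can be read at run time through the CUDA Fortran runtime API. This is a minimal sketch, assuming device 0 is the card of interest; the program name queryLimits is hypothetical and not from the thread.

```fortran
! Sketch only: free-form CUDA Fortran (e.g. a .cuf file).
! Assumes device 0; the program name is hypothetical.
program queryLimits
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat

  ! Fill prop with the properties of device 0
  istat = cudaGetDeviceProperties(prop, 0)
  if (istat /= cudaSuccess) then
    print *, 'cudaGetDeviceProperties failed: ', &
             trim(cudaGetErrorString(istat))
    stop
  end if

  ! Constant memory is the small budget the six input arrays exhausted;
  ! global memory is the multi-gigabyte pool suggested as the alternative.
  print *, 'Device:               ', trim(prop%name)
  print *, 'Constant memory (KB): ', prop%totalConstMem / 1024
  print *, 'Global memory (MB):   ', prop%totalGlobalMem / (1024 * 1024)
end program queryLimits
```

Comparing the reported constant memory size against 6 arrays x 2048 integers x 4 bytes (48 KB), plus any other constant data in the module, shows how little headroom is left for larger input sets.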

vibrantcascade · February 8, 2013, 4:57pm

When I try to remove the “attributes(constant)” flag I start getting this message:

PGF90-S-0520-Host MODULE data cannot be used in a DEVICE or GLOBAL subprogram - m1aryd (i4six6oddCuda.f: 31)

If I’m not using constant data, can I only pass data to the kernel as arguments to the global subroutine I call to launch the threads on the GPU? I had assumed that once I removed the constant attribute, the array data would simply be loaded into the card’s larger 3 GB of memory, instead of the 64 KB of constant memory, when I made the call and passed the kernel off to the GPU.

I was doing something like this before when using constants:

      module i4six6oddcuda
c making variables local to module
      double precision, dimension(0:500) :: factrfD
      double precision, dimension(0:170) :: factD
      integer, dimension(1:2048) :: m1AryD
      integer, dimension(1:2048) :: n1AryD
      integer, dimension(1:2048) :: p1AryD
      integer, dimension(1:2048) :: q1AryD
      integer, dimension(1:2048) :: s1AryD
      integer, dimension(1:2048) :: t1AryD
      attributes(constant) :: factrfD,factD,m1AryD,n1AryD,p1AryD
      attributes(constant) :: q1AryD,s1AryD,t1AryD
      contains

      subroutine setMNPQSTarrays(m1Ary,n1Ary,p1Ary,q1Ary,s1Ary,t1Ary)
      integer, dimension(1:2048) :: m1Ary
      integer, dimension(1:2048) :: n1Ary
      integer, dimension(1:2048) :: p1Ary
      integer, dimension(1:2048) :: q1Ary
      integer, dimension(1:2048) :: s1Ary
      integer, dimension(1:2048) :: t1Ary
      m1AryD = m1Ary
      n1AryD = n1Ary
      p1AryD = p1Ary
      q1AryD = q1Ary
      s1AryD = s1Ary
      t1AryD = t1Ary
      end subroutine setMNPQSTarrays

MatColgrove · February 8, 2013, 5:32pm

You probably forgot to add the “device” attribute. Without it, the module data is a host side variable and can’t be used on the device.

Mat
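
[Editor's sketch] To illustrate the fix described above, here is a hedged sketch of how the module might look with the six input arrays declared with the "device" attribute, so they live in global memory rather than constant memory. It is written in free-form source rather than the thread's fixed form. The size NVALS and the kernel demoKernel (including its placeholder body) are assumptions made for illustration; they are not from the original code, and the factorial tables are omitted for brevity.

```fortran
! Sketch only: free-form CUDA Fortran (e.g. a .cuf file).
! NVALS and demoKernel are hypothetical, not from the thread.
module i4six6oddcuda
  use cudafor
  implicit none
  integer, parameter :: NVALS = 16384   ! assumed size, beyond the old 2048

  ! "device" places these arrays in GPU global memory; without it they
  ! are host variables and cannot be referenced in device code.
  integer, device, dimension(NVALS) :: m1AryD, n1AryD, p1AryD
  integer, device, dimension(NVALS) :: q1AryD, s1AryD, t1AryD
contains

  ! Host routine: each assignment is a host-to-device copy.
  subroutine setMNPQSTarrays(m1Ary, n1Ary, p1Ary, q1Ary, s1Ary, t1Ary)
    integer, dimension(NVALS), intent(in) :: m1Ary, n1Ary, p1Ary
    integer, dimension(NVALS), intent(in) :: q1Ary, s1Ary, t1Ary
    m1AryD = m1Ary
    n1AryD = n1Ary
    p1AryD = p1Ary
    q1AryD = q1Ary
    s1AryD = s1Ary
    t1AryD = t1Ary
  end subroutine setMNPQSTarrays

  ! Hypothetical kernel: one thread per input set, reading the module's
  ! device arrays directly. The sum is a placeholder for the real work.
  attributes(global) subroutine demoKernel(results, n)
    integer, value :: n
    double precision :: results(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
      results(i) = dble(m1AryD(i) + n1AryD(i) + p1AryD(i) &
                      + q1AryD(i) + s1AryD(i) + t1AryD(i))
    end if
  end subroutine demoKernel
end module i4six6oddcuda
```

On the host, the call sequence would stay much as before: fill the six host arrays, call setMNPQSTarrays, then launch with something like `call demoKernel<<<(NVALS + 255) / 256, 256>>>(resultsD, NVALS)`, where resultsD is an assumed device array of doubles.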