I’m currently running CUDA Fortran, and in the worst case my code needs to generate 6 integers as inputs for every CUDA thread.
Currently I have 6 arrays of 2048 integers each. Before launching the GPU kernel I call a global subroutine on the GPU to set 6 constant arrays of 2048 integers each equal to the input arrays. (I believe this loads them into high-speed read-only constant memory, if I remember correctly.) I then launch the kernel with a 2048-element array of doubles to collect the results, generate the next set of 2048 input values, and repeat.
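To make that concrete, here is a stripped-down sketch of what the setup looks like (array and subroutine names are placeholders, the real per-thread calculation is replaced by a stand-in, and I’ve shown the constant arrays being filled by plain host assignment):

```fortran
module gpu_inputs
  use cudafor
  implicit none
  integer, parameter :: N = 2048
  ! the 6 read-only input arrays, placed in constant memory (placeholder names)
  integer, constant :: c_in1(N), c_in2(N), c_in3(N)
  integer, constant :: c_in4(N), c_in5(N), c_in6(N)
contains
  attributes(global) subroutine compute(results)
    real(8), device :: results(N)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= N) then
      ! stand-in for the real long-running calculation on c_in1(i)..c_in6(i)
      results(i) = dble(c_in1(i) + c_in2(i) + c_in3(i) + &
                        c_in4(i) + c_in5(i) + c_in6(i))
    end if
  end subroutine compute
end module gpu_inputs

program driver
  use cudafor
  use gpu_inputs
  implicit none
  integer :: h_in1(N), h_in2(N), h_in3(N), h_in4(N), h_in5(N), h_in6(N)
  real(8), device :: d_results(N)
  real(8) :: h_results(N)
  integer :: batch

  do batch = 1, 100
    ! stand-in for generating the next 2048 input values on the host
    h_in1 = batch; h_in2 = batch; h_in3 = batch
    h_in4 = batch; h_in5 = batch; h_in6 = batch
    ! fill the constant arrays (host-to-device copies into constant memory)
    c_in1 = h_in1; c_in2 = h_in2; c_in3 = h_in3
    c_in4 = h_in4; c_in5 = h_in5; c_in6 = h_in6
    call compute<<<N/256, 256>>>(d_results)
    h_results = d_results   ! copy the 2048 doubles back to the host
    ! ... consume h_results ...
  end do
end program driver
```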
The GPU only takes about 2 to 8 seconds to finish the 2048 threads, so it is constantly doing I/O and wasting a lot of time. I’d like to launch, say, 10,000+ threads at a time to get better performance, since these calculations run for weeks overall. But 6 arrays of 2048 integers each appears to use up all of the 48 KB or so of read-only memory (6 × 2048 × 4-byte integers is exactly 48 KB), and I get insufficient-memory errors if I increase the number of values in the 6 input arrays much past that.
So is there a way to use the 3+ GB of main video-card memory to load up more values, or to stream in groups of new values as the old ones finish, so I can save the ~200 ms of I/O time I’m wasting every few seconds and let the GPU churn away longer?
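What I’m picturing, though I don’t know whether it’s the right approach, is swapping the constant arrays for ordinary device arrays in the card’s main (global) memory and sizing them for many batches at once, roughly like this (again placeholder names and a made-up size):

```fortran
module gpu_inputs_big
  use cudafor
  implicit none
  integer, parameter :: NBIG = 65536   ! hypothetical size: many batches' worth of inputs
  ! ordinary device arrays live in the GPU's main (global) memory,
  ! so the ~48 KB read-only limit no longer applies
  integer, device :: d_in1(NBIG), d_in2(NBIG), d_in3(NBIG)
  integer, device :: d_in4(NBIG), d_in5(NBIG), d_in6(NBIG)
contains
  attributes(global) subroutine compute_big(results, n)
    integer, value :: n
    real(8), device :: results(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
      ! same per-thread work as before, but the inputs now come from global memory
      results(i) = dble(d_in1(i) + d_in2(i) + d_in3(i) + &
                        d_in4(i) + d_in5(i) + d_in6(i))
    end if
  end subroutine compute_big
end module gpu_inputs_big
```

The host side would then do one big assignment per array (d_in1 = h_big1, and so on) to push a large batch of inputs onto the card, launch with enough blocks to cover NBIG threads (e.g. call compute_big<<<(NBIG+255)/256, 256>>>(d_results_big, NBIG)), and only come back for I/O once per NBIG values instead of once per 2048. Whether losing the read-only cache hurts the kernel, or whether the copies could be overlapped with compute using streams, is exactly what I’m asking about.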
I have access to both a Fermi C2050 GPU and a GTX 680 (GK104), if it matters.
I’m assuming that come July, when I get my hands on a GK110, which can run kernels that launch other kernels, I’ll be able to fix this by simply turning the main GPU-calling loop into another kernel. But I’m wondering whether I can do anything for the older GPUs.