This is a bit of a long post/question, but I’ve been stuck on this for a few weeks…
I have an application which, I believe, is transfer-bound (D2D traffic through global memory, and D2H copies). As such, I'm looking into ways of tackling both.
The overall program structure, in pseudo-code, is roughly as follows:

// A BIG array of memory is malloc'd on the GPU.
// (I'm testing with an array of 1024 doubles, but in the future I'll be storing 1024x1024x1024.)
for (int n = 0; n < BIG_NUMBER; ++n) {
    kernelA<<<...>>>(array_ptr, ...);
    kernelB<<<...>>>(array_ptr, ...);
    if (n % F == 0) {
        // Copy array back to host (D2H)
        // Print array
    }
}
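Concretely, the copy + print step is just a blocking copy followed by a loop over the host buffer, roughly like this (N and host_array are placeholder names; my real code differs slightly):

cudaMemcpy(host_array, array_ptr, N * sizeof(double), cudaMemcpyDeviceToHost);
for (int i = 0; i < N; ++i)
    printf("%f ", host_array[i]);
printf("\n");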
Note:
- BIG_NUMBER is usually 100,000 (but will most likely range from 50,000 to 10,000,000 in the future).
- I'd like F to be 1 in a perfect world, but I find I need to set it to 100 to get a large performance boost without losing too much information.
When BIG_NUMBER = 100,000 and the array has 4096 elements, here are the run-times for different values of F:
F = 1 —> 567s
F = 10 —> 213s
F = 100 —> 103s
F = inf. —> 91s
So you can see that if I never copy (D2H) + print (F = infinity) the execution time is 91s, compared to 567s when printing at every step (F = 1).
Also, my kernelB has ~20 global memory reads at the beginning and end of the function (as I need some variables to persist between kernel calls). I found that if I simply comment them out, my execution time (with F = inf) drops from 91s to 1s! [But, who knows, perhaps the compiler is being naughty? {although I have no optimization flags set}]
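To illustrate what I mean (the names here are made up and I'm simplifying; the real kernel is messier), kernelB is structured roughly like this:

__global__ void kernelB(double* array /*, ... */)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // ~20 reads at the start: pull this thread's persistent state back out
    // of global memory (each thread owns a contiguous block of doubles).
    double s0 = array[tid * 20 + 0];
    double s1 = array[tid * 20 + 1];
    // ... and so on up to s19 ...

    // ... the actual computation, using s0..s19 ...

    // ~20 accesses at the end: write the state back so it survives until
    // the next kernel launch (and is there for the D2H copy).
    array[tid * 20 + 0] = s0;
    array[tid * 20 + 1] = s1;
    // ... and so on up to s19 ...
}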
To pre-emptively answer some questions you may have at this stage:
- Yes, I require 2 kernels (as I need grid-level synchronization).
- Yes, I NEED to print results, as I will be examining the state of this array at each point in time - not just the final result. Also, the data the program outputs is then used to generate plots.
So, the way I see it, I can improve my application speed significantly if I can optimize these 2 ‘transfers’. After reading and thinking about this I can only think of 2 ideas:
1. Making sure my memory transactions are coalesced
I'm not sure if they are? I keep re-reading that section of the programming guide and it just won't sink in. I have an array of doubles (20 elements for each thread) and each thread writes multiple values to a separate, contiguous section of memory (i.e. thread 0 writes its doubles to array[0..19], and so on). See the first sketch below this list for what I mean, and what I think the coalesced alternative would look like.
2. Perhaps using mapped pinned memory (aka. 'Zero Copy')
I read the PinnedMemoryAPI white-paper and am not 100% clear on the benefits (or whether it's appropriate for my problem). If I understand correctly, I wouldn't need to allocate any global memory on the GPU AND I'd get greater read/write speeds? But I wouldn't be able to control F, the frequency at which this happens? The second sketch below this list shows how I think the setup would look.
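Here is the sketch for idea 1. Layout A is what I believe I have now (each thread owns a contiguous block of 20 doubles); Layout B is what I understand a coalescing-friendly layout would look like, with element k of every thread stored contiguously. These kernels are toy examples (the stored value is just a placeholder), not my real code - I mainly want to know whether moving from A to B is what the programming guide is asking for.

// Layout A: thread-private contiguous blocks (what I have now, I think).
// Within a half-warp the accessed addresses are 20*sizeof(double) = 160
// bytes apart, so each access seems to become its own memory transaction.
__global__ void writeBlocked(double* array)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < 20; ++k)
        array[tid * 20 + k] = (double)(tid + k);        // placeholder value
}

// Layout B: interleaved ("element k of every thread" stored contiguously).
// Consecutive threads now write consecutive doubles, which I believe is
// what compute 1.3 hardware needs in order to coalesce.
__global__ void writeInterleaved(double* array, int numThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < 20; ++k)
        array[k * numThreads + tid] = (double)(tid + k); // placeholder value
}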
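And here is the sketch for idea 2, as I understand it from the white-paper (error checking omitted; I haven't tried this yet, so it's just how I imagine the setup): the kernels get a device alias of pinned host memory, so their reads/writes go across PCIe directly and the explicit D2H copy disappears.

// Must be called before any CUDA context is created.
cudaSetDeviceFlags(cudaDeviceMapHost);

double* h_array;   // host pointer: print from this
double* d_array;   // device alias: pass this to the kernels
cudaHostAlloc((void**)&h_array, N * sizeof(double), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&d_array, h_array, 0);

for (int n = 0; n < BIG_NUMBER; ++n) {
    kernelA<<<...>>>(d_array, ...);
    kernelB<<<...>>>(d_array, ...);
    if (n % F == 0) {
        cudaThreadSynchronize();   // make sure the kernels' writes have landed in host memory
        // Print h_array
    }
}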
If anyone else has any other ideas please feel free to mention them :)
Thanks for taking the time to read this.
ADDITIONAL INFO:
- GPU = TESLA C1060 on a decent i7 system
- OS = Ubuntu 9.04
- SDK = 2.3
- Kernel profile (ptxas info):
  kernelA - ptxas info : Used 124 registers, 104+0 bytes lmem, 28+16 bytes smem, 132 bytes cmem[1]
  kernelB - ptxas info : Used 10 registers, 16+16 bytes smem, 4 bytes cmem[1]