The limited bandwidth of the bus between the GPU's GDDR5 and the host's DDR3 is severe. I am trying to make my program distributed, and the slow bus takes away all the advantages that distributed programming can give you. I am not saying I am deserting the effort; I will try it on a server with GPUDirect RDMA, but I am fairly certain that adding another K20 will bring a loss rather than a gain.
So the future trend is fast uniform memory for both the GPU and the CPU, such as GDDR5. For instance, this has been introduced in the new PlayStation 4, where the CPU can see GPU memory and the other way round. Right now we are limited both by the PCI Express bus and by the network bus in MPI.
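To be fair, a small piece of this already exists: mapped (zero-copy) pinned host memory lets a kernel dereference host DDR3 directly over PCIe, so the address space is uniform even if the speed is not. A minimal sketch, with an illustrative kernel and size:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative kernel: increment each element of a host-mapped buffer.
    __global__ void inc(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapping host memory

        float *h, *d;
        cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d, h, 0);   // device alias of h
        for (int i = 0; i < n; ++i) h[i] = 0.0f;

        // The kernel reads and writes host DDR3 over PCIe: no cudaMemcpy.
        inc<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();

        printf("h[0] = %f\n", h[0]);             // prints 1.000000
        cudaFreeHost(h);
        return 0;
    }

It does not remove the bus bottleneck at all; it only shows the single-address-space model we should be getting at full speed.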
We are losing a great deal of efficiency with the current architecture. Far too much.
In my CUDA lectures I do not tell my students to measure the time spent on memory transfers. I regard the way memory transfers are done now as primitive, something that will change in the future. I am certain that one day we will look back at the way we did memory transfers the way we now look at the video games of the 80s.
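For anyone who does want to put a number on that cost, it is only a few lines with CUDA events. A minimal sketch (the 256 MB buffer size is arbitrary):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256 << 20;          // 256 MB, arbitrary
        float *h, *d;
        cudaMallocHost((void **)&h, bytes);      // pinned host memory
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("H->D: %.2f ms, %.2f GB/s\n", ms, bytes / ms / 1e6);

        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFree(d); cudaFreeHost(h);
        return 0;
    }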
Well, some good things can come out of restrictions. Bus limitations force you to think of ways to squeeze and optimize things. That experience is invaluable and I will carry it with me.
Anyway, I found a way to make my code 19 times faster by reducing the time spent in memory transfers, giving more work to the devices, and using more collective communications. The problem is that what I am doing is an iterative process, and the data need to be updated constantly. A challenge worthy of my time. Let's see.
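To give a flavor without giving away the design: the pattern is to keep the state resident on each device, bring down only the small piece the ranks must agree on, and exchange that with one collective. A sketch with illustrative names and a stand-in kernel, nothing from my actual code:

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Stand-in for the real iteration kernel: the bulk work stays on the
    // device; only k floats of "partial result" ever cross the bus.
    __global__ void step(float *state, float *partial, int n, int k) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < k && i < n) partial[i] = state[i];
    }

    void iterate(float *d_state, int n, int k, int iters) {
        float *d_partial, *h_partial, *h_global;
        cudaMalloc((void **)&d_partial, k * sizeof(float));
        cudaMallocHost((void **)&h_partial, k * sizeof(float));  // pinned
        cudaMallocHost((void **)&h_global,  k * sizeof(float));

        for (int it = 0; it < iters; ++it) {
            step<<<(k + 255) / 256, 256>>>(d_state, d_partial, n, k);

            // Move only k floats over PCIe, not the whole state.
            cudaMemcpy(h_partial, d_partial, k * sizeof(float),
                       cudaMemcpyDeviceToHost);

            // One collective instead of many point-to-point messages.
            MPI_Allreduce(h_partial, h_global, k, MPI_FLOAT, MPI_SUM,
                          MPI_COMM_WORLD);

            cudaMemcpy(d_partial, h_global, k * sizeof(float),
                       cudaMemcpyHostToDevice);
        }
        cudaFree(d_partial);
        cudaFreeHost(h_partial);
        cudaFreeHost(h_global);
    }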
As I see it, with parallel programming and new hardware coming out, there should be a worldwide consortium to transfer experience. Summer schools are quite trendy at the moment, and I will teach the experience I am accumulating to young students. Anyway, my 10 cents.
It would be nice if you discussed a particular solution to your problem(s) so that others can benefit :)
Somewhat related to your points, I came across this article (access might require ACM membership) that was thought-provoking in terms of future algorithm design:
Yes, I have reached the conclusion that accelerating this application with MPI or SLI is not possible, so it is beyond current technology. Not in theory, of course. There is a need for uniform memory, and a need for the compiler to generate multi-GPU code simply by being told how many GPUs you have, much like OpenMP. A CUBLAS aware of this should be produced, and so on. So my conclusion from this experiment is that there exist applications, like my current iterative one, that see no speedup, only delay, from distributed means on current hardware technology. Unfortunately.
My advice to NVIDIA: in this application I do not need host memory at all; everything runs entirely in GPU memory. This application could see a speedup if the card's PCB had two or more GPUs with uniform memory, i.e. if each GPU could see the entire memory. So my advice for the new K30 is this. I have no problem with capacity; the problem fits very well in 5-6 GB. I think it is rather lazy hardware architecture that the GPUs cannot address the entire memory and that the memory is instead split in two. Take for example the simple saxpy in CUBLAS. CUBLAS could split the vector in two and do the addition on both halves. Right now the addition takes close to 1.5 ms, and daring to do it over MPI only delays things; with two GPUs sharing the same memory it would take 0.75 ms. You might say 1.5 ms is insignificant, but multiply by 300 iterations and around 20 saxpys per iteration and it becomes 9 s; add all the other operations and you reach 90 s on a Tesla M2070Q and around 70 s on a K20. Cutting that to 35 s would be quite a deal. It is actually how Ken Kutaragi envisioned things with the CELL architecture: if you cannot put it all in one chip, use more chips but connect them with a very fast bus. Simple logic. I have been following hardware implementations of parallel architectures for quite a long time.
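Absent that uniform memory, the closest one can get today is to split the vectors by hand across two devices, which only pays off if each half stays resident on its own GPU for the whole run. A sketch of the idea (error checking omitted; d_x[i] and d_y[i] are assumed already allocated on device i):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // y = a*x + y, with the lower half on device 0 and the upper half on
    // device 1. If the halves first have to be shipped between devices,
    // the PCIe copies eat the gain.
    void saxpy2(cublasHandle_t h[2], float a,
                float *d_x[2], float *d_y[2], int n) {
        int half = n / 2;
        for (int dev = 0; dev < 2; ++dev) {
            cudaSetDevice(dev);
            int len = (dev == 0) ? half : n - half;
            cublasSaxpy(h[dev], len, &a, d_x[dev], 1, d_y[dev], 1);
        }
        for (int dev = 0; dev < 2; ++dev) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
        }
    }

Each handle h[dev] must have been created with cublasCreate while device dev was current. With true uniform memory, CUBLAS could do all this bookkeeping internally, which is exactly my point.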
Anyway, my two cents to NVIDIA. :-)
Of course I will try a complete redesign, reducing network communication and HOST<->DEVICE memory transfers to the minimum possible. It's a great challenge: I have 7 GPUs ready to crunch numbers and I am limited by these constraints… Let's see if I can break the entire thing into distributed pieces that join only when absolutely necessary, and then with the minimum amount of data transfer. It's a challenge indeed, and I still have brain cells to burn. :-)
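When pieces on GPUs inside the same node do have to join, one way to avoid staging through host memory is a peer-to-peer copy over PCIe, where the hardware supports it. A minimal sketch for devices 0 and 1 (assumed to share a PCIe root complex):

    #include <cuda_runtime.h>

    // One-time setup: let device 0 access device 1's memory directly.
    bool enable_p2p() {
        int ok = 0;
        cudaDeviceCanAccessPeer(&ok, 0, 1);  // can device 0 reach device 1?
        if (!ok) return false;
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // flags must be 0
        return true;
    }

    // Join step: pull n floats from device 1 into device 0. With P2P
    // enabled this goes GPU-to-GPU over PCIe; without it, the driver
    // stages the copy through host memory.
    void join(float *d0_dst, const float *d1_src, size_t n) {
        cudaMemcpyPeer(d0_dst, 0, d1_src, 1, n * sizeof(float));
    }

This only helps inside one node, of course; across nodes it is back to MPI and the network.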
Well, my problem is with today's technology. I will say only this much here; the rest goes into a paper and into the summer school I will teach. It is not a publicity stunt; it is simply risky, market-wise, to say too much about my findings here. Thanks for reading me.