Hello again everyone,
I’m writing because I’m a little confused about some of the hardware implications of the CUDA paradigm.
Specifically, I’ve got an application that uses (actually, ‘exploits’ would be a better word) both the CPU and every CUDA-capable GPU it can find, all at essentially the same time.
All very well and good. But this particular app requires that both the CPU and all of the available GPUs work exclusively with system memory (at least, for all intents and purposes). Specifically, there is a little over two gigabytes of system memory that all the GPUs work with, and the same amount of system memory that the CPU works with. And the whole idea is that the CPU can do its work while all of the GPUs are busy doing theirs.
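In case it helps to see what I mean by “working with system memory”, here is a stripped-down sketch of the kind of allocation I’m using - zero-copy mapped pinned memory, so the kernels read and write system memory directly over PCI-e. The buffer size and the kernel below are placeholders, not my actual code:

```
#include <cuda_runtime.h>

// Placeholder kernel: every read/write goes to mapped system memory over PCI-e.
__global__ void scale(float *data, size_t n, float factor)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const size_t n = 1 << 20;   // placeholder size; the real buffers total ~2 GB

    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned, mapped ("zero-copy") system memory: the GPU never touches
    // its own on-board memory for this buffer.
    float *hostBuf = NULL;
    cudaHostAlloc((void **)&hostBuf, n * sizeof(float), cudaHostAllocMapped);

    // Device-side alias for the same system-memory buffer.
    float *devPtr = NULL;
    cudaHostGetDevicePointer((void **)&devPtr, hostBuf, 0);

    scale<<<(n + 255) / 256, 256>>>(devPtr, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFreeHost(hostBuf);
    return 0;
}
```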
Adding those up, that’s a little over four gigabytes of system memory being accessed by the CPU and all available GPUs - i.e. far more than current GPU “on-board” memory configurations can provide…
The program does work, but I have this nagging suspicion that the CPU isn’t really doing anything while the GPUs are busy, because the GPUs are essentially ‘hogging the bus’.
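To try to confirm (or hopefully refute) that suspicion, the test I have in mind is to time a chunk of pure CPU work on its own, and then time the same work again while a long-running kernel is chewing on mapped system memory. Something roughly like this (the kernel and the CPU loop are stand-ins, not my real workload):

```
#include <cstdio>
#include <cstring>
#include <chrono>
#include <cuda_runtime.h>

// Stand-in GPU job: repeatedly reads/writes mapped system memory (grid-stride loop).
__global__ void busyKernel(float *data, size_t n, int iters)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        for (int k = 0; k < iters; ++k)
            data[i] = data[i] * 1.000001f + 1.0f;
}

// Stand-in CPU job: returns how long a fixed amount of arithmetic takes.
static double timeCpuWork(size_t reps)
{
    auto t0 = std::chrono::steady_clock::now();
    volatile double acc = 0.0;
    for (size_t i = 0; i < reps; ++i)
        acc += (double)i * 1e-9;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    const size_t n = 1 << 24;   // placeholder buffer size
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *hostBuf = NULL, *devPtr = NULL;
    cudaHostAlloc((void **)&hostBuf, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devPtr, hostBuf, 0);
    memset(hostBuf, 0, n * sizeof(float));

    // Baseline: CPU work with the GPU idle.
    printf("CPU alone:       %.2f s\n", timeCpuWork(200000000));

    // Same CPU work while the kernel hammers system memory over the bus.
    // (The launch is asynchronous, so the CPU loop runs concurrently.)
    busyKernel<<<256, 256>>>(devPtr, n, 1000);
    printf("CPU while busy:  %.2f s\n", timeCpuWork(200000000));

    cudaDeviceSynchronize();
    cudaFreeHost(hostBuf);
    return 0;
}
```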
I’m not a hardware person, but I think that the problem is that the system bus is being overwhelmed by all the requests for system memory that all of the processors are making.
So I did a little reading on the Internet about microcomputer buses, and found out (correct me if I’m wrong here) that there are, in fact, two buses - one for the CPU (called the “local bus”), and one for the PCI-e interface (called the “external” or “peripheral” bus).
So theoretically (based on what little I know), there should be little to no conflict between CPU memory accesses, and ‘peripheral’ (aka: GPU) memory accesses, assuming that they’re not accessing the same memory (and in my application, they’re not).
Unfortunately, however, that’s not what I’m seeing. When I ‘crank up’ the app to the point where all of the GPUs need to spend a lot of time processing (on the order of minutes), the whole system slows down to an absolute crawl. Just minimizing a window can take upwards of 30 seconds while the GPUs are busy doing their thing!
But that doesn’t make much sense to me. I thought the whole idea of having separate processors was that each one could be busy calculating whatever it’s been given while the CPU gets on with whatever it needs to do. But if the CPU has to sit there and wait for all of the GPUs to finish accessing memory (or vice-versa), then where’s the benefit?
So my first question is obviously: is that actually the case? Do GPUs only achieve true ‘multiprocessing’ when they’re working out of their own on-board memory (or that of other GPUs)? And when they’re not, do they essentially ‘shut down’ the CPU’s access to system memory?
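For comparison, here’s my (possibly naive) understanding of the “on-board memory” way of doing things: stage the data into the GPU’s own memory, run the kernel there, and copy the results back, so the bus is only busy during the two transfers. Again, the sizes and the kernel are placeholders:

```
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n, float factor)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const size_t n = 1 << 20;                   // placeholder size
    const size_t bytes = n * sizeof(float);

    float *hostBuf = NULL, *devBuf = NULL;
    cudaMallocHost((void **)&hostBuf, bytes);   // pinned host buffer for fast DMA
    cudaMalloc((void **)&devBuf, bytes);        // the GPU's on-board memory

    // The bus is only busy here...
    cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);

    // ...the kernel itself now runs entirely out of on-board memory,
    // leaving system memory (and the PCI-e bus) to the CPU...
    scale<<<(n + 255) / 256, 256>>>(devBuf, n, 2.0f);

    // ...and here.
    cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return 0;
}
```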
Obviously, I’m hoping that someone can tell me that this isn’t the case, but I’m getting this sinking feeling that it is…
Which leads me to my second question, which is: Is there hardware out there that anyone knows about, that attempts to either solve, mitigate, or at the very least, address this problem?
For example, I’ve noticed that the new K20 boards advertise having “Two DMA Engines”, which NVidia says makes them better suited for “parallel computing” than the GeForce 680, which only has one. Then again, the GeForce 680 supports PCI-e gen 3.0, which is twice the speed of what the K20 board supports. So does any of that make any kind of difference with regard to concurrent accesses to system memory?
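From what I can gather (and please correct me if I’ve got this wrong), the “two DMA engines” business is about being able to have an upload and a download in flight at the same time as a kernel, which in CUDA terms you’d express with pinned buffers, streams and async copies - something like the sketch below (placeholder sizes and kernel again; my understanding is that with only one copy engine the two transfers get serialized even though they’re in different streams):

```
#include <cuda_runtime.h>

__global__ void process(float *data, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const size_t n = 1 << 20;               // placeholder chunk size
    const size_t bytes = n * sizeof(float);

    // Async copies require pinned host memory.
    float *hIn = NULL, *hOut = NULL, *dIn = NULL, *dOut = NULL;
    cudaMallocHost((void **)&hIn, bytes);
    cudaMallocHost((void **)&hOut, bytes);
    cudaMalloc((void **)&dIn, bytes);
    cudaMalloc((void **)&dOut, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // Host-to-device copy and kernel for the current chunk in one stream...
    cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, up);
    process<<<(n + 255) / 256, 256, 0, up>>>(dIn, n);

    // ...while the device-to-host copy (think: results of the previous chunk)
    // runs in the other stream. Two copy engines let both transfers overlap.
    cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, down);

    cudaDeviceSynchronize();

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFree(dIn);  cudaFree(dOut);
    cudaFreeHost(hIn);  cudaFreeHost(hOut);
    return 0;
}
```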
Also, further reading on the Internet has led me to the conclusion that what I’m looking for is what’s called a “symmetric multiprocessing (SMP) architecture”. Does the CUDA programming paradigm claim to be in that class?
So many questions, so little space - are NUMA systems still being made? What about HyperTransport buses? Are there plans to release an HTX3-compatible K20? Are there even any HTX3 buses out there? DMA, PCI-e 3.0, DDR3, clock speeds, memory speeds, bus speeds, CPU speeds, speed kills, need pills, ahhhh!! Brain overload…
Anyway, I think I may be getting ahead of myself here. Truth be told, all I’ve got at the moment is a single GT 525M inside a Dell Inspiron laptop. More than enough to develop the software on, to be sure (kudos to NVidia for that, BTW), but my software wasn’t developed to run on just that. And because the software is essentially finished at this point, I’m seriously looking into what kind of hardware I need to purchase to get the results I’m after. Hence the myriad questions…
So thanks for listening, and happy computing!