Dazed and Confused..

Hello again everyone,

I’m writing this because I’m a little confused about some of the hardware implications involved with the CUDA paradigm.

Specifically, I’ve got an application that uses (actually, ‘exploits’ would be a better word) both the CPU and all available CUDA-capable GPUs it can find, all at essentially the same time.

All very well and good. But this particular app requires that both the CPU and all of the available GPUs work exclusively with system memory (for all intents and purposes, at least). Specifically, there is a little over two gigabytes of system memory that all the GPUs work with, and the same amount of system memory that the CPU works with. And the whole idea is that the CPU can do its work while all of the GPUs are busy doing theirs.

So there is a little over four gigabytes of system memory being accessed by both the CPU, and all available GPUs - i.e. way more than can be provided by current “on-board” memory configurations…

The program does work, but I have this nagging suspicion that the CPU isn’t really doing anything while the GPUs are busy, because the GPUs are essentially ‘hogging the bus’.

I’m not a hardware person, but I think that the problem is that the system bus is being overwhelmed by all the requests for system memory that all of the processors are making.

So I did a little reading on the Internet about microcomputer buses, and found out (correct me if I’m wrong here) that there are, in fact, two buses - one for the CPU (called the “local bus”), and one for the PCI-e interface (called the “external” or “peripheral” bus).

So theoretically (based on what little I know), there should be little to no conflict between CPU memory accesses, and ‘peripheral’ (aka: GPU) memory accesses, assuming that they’re not accessing the same memory (and in my application, they’re not).

Unfortunately, however, that’s not what I’m seeing. When I ‘crank up’ the app to the point where all of the GPUs need to spend a lot of time processing (on the order of minutes), the whole system slows down to an absolute crawl. Just minimizing a window can take upwards of 30 seconds while the GPUs are busy doing their thing!

But that doesn’t make much sense to me. I thought the whole idea of having separate processors was so that they can be busy calculating whatever they’re given at the same time as the CPU is calculating whatever it needs to do. But if the CPU has to sit there and wait for all of the GPUs to finish accessing memory (or vice-versa), then where’s the benefit?

So my first question is obviously: is that actually the case? Do GPUs only perform true ‘multiprocessing’ when they’re accessing their own on-board memory (or that of other GPUs)? And when they’re not, do they essentially ‘shut down’ the CPU’s access to system memory?

Obviously, I’m hoping that someone can tell me that this isn’t the case, but I’m getting this sinking feeling that it is…

Which leads me to my second question, which is: Is there hardware out there that anyone knows about, that attempts to either solve, mitigate, or at the very least, address this problem?

For example, I’ve noticed that the new K20 boards advertise the fact that they have “Two DMA Engines”, which NVidia says makes them better suited for “parallel computing” than the GeForce 680, which only has one. But then again, the GeForce 680 supports PCI-e gen 3.0, which is twice the speed of what the K20 board supports. So does any of that make a difference with regard to concurrent accesses to system memory?

Also, further reading on the Internet has illuminated me to the fact that what I’m looking for is what’s called a “symmetric multiprocessing (SMP) architecture”. Does the CUDA programming paradigm claim to be in that class?

So many questions, so little space - are NUMA systems still being made? What about HyperTransport buses? Are there plans to release an HTX3-compatible K20? Are there even any HTX3 buses out there? DMA, PCI-e 3.0, DDR3, clock speeds, memory speeds, bus speeds, CPU speeds, speed kills, need pills, ahhhh!! Brain overload…

Anyway, I think I may be getting ahead of myself here. Truth be told, all I’ve got at the moment is a single GTX 525M inside of a Dell Inspiron laptop. More than enough to develop the software, to be sure (kudos to NVidia for that, BTW), but my software wasn’t developed to be run on just that. And because the software is essentially finished at this point, I’m seriously looking into what kind of hardware I need to purchase to ensure the results I’m after. Hence the myriad questions…

So thanks for listening, and happy computing!

If you run the GPUs out of pinned system memory instead of GPU device memory, you reduce the memory bandwidth available to the GPU from roughly 208 GB/s to 4-6 GB/s bi-directional. This may not be the best design decision.
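In code, the two configurations look roughly like this minimal sketch (buffer size and names are made up for illustration): zero-copy mapped pinned host memory forces every kernel access across PCIe, while an ordinary device allocation keeps the working set in on-board memory.

```cpp
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 << 20;               // illustrative 256 MB buffer

    // Zero-copy path: pinned host memory mapped into the GPU's address space.
    // Every kernel access through d_mapped travels over PCIe (the 4-6 GB/s path).
    cudaSetDeviceFlags(cudaDeviceMapHost);
    float *h_mapped = 0, *d_mapped = 0;
    cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);

    // On-board path: copy once, then kernel accesses run at the GPU's full
    // device-memory bandwidth (roughly 208 GB/s on a K20).
    float *d_local = 0;
    cudaMalloc((void **)&d_local, bytes);
    cudaMemcpy(d_local, h_mapped, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_local);
    cudaFreeHost(h_mapped);
    return 0;
}
```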

Tesla and Quadro GPUs have had two DMA engines since Fermi. GeForce GPUs are limited to one DMA engine. Multiple DMA engines improve concurrency between host-to-device and device-to-host transfers, but the application has to use CUDA streams explicitly to see the benefit.
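A rough sketch of what that explicit stream usage might look like (function and buffer names are illustrative; the host buffers must be pinned for the copies to be truly asynchronous):

```cpp
#include <cuda_runtime.h>

// With two DMA engines the upload and the download below can be in flight at
// the same time; with a single engine they serialize even in separate streams.
void overlapped_transfers(float *d_in, const float *h_in,
                          float *h_out, const float *d_out, size_t bytes)
{
    cudaStream_t upload, download;
    cudaStreamCreate(&upload);
    cudaStreamCreate(&download);

    cudaMemcpyAsync(d_in,  h_in,  bytes, cudaMemcpyHostToDevice, upload);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, download);

    cudaStreamSynchronize(upload);
    cudaStreamSynchronize(download);
    cudaStreamDestroy(upload);
    cudaStreamDestroy(download);
}
```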

You should profile both the GPU code and the CPU code to determine how much memory bandwidth you are actually using before drawing conclusions.
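One cheap way to get a number on the transfer side (a sketch; the profiler tools report the same information without any code changes) is to time a copy with CUDA events:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Times one host-to-device copy and prints the achieved bandwidth.
// For a realistic figure, h_buf should be pinned (cudaHostAlloc) memory.
void measure_h2d(const float *h_buf, float *d_buf, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```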

Don’t use system memory for the GPU computation.

You must follow this procedure:

1.) Copy your data on the GPU memory (from system memory to GPU memory)

2.) Let the GPU do its job (e.g. “Run the KERNEL”)

3.) Copy the results back (from GPU memory to system memory)

And try to minimise PCI bus data transfers in general, because PCIe bandwidth sucks hard compared to the hardware connection between the GPU core and the GPU (“device”) memory.
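In code, the three steps above boil down to something like this minimal sketch (the ‘scale’ kernel is just a placeholder for the real work):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-element processing.
__global__ void scale(float *data, size_t n, float factor)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void process(float *h_data, size_t n)
{
    float *d_data = 0;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_data, bytes);                         // 1.) system memory -> GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);            // 2.) run the kernel

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // 3.) GPU memory -> system memory
    cudaFree(d_data);
}
```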

Thanks for the replies.

Ming: The whole purpose of using GPUs in the first place is to multiprocess approximately two gigabytes of memory. If there are 2,000 threads, then each thread will be responsible for processing (concurrently) exactly 1/2,000th of that 2 gigabyte memory area. If there are 10,000 threads, then each thread will be responsible for processing exactly 1/10,000th of that same 2 gigabyte memory area. And if there are 64,000 threads, well, you get the idea.
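In kernel terms, the partitioning I’m describing looks roughly like this sketch (the per-element work is obviously a stand-in):

```cpp
// Each launched thread walks its own 1/Nth slice of the ~2 GB buffer,
// where N is the total number of threads in the grid.
__global__ void process_slices(float *buf, size_t total)
{
    size_t nthreads = (size_t)gridDim.x * blockDim.x;
    size_t tid      = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t slice    = (total + nthreads - 1) / nthreads;

    size_t begin = tid * slice;
    size_t end   = begin + slice;
    if (end > total) end = total;

    for (size_t i = begin; i < end; ++i)
        buf[i] += 1.0f;   // placeholder for the real per-element processing
}
```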

So how do I copy 2 gigabytes worth of data into 6 megabytes of on-board memory? Obviously, I cannot.

So what I think you’re suggesting is that I break up the task into about 340 sequential tasks (2 gig / 6 meg), each of which will only (indeed, can only) “multitask” 6 megabytes of on-board memory at a time. With all due respect, is that really a better solution? I mean, even if what Greg said is right, that the GPU can process device memory 42 times faster than system memory, that still means that what you’re suggesting will be 340/42 = 8 times slower than what I’ve got now. And that’s assuming that the program can transfer system memory to on-board memory instantaneously, which it cannot.

But it gets even worse when you consider that the device memory available to use for the target data is actually far less than the board’s total device memory, because after all’s said and done, about half of that will get used up by the kernel, its registers, its stack, the driver, and any thread-id-specific data pointer arrays required to keep track of what thread processes what data (especially in 64-bit). Not to mention any employed ECC, which I understand eats up another 10% of the device memory…

So now we’re talking about 680 sequential tasks to do the job of the original one multiprocessed task. Which, according to the previous logic, will be (in the ballpark of) 16 times slower than the original. So where is the flaw in this logic?

Greg: “4-6 GB/s bi-directional” doesn’t sound that bad to me (my 525M offers, I think, somewhere around 1.5 GB/s). But my real concern is not the bandwidth of the GPUs - it’s finding hardware that won’t cut the CPU off at the knees if that bandwidth is saturated (which it will be)…

So anyway, just trying to understand this stuff… Didn’t mean to come across as too argumentative or anything… Apologies if it sounded that way to anyone…

Check the current range of GPUs. You will find the top crop to have 6 gigabytes of on-board memory, not 6 megabytes.
Problem solved.

Um, oops…

I may have made a small mistake in my calculations. It would appear that the total amount of device memory on your average high-end card is a whole lot closer to six gigabytes than the six megabytes I was basing my calculations on…

So yeah, it would obviously be better to transfer the whole two gig to the device before launch…

Sorry 'bout that…

Never mind…

Didn’t mean to bypass your post tera - I actually wrote mine before I saw yours - no, really!

So thanks anyway. Now I think I’ll go bang my head against a hard surface for a little while…