Can a CUDA kernel read "mapped, pinned" host memory through a "Device Pointer"?

Hello all,

I don’t know how basic a question this is, so please bear with me…

If I’m on a system that does not have Unified Addressing enabled, and I allocate a block of “pinned” host memory (using the Driver API), then get a “device pointer” to that memory (also using the Driver API), can I pass that pointer value to a kernel (by whatever mechanism) and expect the kernel to be able to read the value (or values) at that address?

In short, will a running kernel be able to correctly interpret a “CUdeviceptr” value as a valid memory address?

The docs strongly imply that it can, stating:

“…memory that is page-locked and accessible to the device”, and
“Since the memory can be accessed directly by the device,…”

                                    CUDA Toolkit Reference Manual, cuMemHostAlloc()

If it can, is it assumed by the kernel to be a “generic” address that points to memory in the “global” state space?

If it can’t, well…

The docs say:

“Unified addressing is automatically enabled in 64-bit processes on devices with compute capability greater than or equal to 2.0.”

…and…

“Unified addressing is not yet supported on Windows Vista or Windows 7 for devices that do not use the TCC driver model.”

So if there really is no way to pass a pointer to “mapped, pinned host memory” that a kernel can correctly interpret as a valid pointer, is there a way to turn on the “TCC model” in Windows 7, even though my GPU is not a Tesla?

Thanks in advance…

Certainly in the Runtime API, what you describe is the correct approach, and I have used it on devices before unified addressing existed. The process is simply: allocate a pinned host memory region with cudaHostAlloc(), get a device pointer to it with cudaHostGetDevicePointer(), and then pass that pointer to the kernel just like a normal device pointer.
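A minimal sketch of that sequence (error checking omitted; the kernel name, sizes, and launch configuration here are just placeholders):

#include <cuda_runtime.h>

__global__ void readHostMem(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // reads travel over PCIe to the pinned host buffer
}

int main(void)
{
    const int n = 1 << 20;
    float *h_in = NULL, *d_in = NULL, *d_out = NULL;

    cudaSetDeviceFlags(cudaDeviceMapHost);               // enable mapped pinned memory
    cudaHostAlloc((void **)&h_in, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_in, h_in, 0);   // device-side alias of h_in
    cudaMalloc((void **)&d_out, n * sizeof(float));

    readHostMem<<<(n + 255) / 256, 256>>>(d_in, d_out, n);  // pass it like any device pointer
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_in);
    return 0;
}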

I have no experience with the Driver API, so I don’t know how CUdeviceptr works (it appears to be typedef’d to an unsigned integer…). However, if the Runtime API is any indication, what you describe should also work with the Driver API.

Thank you, thank you, thank you, Mr. Seibert,

You’ve restored my confidence in the whole CUDA paradigm, believe it or not…

I was VERY worried that the “CUdeviceptr” value would be useless in the context of a running kernel.

In which case, to make a long story short, much, if not all, of my current efforts would have been wasted, to say the least…

That said, I still have to express my trepidation that both you and I may still be mistaken. I’ve spent a shameful amount of time developing software that is in large part based on the assumption that a “CUdeviceptr” pointer to “mapped, pinned” host memory can be successfully accessed by a running kernel… If that assumption is wrong, well, as the song says, more fool me…

However, armed with your reassurance, I’m going to continue developing this monolithic software on the assumption that a “CUdeviceptr” pointer to “mapped, pinned” host memory can be successfully accessed by a running kernel…

Thanks again. I’ll endeavour to let you know if it all blows up in my face (just kidding… actually, no, not really)…

seibert is correct, don’t worry :-)

To pass the appropriate pointer to the kernel, I use code similar to the following (in the case of a pointer declared in the .cu source file):

void *buffer;               // allocated with cuMemHostAlloc() and the CU_MEMHOSTALLOC_DEVICEMAP flag
CUmodule module;            // loaded with cuModuleLoad() or cuModuleLoadData()
CUdeviceptr mapped_ptr, p;  // p could also be a kernel launch parameter
size_t bytes = 0;

// get the device-side alias of the pinned host buffer
cuMemHostGetDevicePointer(&mapped_ptr, buffer, 0);

// locate the __device__ pointer variable in the module and copy the mapped pointer into it
cuModuleGetGlobal(&p, &bytes, module, "nameOfTheDevPointerInCUDACode"); // put your name from the .cu file here
cuMemcpyHtoD(p, &mapped_ptr, bytes);
...
// and in the kernel code there goes the pointer...
__device__ void *nameOfTheDevPointerInCUDACode;
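
If you prefer to pass the pointer as a launch parameter instead (the alternative mentioned in the comment above), a rough sketch using cuLaunchKernel would be (the function handle and launch configuration are placeholders):

CUfunction func;                          // from cuModuleGetFunction(&func, module, "kernelName")
void *params[] = { &mapped_ptr };         // kernel declared as: __global__ void kernelName(void *ptr)
unsigned int gridDimX = 1120, blockDimX = 384;   // example launch configuration

cuLaunchKernel(func,
               gridDimX, 1, 1,            // grid dimensions
               blockDimX, 1, 1,           // block dimensions
               0, NULL,                   // dynamic shared memory bytes, stream
               params, NULL);             // kernel parameters, extra options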

Hope this helps,
MK

Thanks cmaster, that definitely does help. And thanks Jimmy for the reassurance.

As it turns out, I’m passing the “device pointer” to my kernel in exactly the same way as your code demonstrates, except that I’m doing it using the Driver API, and there are slightly over three quarters of a million of them, contained in three (kernel) arrays, each pointer referencing a block of “mapped, pinned” memory…

Why so many? Because each element in all three arrays represents the data for a single thread, and my project demands that I “saturate” all available GPUs at all times…

Speaking of which, there is another, much smaller issue that’s come up, having to do with the whole issue of “saturating” a GPU…

I hope you’ll excuse me for going “off thread” on this (no pun intended), but it is kind of related (well, okay, not really, but it does relate to the same project)…

On page 158 of the “CUDA C Programming Guide Version 4.0”, there’s a table that, in essence, states that, regardless of a GPU’s “compute capability”, the “Maximum number of resident blocks per multiprocessor” is always 8.

It also states that the “Maximum number of threads per block” for compute capability “2.x” is 1024.

So, since my GPU has a “compute capability” of 2.1, to get the “Maximum number of resident threads per multiprocessor” (for “saturation” purposes), I should be able to multiply the two numbers together to get: 8 * 1024 = 8192.

But here’s the kicker: if I query the GPU itself (using the Driver API function cuDeviceGetAttribute), I can actually ask it for the “Maximum Threads per Multiprocessor” (CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR)…

When I do that however, the GPU reports that the “Maximum Threads per Multiprocessor” is 1536.
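
For reference, the query I’m doing amounts to something like this (device 0 assumed, error checks omitted):

CUdevice dev;
int maxThreadsPerSM = 0, maxThreadsPerBlock = 0;

cuDeviceGet(&dev, 0);
cuDeviceGetAttribute(&maxThreadsPerSM,
                     CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
cuDeviceGetAttribute(&maxThreadsPerBlock,
                     CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
// on my compute capability 2.1 card these come back as 1536 and 1024 respectively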

So which number is correct? Is this a misnomer or something? Is it actually referring to the maximum number of threads per CUDA Core, or something along those lines?

So the only logical conclusion, of course, is that my GPU is incapable of basic multiplication.

Should this worry me? ;)

I don’t have the old CUDA 4.0 documentation handy. I am looking at the most recent CUDA C Programming Guide, Appendix F, and the table there which describes the different compute capabilities correctly shows “Maximum number of resident threads per multiprocessor” as 1536 for compute capability 2.x. The maximum number of threads per thread block is 1024 for compute capability 2.x. And the “maximum number of resident blocks per multiprocessor” is correctly stated as 8 in the table.

These three different limitations all apply simultaneously, so to get the maximum number of blocks running they cannot be of the maximum allowed block size. For full occupancy on compute capability 2.0 (i.e. 1536 threads) one typically targets six blocks of 256 threads each, or eight blocks of 192 threads each. How many blocks can actually run concurrently on a multiprocessor is then a question of register and shared memory usage for each thread block.
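
As a rough illustration of how the block-count and thread-count limits combine (register and shared memory limits ignored):

/* resident blocks per SM on compute capability 2.x, ignoring register
   and shared memory usage */
int residentBlocks(int threadsPerBlock)
{
    int byThreads = 1536 / threadsPerBlock;    /* thread limit per SM */
    int byBlocks  = 8;                         /* block limit per SM  */
    return byThreads < byBlocks ? byThreads : byBlocks;
}
/* residentBlocks(1024) = 1 -> 1024 resident threads
   residentBlocks(256)  = 6 -> 1536 resident threads (full occupancy)
   residentBlocks(192)  = 8 -> 1536 resident threads (full occupancy) */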

Thanks very much for that reply Mr. Juffa. You are indeed a powerhouse of information on all things CUDA.

Now, of course, I’ll have to revise my code to use (4 X 384) threads per SM, instead of my previous (8 X 1024) threads per SM. Eminently doable though…

However, (he said, pointing his index finger at the ceiling)…

I recall one of your more salient previous posts, in which you state:

“For memory-bound streaming kernels on Fermi-class GPUs it is in general beneficial to target a total thread-block count per grid of at least 20x the number of concurrently running thread blocks. Let us assume we are able to concurrently run four thread blocks of 384 threads per SM on a GPU with 14 SMs. One would want to run a grid with at least 20 x 4 x 14 = 1120 thread blocks to achieve the maximum memory throughput. The exact amount of “over-subscription” necessary to maximize the memory throughput is a function of many variables, in particular the ratio of computation to bandwidth, but 20x is a useful rule of thumb. Running with an “over-subscription” factor that is too small can reduce the achievable memory bandwidth by up to 15% if I recall correctly from experiments with the various kernels from the STREAM benchmark.”

Sorry to have quoted the whole thing, but I just didn’t think a link would have had the same impact…

So even back then, you knew that (4 X 384) = 1536 was the number of resident threads needed for full occupancy - I just wasn’t listening at the time (for reasons unknown)…

My question, though, is whether or not I should apply such an “over-subscription” factor in a program that is designed to sniff out and exploit, to the fullest extent possible, all NVIDIA GPUs, regardless of whether or not they’re “Fermi-class”.

I’m currently using an “over-subscription” factor of 16, which in essence means that I (or more specifically, my program) should be generating (4 X 384 X 16) threads per SM, in order to “maximize the memory throughput”. You recommended a factor of 20, but I have to admit, that number just seemed too huge to me (and still does), so I finagled it…
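
In code form, the sizing I’m doing amounts to roughly this (with the SM count queried per device; dev is the device handle from the earlier query):

int numSMs = 0;
cuDeviceGetAttribute(&numSMs, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);

int blocksPerSM     = 4;      // concurrent blocks per SM
int threadsPerBlock = 384;    // 4 x 384 = 1536 = full occupancy on my 2.1 device
int oversub         = 16;     // my "finagled" over-subscription factor

int totalBlocks = blocksPerSM * numSMs * oversub;   // grid size to launch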

I’m not a hardware guy, so I’m actually not at all sure what you’re talking about when you refer to the “memory throughput”, at least not in this context, but the important point for me is your comment that not using an “over-subscription” factor, or using one that is too small, can “reduce the achievable memory bandwidth by up to 15%”. Well, for me, assuming that would affect the execution time by roughly the same amount, that contingency is just too damn scary to ignore (excuse my French)…

So do you, in your expert opinion, think I should keep using an “over-subscription” factor of 16? Is that a good “guesstimate” for a pseudo-generic “over-subscription” factor, or is there significantly more to the story?

Thanks in advance for any input you can provide…

If your app is limited by memory bandwidth, my recommendation for Fermi-class GPUs still stands. The more compute-bound your app is, the less important the “oversubscription” becomes. In my experiments a factor of about 20x was needed to achieve maximum memory throughput for completely memory-bound code with minimal computation. A smaller factor will give you slightly less than optimal throughput. By no means would I recommend contorting one’s code to get to the desired factor of 20x; this is simply a heuristic for cases where it is fairly trivial to select the total number of blocks and threads launched and the code is known to be limited by memory throughput.

“Oversubscription” is not needed on sm_1x devices to maximize the memory throughput. For sm_3x I haven’t run experiments of that nature yet.

Thanks for your prompt and informative reply…

I thought of a solution while I was reading your reply, because it occurred to me that what you’re talking about reminds me very much of a “hardware calibration”, which I have some experience in (I used to write software for CNC machines)…

So, being a software person, the solution that occurs to me would simply be to have the program “self-calibrate” each GPU on startup. Simple.

All it would have to do is accurately measure the execution time of each GPU (for the one and only kernel involved) using “over-subscription” factors of 1, 2, 4, 6, 8, 10, … up to 20, then divide each execution time by its “over-subscription” factor (since the amount of work launched scales with the factor). The factor that yields the smallest normalized time would then be the one to use for that particular GPU. Again, eminently doable, albeit more work for me…
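
Something along these lines, perhaps (Driver API; the function name, the factor list, and the omitted kernel parameters are placeholders):

// pick the over-subscription factor that gives the lowest time per unit of work
int chooseOversubFactor(CUfunction func, int blocksPerSM, int threadsPerBlock, int numSMs)
{
    const int factors[] = { 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 };
    int   best        = factors[0];
    float bestPerUnit = 1.0e30f;

    for (int i = 0; i < (int)(sizeof(factors) / sizeof(factors[0])); i++) {
        CUevent start, stop;
        float   ms = 0.0f;

        cuEventCreate(&start, CU_EVENT_DEFAULT);
        cuEventCreate(&stop,  CU_EVENT_DEFAULT);

        cuEventRecord(start, 0);
        cuLaunchKernel(func,
                       blocksPerSM * numSMs * factors[i], 1, 1,
                       threadsPerBlock, 1, 1,
                       0, NULL, NULL, NULL);        // kernel parameters omitted here
        cuEventRecord(stop, 0);
        cuEventSynchronize(stop);
        cuEventElapsedTime(&ms, start, stop);

        float perUnit = ms / factors[i];            // normalize by the amount of work launched
        if (perUnit < bestPerUnit) { bestPerUnit = perUnit; best = factors[i]; }

        cuEventDestroy(start);
        cuEventDestroy(stop);
    }
    return best;
}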

I think that would work. What do you think?

One approach that is trivial to implement is to always use a very large thread count where easily possible. This provides for optimal performance with Fermi GPUs and doesn’t hurt on sm_1x GPUs. I have used this approach successfully for simple streaming kernels.

I agree that auto-tuning appears to be a good strategy for adapting code to the differences between GPU generations in general. Note that auto-tuning is not a new idea; it has a fairly long tradition for optimized CPU libraries such as ATLAS and FFTW.

I performed some manual tuning experiments on GPUs with a small set of code, and the performance differences between a one-size-fits-all approach and one where the launch configuration was tuned for a particular GPU were up to 10%. In addition to the launch configuration, I also experimented with different amounts of padding for data structures. There are probably other tradeoffs one could consider for an auto-tuning approach, e.g. exchanging integer and single-precision floating-point operations, but I haven’t looked into that in detail.

How much work you want to invest in an auto-tuning approach will primarily be a function of how badly you want to extract the last few percent of performance at the end of the development cycle. For example, if you have already achieved a 5x application acceleration by using GPUs, an additional 10% performance increase may not seem necessary or attractive relative to the cost. This is basically no different from the optimization process for CPUs, where one can get into the details of cache and TLB usage optimization, code layout optimization during linking, etc., as more extreme forms of squeezing the last ounce of performance from a given code.