Method to Cycle steal DMA write into DDR5

talisin9 · December 6, 2017, 9:12am

I noticed Bob's excellent coverage of the array issue … in part related to these questions:

" For your particular case, where the group size is reasonably large (441) and less than the max limit on threads per block (1024), I think it makes a lot of sense, at least performance wise, to assign each group to a block. With this method we can dispense with atomics entirely."

Before setting out a TCAM (ternary CAM) operation of 1023 samples, each with 128 8-bit values, I believe this is ideal for a GTX 960 - 980 under CUDA. The issue with such a simple correlation is the necessary rate (>100,000 per second, or close to it). Thus, one front server is assigned to feed the 128 bytes at said rate through an IB network card into 2 back servers, each holding millions of medical samples (separated into one of the said 1023 possible test classes).
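For readers unfamiliar with the term, here is a minimal CPU-side sketch of what a ternary best-fit match over 128-byte samples could look like. The post does not define the scoring rule, so everything here is an assumption: the `TernaryEntry` layout, the "fewest mismatching bits" metric, and the names `mismatchBits` and `bestFit` are hypothetical, chosen only to illustrate the idea.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSampleBytes = 128;  // 128 8-bit values per sample (from the post)

using Sample = std::array<std::uint8_t, kSampleBytes>;

// Hypothetical ternary entry: a mask bit of 0 means "don't care" for that bit.
struct TernaryEntry {
    Sample value{};
    Sample mask{};
};

// Count the bits where the query disagrees with the entry, ignoring masked-out bits.
inline int mismatchBits(const Sample& query, const TernaryEntry& e) {
    int bits = 0;
    for (std::size_t i = 0; i < kSampleBytes; ++i) {
        std::uint8_t diff = static_cast<std::uint8_t>((query[i] ^ e.value[i]) & e.mask[i]);
        while (diff != 0) {  // clear the lowest set bit until none remain
            diff = static_cast<std::uint8_t>(diff & (diff - 1));
            ++bits;
        }
    }
    return bits;
}

// Index of the entry with the fewest mismatching bits (first one wins on ties),
// or -1 if the table is empty.
inline int bestFit(const Sample& query, const std::vector<TernaryEntry>& table) {
    int best = -1;
    int bestScore = static_cast<int>(kSampleBytes * 8) + 1;  // worse than any real score
    for (std::size_t i = 0; i < table.size(); ++i) {
        const int score = mismatchBits(query, table[i]);
        if (score < bestScore) {
            bestScore = score;
            best = static_cast<int>(i);
        }
    }
    assert(table.empty() == (best == -1));  // postcondition: a non-empty table always yields a hit
    return best;
}
```

On a GPU the outer loop over entries is the part that would be spread across threads or blocks, in the spirit of the "one group per block" advice quoted above; this sketch only pins down the comparison itself.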

Each back server has a similar IB adapter and holds its samples in RAM (in-mem, memcache); no hard drive is used, except in separate backup/restore ops.

THE QUESTION is how to get the 128 8-bit values up into the GPU to test a best-fit algorithm without having to transfer the data from server RAM (fewer write ops). As the address is implied by the packet header, the DMA engine is wasted… It would be much better if GPUs came with 5Gb/s ports, but they do not… thus the term “cycle stealing” a PCIe write from the inbound IB adapter… Is there an operation possible on the GPU to synchronize the IB transfer to both server memory and GPU memory as a simultaneous write? The GPU address is defined in the packet header. Naturally a datagram is best, but I don't think a zero-overhead translation (or any GPU compute?) is available…

wkailey · December 6, 2017, 3:47pm

This may help, but I’m not sure, because I do not completely understand everything in your question. But FWIW: in my recent successful GPU development, I used shared RAM between the GPU and the host processor. This was done by allocating pinned RAM. The same block of RAM then had GPU pointers and host processor pointers and appeared at different addresses on each machine. Like I said, this may or may not help you.

talisin9 · December 7, 2017, 4:11am

Sharing pinned RAM requires overhead (cudaMallocHost, as does malloc, et al…). The annoying feature of all APIs is that benchmarks are geared for a round-robin (alloc / transfer / dealloc) cycle, as per https://www.cs.virginia.edu/~mwb7w/cuda_support/pinned_tradeoff.html Totally useless.

To say nothing of the OS kernel hugging, which deters efficiency. Looking at the pinned mem speedup, Fig 3 shows returns above 128MB transfers as making sense. Not only is that bollocks, it's about as useful as bollocks on a bull. I need CUDA to divorce the OS. That's the idea behind an Accel process.
Not taking 6 ms to transfer any data from 1 byte to 1 MB, as the flat line shows in Fig 1 of the ref.

To begin with, I don't wish to change the array size… it is prescribed both in depth (128 samples) and range (1023 test classes); only the membership (volume) increases (think of patients), up to the max DRAM capacity before moving to a new node. It appears an overhead exists in each assignment in CUDA as it does for C++ (bounds checking, nested loops, etc…). THAT is what I want to avoid.

RDMA is an example of OS kernel bypass, once the QPair is assigned (and a few details). You are talking of host RAM being pinned by cudaMallocHost, with PCI transfers that include an address… from host or GPU.
That's the old dumb way to do things.

I am talking of cycle stealing a write op (from an IB card on the PCI bus). As I said, NVIDIA aren't smart enough to include the obvious (an IB port), whilst wasting 10 years developing a bottleneck API (CUDA).
Maybe they can hire farsighted engineers… or pay my royalties when the patent issues (not just a pretty IB interface).

njuffa · December 7, 2017, 5:44am

I would say forget the linked document at cs.virginia.edu (NOW!) and do your own measurements using a platform and transfer sizes relevant to your use case.

I assume you already know that you do not want to use Windows as a platform, where with the WDDM driver model the OS is in charge of GPU memory allocation.

(1) When using pinned memory, one will normally hold onto a pinned buffer for as long as one can, and re-use it as often as you can. Allocating pinned buffers is costly because it requires physically contiguous pages. The linked write-up seems to assume an allocate / copy / deallocate cycle, which is a completely unsuitable approach.

(2) The performance difference between transfers from/to pageable host memory vs pinned host memory depends significantly on the throughput of the host’s system memory. Use a host with as many DDR4 channels as you can afford and as high a speed of DDR4 as the platform will support (usually -2400 or -2666 at this time) if you want transfers from/to pageable memory to be fast.

(3) PCIe uses packetized transport so throughput increases with transfer size, before leveling off at the upper limit around a transfer size between 8 MB and 16 MB. It is pretty bad for small transfers, such as the 128 byte blocks envisioned here, so try batching if you can. However the minimum time for small transfers should be a lot better than what is shown in the graphs (assuming I understand the graphs correctly), so do your own throughput measurements to figure out whether you are even close to the requirements for the use case.
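To make point (3) concrete, here is a back-of-the-envelope model of why small transfers fare so badly: a fixed per-transfer cost plus a size-proportional cost. The model and the `LinkModel` struct are assumptions for illustration only, and any numbers plugged in are hypothetical placeholders; as stated above, measured values from your own platform are what count.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical link description. Both fields must come from your own measurements.
struct LinkModel {
    double fixedOverheadSec;  // per-transfer cost, independent of size (assumed model)
    double peakBytesPerSec;   // throughput approached by very large transfers
};

// Estimated wall time for one transfer of `bytes` bytes.
inline double transferSeconds(const LinkModel& m, std::size_t bytes) {
    assert(m.peakBytesPerSec > 0.0);
    return m.fixedOverheadSec + static_cast<double>(bytes) / m.peakBytesPerSec;
}

// Effective throughput for back-to-back transfers of a given size.
inline double effectiveBytesPerSec(const LinkModel& m, std::size_t bytes) {
    return static_cast<double>(bytes) / transferSeconds(m, bytes);
}
```

With placeholder values of 10 µs fixed overhead and 10 GB/s peak, a 128-byte transfer is almost pure overhead and lands at roughly 0.13% of peak, while a 16 MB transfer gets above 99% of peak. The shape of that curve, not the specific figures, is the argument for batching.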

Not sure what overhead you are referring to. CUDA is a language in the C++ family, so yes, the memory access semantics are generally the same. Since the CUDA toolchain is derived from LLVM, it also incorporates all the usual sophisticated optimizations applied by modern C++ compilers. By parallelizing with CUDA, loops should disappear or be reduced relative to equivalent CPU code, and standard optimizations like strength reduction can be applied to address computation. C++ doesn’t do bounds-checking for ordinary arrays, and neither does CUDA. If you have out-of-bounds array accesses, bad things can and probably will happen in your code.

talisin9 · December 7, 2017, 6:50am

Using CentOS 7.3 / IB ConnectX-2 (new 2016 dual port 40Gb) with backend Lustre 2.10.2 (hot off the press). As you can imagine, it is not a walk in the park to get a meaningful benchmark without lots of PRE-planning. In-mem OS and in-mem DB are hard to mount. Happy to send 4 nodes, if you're in the US.
Or hook you into the data center. My main issue is I'm a hardware engineer. IF I want something done, I design HW lookup tables / state machines. Those days are gone. Enter Xilinx and HMC.

The GPU has a major deficiency: LACK of isolated I/O. Damn PCI is like a mains water pipe.

Yes, I hear what you're saying: better to batch the transfers. But they come in from each client as a 256-byte (repeat x2) packetised message. We can collect them as random 256-byte samples, but each time a transfer occurs it's through IB, thus PCI (the only thing receiving Ethernet is the http:// front end).

Thus for each IB transfer, I begin by asking whether a method exists in CUDA to sniff the PCI bus on a host write WITHOUT INCURRING overhead. We can rename it ZeroCopy, but how to synchronise the transfer into GPU RAM?? I can provide a decoded write pulse from the IB card that delivers each datum. (Further, as the http:// server accumulates all client data, it's possible to burst several MB into the cluster host RAM, only I prefer to route it directly into each subject page (there's a fixed total of 1023 subjects) rather than stuff around sorting from any random host node.)

The idea is that “packet decode addressing” goes straight to a table in RAM… no buffer (apart from the QP), and the nuisance OS prerequisites kept to a minimum. It's a shame everything hangs off an OS…

This table in host RAM requires one write, and that is the same write I need to direct a copy of to the GPU.
Are you able to see the question??? It begins from that small step… How to cycle steal on CUDA.
Very basic operation; if CUDA can't do it, it's a waste of time.

njuffa · December 7, 2017, 7:11am

Sorry, it has been twenty years since I was involved in the design of hardware, and that was core units in x86 microprocessors, none of that nasty I/O stuff :-)

This is a forum for CUDA programming questions, while your questions seem to revolve more around hardware issues. That means we have at best an impedance mismatch and at worst a disconnect: I am not really catching on what your use case is and what issues you are facing. In the other direction, you probably have little prior experience with GPUs and relevant programming concepts.

I assume you have studied NVIDIA’s documentation on RDMA and are using an IB adapter that comes with an RDMA-enabled driver (such as a Mellanox IB solution) so you can send data directly from IB to the GPU without going through the host memory. I don’t think cycle-stealing (as I understand the concept from my school days) is a relevant concept with these components anymore, but I could be wrong.

NVIDIA makes GPUs with PCIe interface because that is what the common relevant system platforms use. NVIDIA did not design PCIe, so it is what it is. If you want to spend more money, you can get GPUs with NVLINK, which may or may not suit your use case better. Before you ask, I know nothing about NVLINK other than that it provides significantly higher throughput than PCIe gen 3.

You may want to ring the local NVIDIA office and ask to be put in touch with a field application engineer oriented towards hardware who is familiar with I/O interfacing issues and the relevant hardware lingo.

wkailey · December 7, 2017, 3:28pm

Well said, njuffa:

When using pinned memory, one will normally hold onto a pinned buffer for as long as one can, and re-use it as often as you can. Allocating pinned buffers is costly because it requires physically contiguous pages. The linked write-up seems to assume an allocate / copy / deallocate cycle, which is a completely unsuitable approach.

In my application using pinned memory, each thread operates on a fairly small amount of memory, typically a few kilobytes; but I allocate the memory in a constructor. The object I construct gets used over and over, and the few kilobytes per thread adds up to many gigabytes over the life of the application. Maybe this strategy is not applicable to talisin9’s application. I have no way of knowing.

talisin9 · December 8, 2017, 1:23am

njuffa: I appreciate the IB to GPU direct via RDMA; of course I'm aware of it (as well as the 90,000 read ops/sec possible via NFSoIB, and Mellanox's dropout of NFS support, due to it outshining RoCE perf), but the point is missed. (The wood for the trees is a continual social syntax error.) The x86 arch is finished. It has enjoyed a long sojourn, if only on the back of compiler tech. All that is about to change dramatically. As also the network stack, filled with hung options never used.

wkailey has a good practical point: I can use 4k through 16k transfers (in block T). But for an in-mem database (so far limited to roll-your-own (not immediately appealing due to the need to release quickly), memcache or Aerospike) the entire OS must operate in-mem… including the DB… The many GB pointed out does not say over what timeframe… sure, with a pure C / C++ application many GB is standard, but not within 60 seconds. The kernel stack (as all stacks) must prioritize tasks onto a limited resource (the threads available to the CPU). If I have 50,000 x 4 packet (256-byte) messages to route per second, the address decode / data strip overhead of each message is considerable, particularly if this is a constant data rate (it isn't, but the plan is for worst-case traffic). Thus here is the GPU and CUDA question… that fits right into the vast CUDA toolset… this ought to contain a “cycle stealing” instruction (or several operating in conjunction with constructors). I recall the bit-blitter as an early block transfer mechanism used by game writers…
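As a rough illustration of the batching arithmetic in this post, here is a sketch that packs fixed-size 256-byte messages into one reusable 16 KB staging block. The class `MessageBatcher` and the helper `batchesPerSecond` are hypothetical names, and reading "50,000 x 4 packet (256-byte)" as 200,000 packets of 256 bytes per second is an assumption, not something the post states outright.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kMessageBytes = 256;        // message size from the post
constexpr std::size_t kBatchBytes = 16 * 1024;    // upper end of the 4k-16k range mentioned
constexpr std::size_t kMessagesPerBatch = kBatchBytes / kMessageBytes;

// Accumulates fixed-size messages into one staging buffer that is allocated once and reused.
class MessageBatcher {
public:
    MessageBatcher() : buffer_(kBatchBytes) {}

    // Copy one message into the next slot; returns true when the batch is full.
    bool push(const std::uint8_t* msg) {
        assert(count_ < kMessagesPerBatch);  // caller must reset() after a full batch
        std::memcpy(buffer_.data() + count_ * kMessageBytes, msg, kMessageBytes);
        ++count_;
        return count_ == kMessagesPerBatch;
    }

    const std::uint8_t* data() const { return buffer_.data(); }
    std::size_t bytesUsed() const { return count_ * kMessageBytes; }
    void reset() { count_ = 0; }  // keeps the allocation, only rewinds the write position

private:
    std::vector<std::uint8_t> buffer_;
    std::size_t count_ = 0;
};

// Number of full-or-partial batches needed per second for a given message rate.
constexpr std::size_t batchesPerSecond(std::size_t messagesPerSecond) {
    return (messagesPerSecond + kMessagesPerBatch - 1) / kMessagesPerBatch;
}
```

Under the assumed reading, 200,000 packets per second is 51.2 MB/s of payload and 3,125 transfers of 16 KB per second instead of 200,000 small ones. In a real pipeline the staging buffer would presumably be pinned host memory allocated once (for example with `cudaHostAlloc`) and held for the life of the application, as advised earlier in the thread; a plain `std::vector` stands in for it here so the sketch compiles without the CUDA toolkit.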

Unlike most other applications that can wait for the x86 to free up a process… I need to operate at the stream, which is why GPUs were invented. Stream processing… 32-thread blocks with self address indexing… I understand that is an internal core operation to GPU memory (GDDR5, kernel or global). I also understand all convention takes control of the bus, locking any contention out. Thus I don't need to do RDMA twice to get data into both host mem and GPU at the same time… IS this clearer? Can I spell it out again (groundhog day)? Is there a sync workaround instruction in CUDA or at the assembly level on any GPU (preferably 960 through 980) that can snoop the data on a host write from PCI (that is where the data from IB or any card will transit in a peripheral R/W)? … Or otherwise, can the host northbridge share a GPU write with a simultaneous write to host mem (less likely)? I need 2 writes to occur in the one instruction interval… Anyone doing CUDA ought to know the fastest way to save another write instruction… via at least a serial write (a few clocks extra, delaying the data). Either via an RDMA or, where the IB adapter is on the same host as the GPU, simply a DMA (faster than RDMA)… Who's counting how many different ways exist to describe this?
