Why is WinOF-2 performance so poor with multiple threads?

johnpub · April 27, 2023, 1:07pm

Hi, All:

I’m working on Windows RDMA (Network Direct) with the WinOF-2 driver installed. I found that the Write and Send APIs are
extremely slow when multiple threads share one QP; it seems there may be locks inside the implementation of these APIs.

The latency of a single API call is roughly as follows:

- 300 ns with 1 thread
- 600~700 ns with 2 threads
- 5000~5700 ns with 5 threads
- 10~15 us (microseconds) with 10 threads
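As a quick sanity check on these numbers, the sketch below compares them with a simple model in which calls from N threads are fully serialized, so per-call latency would be about N times the single-thread latency. The model is an assumption made for illustration, not a statement about how WinOF-2 is implemented, and the midpoints of the reported ranges (650 ns, 5350 ns, 12500 ns) are used as inputs.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// One reported data point: thread count and measured per-call latency.
struct Sample {
    int threads;
    double measured_ns;
};

// Ratio of the measured latency to what full serialization alone would
// predict (threads * single-thread latency). A ratio near 1 fits the model.
double serialization_ratio(const Sample& s, double single_thread_ns) {
    assert(s.threads > 0 && single_thread_ns > 0.0);
    return s.measured_ns / (s.threads * single_thread_ns);
}

// Midpoints of the latency ranges reported in the post above.
std::vector<Sample> reported_samples() {
    return { {1, 300.0}, {2, 650.0}, {5, 5350.0}, {10, 12500.0} };
}

// Prints one line per data point; call this from main() to see the ratios.
void print_ratios() {
    for (const Sample& s : reported_samples()) {
        std::printf("%2d threads: %.2fx the serialized estimate\n",
                    s.threads, serialization_ratio(s, 300.0));
    }
}
```

Under this model the ratio stays close to 1 with 2 threads but is roughly 3.6 with 5 threads and 4.2 with 10 threads, so the reported growth is steeper than simple serialization alone would explain.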

Why is this?
Can I get the source code of the WinOF-2 driver somewhere?

References:

https://github.com/microsoft/NetworkDirect/blob/master/docs/IND2QueuePair.md#ind2queuepairwrite

# IND2QueuePair interface
Use to exchange data with a remote peer.
The [IND2Adapter::CreateQueuePair](./IND2Adapter.md#ind2adaptercreatequeuepair) method returns this interface.

The IND2QueuePair interface inherits the methods of the [IUnknown](https://docs.microsoft.com/windows/desktop/api/unknwn/nn-unknwn-iunknown) interface. In addition, IND2QueuePair defines the following methods.

- [__Flush__](#ind2queuepairflush) - Cancels all outstanding requests in the inbound and outbound completion queues.
- [__Send__](#ind2queuepairsend) - Sends data to a remote peer.
- [__Receive__](#ind2queuepairreceive) - Receives data from a remote peer.
- [__Bind__](#ind2queuepairbind) - Binds a memory window to a buffer that is within the registered memory.
- [__Invalidate__](#ind2queuepairinvalidate) - Invalidates a local memory window.
- [__Read__](#ind2queuepairread) - Initiates an RDMA Read request.
- [__Write__](#ind2queuepairwrite) - Initiates an RDMA Write request.

__Remarks:__

If you do not retrieve the outstanding requests from the completion queue before releasing your last reference to this queue pair, you may get back requests from the completion queue that were issued on a now-closed queue pair.

## IND2QueuePair::Flush
Cancels all outstanding requests in the Receive and Initiator queues.


https://github.com/microsoft/NetworkDirect/blob/master/docs/IND2QueuePair.md#ind2queuepairsend


sribhargavid · April 28, 2023, 8:38pm

Hello @johnpub,

Thank you for posting your query on our community.

Regarding your concern about poor throughput: a single QP does not provide full line rate. Our design is built for running multiple QPs in parallel, which compensates for the single-QP rate limit. This is a design limitation, and we do not have any tuning for it. To achieve full throughput, we recommend testing with two QPs.
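One way to lay out threads around multiple QPs, assuming the application can give each thread its own QP, is sketched below. `FakeQueuePair` and its `post()` method are hypothetical stand-ins used only for illustration; a real application would hold one `IND2QueuePair` per thread and issue Send or Write there. The sketch shows the threading layout only, not the Network Direct calls.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a queue pair; this is NOT the Network Direct API.
struct FakeQueuePair {
    std::size_t posted = 0;
    void post() { ++posted; }  // a real QP would issue Send/Write here
};

// Runs `threads` workers, each posting `per_thread` requests on its own QP.
// Returns the total number of requests posted across all QPs.
std::size_t run_one_qp_per_thread(std::size_t threads, std::size_t per_thread) {
    std::vector<FakeQueuePair> qps(threads);  // one QP per thread, never shared
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&qps, t, per_thread] {
            for (std::size_t i = 0; i < per_thread; ++i) {
                qps[t].post();  // only thread t ever touches qps[t]
            }
        });
    }
    for (auto& w : workers) w.join();

    std::size_t total = 0;
    for (const auto& qp : qps) {
        assert(qp.posted == per_thread);
        total += qp.posted;
    }
    return total;
}
```

Because no QP is shared, threads never contend on the same QP object in this layout; whether that removes the latency growth seen with WinOF-2 would still need to be measured.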

Regarding your question about WinOF-2 source code, we would like to inform you that it is not publicly available. Our engineering team cannot provide it unless there is a special justification for such a request.

If you require further assistance with this, I suggest opening a support case for further investigation of the issue. A support ticket can be opened by emailing Networking-support@nvidia.com.

Please note that an active support contract is required for this. If you do not have a current support contract, please reach out to our Contracts team at networking-contracts@nvidia.com.

Thank you,
-Nvidia Network Support

johnpub · May 9, 2023, 1:50pm

Hi, does this reply mean I should not use one QP from multiple threads for Send and Write?
I may not need full line rate or full throughput; one QP is enough for my upper-layer apps, but it must support multiple threads.

What I care about is this: are there any ways to optimize the latency of the Send or Write API in a multi-threaded environment? The latency of a single call seems to grow linearly with the number of threads, which really upsets me.
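One pattern that could be tried when a single QP must serve many threads is to funnel all requests through one dedicated posting thread, so that only that thread ever calls Send or Write on the QP. The sketch below is a minimal version of that idea under stated assumptions: `WorkRequest`, `SinglePoster`, and the `post_fn` callback are hypothetical names, and `post_fn` stands in for the real `IND2QueuePair` call. Whether this actually lowers latency with WinOF-2 is untested here.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical work request; a real one would carry buffers, remote address, etc.
struct WorkRequest { int id; };

// Funnels requests from many producer threads to a single posting thread.
class SinglePoster {
public:
    explicit SinglePoster(std::function<void(const WorkRequest&)> post_fn)
        : post_fn_(std::move(post_fn)), worker_([this] { run(); }) {}

    ~SinglePoster() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();  // the worker drains the queue before exiting
    }

    void submit(WorkRequest wr) {
        {
            std::lock_guard<std::mutex> lk(m_);
            assert(!done_);  // submitting after shutdown is a bug
            q_.push(wr);
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty()) return;  // shutdown requested and nothing left
            WorkRequest wr = q_.front();
            q_.pop();
            lk.unlock();
            post_fn_(wr);  // only this thread ever calls into the QP
            lk.lock();
        }
    }

    std::function<void(const WorkRequest&)> post_fn_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<WorkRequest> q_;
    bool done_ = false;
    std::thread worker_;  // declared last so it starts after the other members exist
};

// Demo: 4 producer threads submit 100 requests each; returns how many were posted.
int run_demo() {
    int posted = 0;  // touched only by the posting thread
    {
        SinglePoster poster([&posted](const WorkRequest&) { ++posted; });
        std::vector<std::thread> producers;
        for (int t = 0; t < 4; ++t) {
            producers.emplace_back([&poster, t] {
                for (int i = 0; i < 100; ++i) poster.submit(WorkRequest{t * 100 + i});
            });
        }
        for (auto& p : producers) p.join();
    }  // destructor drains the queue and joins the posting thread
    return posted;
}
```

This moves contention from the QP to a user-level queue, and each request pays for a handoff between threads, so it is a trade-off to measure rather than a guaranteed improvement.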
