Mmap shared memory / CUDA calculations are slow

I’m developing an application that passes data between threads and processes using mmap. Some threads do CPU processing, and I’d like some to use CUDA, all targeted at the Jetson AGX.

I’ve been trying different experiments, and I’m surprised to find that the CUDA implementations are quite slow, even for something like matrix multiplication. For example, a (4x4) times (4x130000) matrix multiplication with Eigen takes about 3.5 ms on the CPU, but using cublasSgemm it takes about 8.7 ms. If I hand-code the CUDA multiplication, that comes down to 3.7 ms, no faster than the CPU.
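For reference, the CPU baseline is roughly the following (an illustrative sketch, not the actual benchmark code):

#include <Eigen/Dense>

void cpuBaseline() {
    // A 4x4 matrix times a 4x130000 matrix, single precision.
    Eigen::Matrix4f M = Eigen::Matrix4f::Random();
    Eigen::Matrix<float, 4, Eigen::Dynamic> in =
        Eigen::Matrix<float, 4, Eigen::Dynamic>::Random(4, 130000);
    Eigen::Matrix<float, 4, Eigen::Dynamic> out = M * in;  // ~3.5 ms on the AGX CPU
}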

This is NOT true on an x86 desktop, where the CUDA operation is faster.

Details:

I’ve found the fastest way to do this is to take the mmap’ed memory at initialization and do this on each of the input and output buffers:

mmap_ptr = mmap(...) + some_offset;
cudaHostRegister(mmap_ptr, size, cudaHostRegisterMapped | cudaHostRegisterPortable);
cudaHostGetDevicePointer(&device_ptr, mmap_ptr, 0);

And then I call my kernel with the device_ptr:

fourByFourMultiply<<<(n+255)/256,256>>>(four_by_four_device_ptr, input_device_ptr, output_device_ptr);
cudaDeviceSynchronize();

Including a thrust::copy into four_by_four_device_ptr, this takes 3.8 ms.
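For reference, the hand-coded kernel is along the lines of the sketch below (one thread per column; the column-major 4xN layout and the explicit n parameter are assumptions here, since the launch above bakes the count into the grid size only):

#include <cuda_runtime.h>

// Sketch: out[:,i] = M (4x4) * in[:,i], one thread per column i.
// Assumes column-major buffers; element (r, c) of M is M[r + 4 * c].
__global__ void fourByFourMultiply(const float* M, const float* in,
                                   float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x0 = in[4 * i],     x1 = in[4 * i + 1],
          x2 = in[4 * i + 2], x3 = in[4 * i + 3];
    for (int r = 0; r < 4; ++r) {
        out[4 * i + r] = M[r] * x0 + M[r + 4] * x1
                       + M[r + 8] * x2 + M[r + 12] * x3;
    }
}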

However, if I do an explicit copy from my mmap_ptr into a thrust::device_vector (i.e., memory allocated on the device with CUDA), the total time is higher, but the time to run fourByFourMultiply itself is only 1 ms.
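Concretely, the staged version looks something like this (a sketch; mmap_in, mmap_out, and four_by_four are stand-ins for the float pointers into the mmap’ed region):

#include <thrust/device_vector.h>
#include <thrust/copy.h>

void runStaged(const float* four_by_four, const float* mmap_in,
               float* mmap_out, int n) {
    // Stage the mmap'ed buffers into ordinary device memory.
    thrust::device_vector<float> d_M(four_by_four, four_by_four + 16);
    thrust::device_vector<float> d_in(mmap_in, mmap_in + 4 * n);
    thrust::device_vector<float> d_out(4 * n);

    // The kernel itself now runs in ~1 ms against device memory.
    fourByFourMultiply<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(d_M.data()),
        thrust::raw_pointer_cast(d_in.data()),
        thrust::raw_pointer_cast(d_out.data()), n);

    // Copying the result back out adds to the total time.
    thrust::copy(d_out.begin(), d_out.end(), mmap_out);
}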

(Running cublasSgemm is considerably slower, about 3.1 ms per call; I gather it’s not well optimized for oblong matrices like these.)
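For reference, the cuBLAS call for this shape is roughly the following, assuming d_M, d_in, and d_out are raw device pointers to the same column-major buffers as above:

#include <cublas_v2.h>

// C(4 x n) = A(4 x 4) * B(4 x n); cuBLAS is column-major, so the
// leading dimension of every matrix here is 4.
cublasHandle_t handle;
cublasCreate(&handle);
const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            4, n, 4,
            &alpha, d_M, 4,
            d_in, 4,
            &beta, d_out, 4);
cudaDeviceSynchronize();
cublasDestroy(handle);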

So clearly a lot of the time is going to some hidden memory transfer under the hood.

Is there a better way to tackle this? I’m surprised that such a straightforward GPU task is not considerably faster than the CPU (once overhead is counted). Are there any tweaks I can try?

Hi,

Have you maximized the device performance with the commands below?

$ sudo nvpmodel -m 0
$ jetson_clocks

We tested the GPU matrix calculation with our CUDA matrixMul sample.
It takes around 1.1 ms for a 32x4 matrix multiplied by a 4x160000 matrix.

$ ./matrixMul -hA=32 -wA=4 -hB=4 -wB=160000
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Xavier" with compute capability 7.2

MatrixA(4,32), MatrixB(160000,4)
Computing result using CUDA Kernel...
done
Performance= 36.74 GFlop/s, Time= 1.115 msec, Size= 40960000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Thanks.

Yes, I’m running in MAXN, which I believe is mode 0.

The time you report is similar to what I get, but I still find the CPU to be of comparable speed, which is surprising.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic.
If you need further support, please open a new one.
Thanks

Hi,

It’s expected that matrix multiplication should run much faster on the GPU.

We want to compare the CPU/GPU performance in our environment as well.
Would you mind sharing your testing code with us so we can give it a check?

Thanks.