Mmap shared memory / CUDA calculations are slow

I’m developing an application that passes data between threads and processes using mmap. Some threads do CPU processing, and I’d like some of them to use CUDA. Everything is targeted at the Jetson AGX.

I’ve been trying different experiments, and I’m surprised to find that the CUDA implementations are slow, even for things like matrix multiplication. For example, a (4x4) times (4x130000) matrix multiplication with Eigen takes about 3.5 ms on CPU, but using cublasSgemm it takes about 8.7 ms. If I hand-code the CUDA multiplication, that comes down to 3.7 ms, which is no faster than the CPU.

This is NOT true on an x86 desktop, where the CUDA operation is faster.


I’ve found the fastest way to do this is to take the mmap’ed memory at initialization and do this on each of the input and output buffers:

mmap_ptr = mmap(...) + some_offset;
cudaHostRegister(mmap_ptr, size, cudaHostRegisterMapped | cudaHostRegisterPortable);
cudaHostGetDevicePointer(&device_ptr, mmap_ptr, 0);

And then calling my code with the device_ptr.

fourByFourMultiply<<<(n+255)/256,256>>>(four_by_four_device_ptr, input_device_ptr, output_device_ptr);

Including a thrust::copy into four_by_four_device_ptr, this takes 3.8 ms.
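The thread doesn’t show the kernel itself, so here is a minimal sketch of what a kernel with the shape of fourByFourMultiply might look like, matching the launch configuration above. The column count n, the column-major layout, and the argument order are all assumptions on my part:

```cuda
// Hypothetical sketch: C = A * B, where A is 4x4, B and C are 4xN.
// Column-major storage assumed; one thread computes one output column.
__global__ void fourByFourMultiply(const float* A, const float* B,
                                   float* C, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;

    // Load the 4-element input column once into registers.
    float b0 = B[4 * col + 0];
    float b1 = B[4 * col + 1];
    float b2 = B[4 * col + 2];
    float b3 = B[4 * col + 3];

    // C(:,col) = A * B(:,col)
    for (int row = 0; row < 4; ++row) {
        C[4 * col + row] = A[row + 0]  * b0
                         + A[row + 4]  * b1
                         + A[row + 8]  * b2
                         + A[row + 12] * b3;
    }
}
```

With this shape, each thread does only 16 multiply-adds per 32 bytes read, so the kernel is memory-bound; whether those reads hit cached device DRAM or uncached zero-copy host memory dominates the runtime.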

However, if I do an explicit copy from my mmap_ptr into a thrust::device_vector (i.e. memory allocated on the CUDA device) and run the kernel on that, the total time is higher, but the time to run fourByFourMultiply itself is only 1 ms.
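For reference, a sketch of that explicit-staging variant, assuming kernel and buffer names as above (all hypothetical): copy from the mmap’ed region into device memory, run the kernel on raw device pointers, then copy the result back out.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// n columns of 4 floats each; mmap_in / mmap_out point into the mmap'ed region.
void multiplyStaged(const float* four_by_four, const float* mmap_in,
                    float* mmap_out, int n)
{
    // Stage inputs in device memory so the kernel reads cached device DRAM
    // instead of uncached zero-copy host memory.
    thrust::device_vector<float> d_A(four_by_four, four_by_four + 16);
    thrust::device_vector<float> d_in(mmap_in, mmap_in + 4 * n);
    thrust::device_vector<float> d_out(4 * n);

    fourByFourMultiply<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(d_A.data()),
        thrust::raw_pointer_cast(d_in.data()),
        thrust::raw_pointer_cast(d_out.data()),
        n);

    // Copy the result back into the shared mmap'ed buffer.
    thrust::copy(d_out.begin(), d_out.end(), mmap_out);
}
```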

(Running cublasSgemm is considerably slower, about 3.1 ms per call, because I gather it’s not well optimized for oblong matrices.)

So, clearly a bunch of time is being taken by some kind of hidden memory transfer happening under the hood.

Is there a better way to tackle this? I’m surprised that a straightforward GPU task like this is not considerably faster than the CPU once overhead is counted. Are there any tweaks I can try?
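One way to confirm where the time goes is to time the host-to-device copy and the kernel separately with CUDA events (a sketch; the buffer names d_A, d_in, d_out and mmap_ptr are assumptions carried over from above):

```cuda
// Assumes d_A, d_in, d_out are device pointers and mmap_ptr is the
// registered host buffer with 4*n input floats.
cudaEvent_t start, afterCopy, afterKernel;
cudaEventCreate(&start);
cudaEventCreate(&afterCopy);
cudaEventCreate(&afterKernel);

cudaEventRecord(start);
cudaMemcpy(d_in, mmap_ptr, 4 * n * sizeof(float), cudaMemcpyHostToDevice);
cudaEventRecord(afterCopy);
fourByFourMultiply<<<(n + 255) / 256, 256>>>(d_A, d_in, d_out, n);
cudaEventRecord(afterKernel);
cudaEventSynchronize(afterKernel);

float copyMs = 0.f, kernelMs = 0.f;
cudaEventElapsedTime(&copyMs, start, afterCopy);
cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel);
printf("copy %.3f ms, kernel %.3f ms\n", copyMs, kernelMs);
```

If the kernel time alone matches the ~1 ms figure, the remaining cost is transfer overhead rather than compute.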


Have you maximized the device performance like below?

$ sudo nvpmodel -m 0
$ jetson_clocks

We test GPU matrix calculation in our CUDA sample.
It takes around 1.1 ms to multiply a 32x4 matrix by a 4x160000 matrix.

$ ./matrixMul -hA=32 -wA=4 -hB=4 -wB=160000
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Xavier" with compute capability 7.2

MatrixA(4,32), MatrixB(160000,4)
Computing result using CUDA Kernel...
Performance= 36.74 GFlop/s, Time= 1.115 msec, Size= 40960000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


Yes, I’m running in MAXN, which I believe is mode 0.

The time you record is similar to what I get, but I still find that the CPU runs at a similar speed, which is surprising.

There has been no update from you for a period, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.


It’s expected that matrix multiplication should run much faster on GPU.

We want to compare the CPU/GPU performance in our environment as well.
Would you mind sharing your testing code with us so we can give it a check?