I’m developing an application that passes data between threads and processes using mmap. Some threads do CPU processing, and I’d like some that use CUDA, all targeted for AGX Jetson.
I’ve been trying different experiments, and I’m surprised to find that the CUDA implmentations are surprisingly slow, even for things like matrix multiplication. For example a (4x4) times (4x130000) matrix multiplication with Eigen takes about 3.5 ms on CPU, but using cublasSgemm it takes about 8.7 ms. If I hand-code the CUDA multiplication, that comes down to 3.7 ms, no faster than CPU.
This is NOT true on an x86 desktop, where the CUDA operation is faster.
Details:
I’ve found the fastest way to do this is to take the mmap’ed memory at initialization and do this on each of the input and output buffers:
mmap_ptr = mmap(...) + some_offset;
cudaHostRegister(mmap_ptr, size, cudaHostRegisterMapped | cudaHostRegisterPortable );
cudaHostGetDevicePointer(&device_ptr, mmap_ptr, 0);
And then calling my code with the device_ptr.
fourByFourMultiply<<<(n+255)/256,256>>>(four_by_four_device_ptr, input_device_ptr, output_device_ptr);
cudaDeviceSynchronize();
Including a thrust::copy into the four_by_four_device_ptr , this takes 3.8 ms.
However, if I do an explict copy from my mmap_ptr to a thrust_vector (i.e. allocated into cuda), and copy the buffer in, the total time is more, but the time to run the fourByFourMultiply is only 1ms.
(Running cublasSgemm is considerably slower, about 3.1ms per call, because I gather it’s not well optimized to oblong matricies.)
So, clearly a bunch of time is being taken by some kind of hidden memory transfer happening under the hood.
Is there a better way to tackle this? I’m surprised that a straightfoward task like this for GPU is not considerably faster than CPU (once overhead is counted). Or any tweaks I can try?