GPU data speed


Currently, I need to sum matrices and calculate their average.
The data come from a camera (6004*7920 pixels, 8 bits). Because of the amount of data, the GPU is the most efficient way to obtain the sum of 5 of those images.

For that, I use the vectorAdd example. I’m very happy with the computation time: about 0.08 ms, against 700 ms on the CPU (sequential).
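A sequential CPU version of this operation might look like the following sketch. It is only meant as a reference for checking GPU results, not as the code actually used in this thread; the names `sum_images` and `average_images` are hypothetical, and it assumes all images have the same number of pixels. Five 8-bit values sum to at most 1275, so a 16-bit accumulator is enough.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Element-wise sum of several 8-bit images into one 16-bit image.
// Assumes every image has the same number of pixels and that there are at
// most 257 images (257 * 255 = 65535 still fits in 16 bits).
std::vector<std::uint16_t> sum_images(
        const std::vector<std::vector<std::uint8_t>>& images) {
    if (images.empty()) return {};
    std::vector<std::uint16_t> sum(images[0].size(), 0);
    for (const auto& img : images) {
        for (std::size_t i = 0; i < sum.size(); ++i) {
            sum[i] = static_cast<std::uint16_t>(sum[i] + img[i]);
        }
    }
    return sum;
}

// Per-pixel average from the summed image (integer division, rounds down).
// `count` must be non-zero.
std::vector<std::uint8_t> average_images(
        const std::vector<std::uint16_t>& sum, std::size_t count) {
    std::vector<std::uint8_t> avg(sum.size());
    for (std::size_t i = 0; i < sum.size(); ++i) {
        avg[i] = static_cast<std::uint8_t>(sum[i] / count);
    }
    return avg;
}
```

Comparing a few pixels of the GPU output against `sum_images` is a cheap way to confirm that a very short kernel time still corresponds to a correct result.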

My problem is the data transfer time.
I copy 5 vectors of 45 MB each (about 225 MB in total) from the CPU to the GPU, and I measure a transfer rate of 2 GB/s.
Then I copy 1 vector of 90 MB (unsigned short), and I measure a transfer rate of 1 GB/s.
Can you confirm these figures?
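The sizes and rates quoted here can be reproduced with a little arithmetic. The sketch below uses two hypothetical helpers, `image_bytes` and `bandwidth_gb_per_s`, and assumes 1 GB = 10^9 bytes for the rate; the 90 ms duration used in the usage note is the download time reported later in this thread.

```cpp
#include <cstddef>

// Size in bytes of one image.
constexpr std::size_t image_bytes(std::size_t width, std::size_t height,
                                  std::size_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

// Effective bandwidth in GB/s (assuming 1 GB = 1e9 bytes) for `bytes`
// copied in `ms` milliseconds.
constexpr double bandwidth_gb_per_s(std::size_t bytes, double ms) {
    return static_cast<double>(bytes) / (ms * 1.0e-3) / 1.0e9;
}
```

`image_bytes(7920, 6004, 1)` is 47551680 bytes (about 45.3 when divided by 2^20), five such images are 237758400 bytes, and one unsigned short image is 95103360 bytes. `bandwidth_gb_per_s(95103360, 90.0)` gives roughly 1.06, consistent with the rate of about 1 quoted for the 16-bit vector.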

Moreover, is there a way to increase the transfer rate by using another kind of memory?

Thank you for your help !


There are some memory copy / bandwidth samples in our CUDA samples.


For example, if you use unified memory:

nvidia@tx2:/usr/local/cuda-10.0/samples/1_Utilities/UnifiedMemoryPerf$ ./UnifiedMemoryPerf
GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2

Running ........................................................

Overall Time For matrixMultiplyPerf

Printing Average of 20 measurements in (ms)
Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs
4         0.519   1.045   0.529   0.169   0.246   0.217   0.324   0.259
16        0.538   1.185   0.533   0.508   0.351   0.362   1.087   0.550
64        0.761   1.527   0.969   0.976   0.792   0.653   1.036   0.875
256       1.662   2.529   2.759   2.568   1.707   1.646   2.622   2.481
1024      6.429   7.329  11.007  10.797   7.621   7.209  11.212  10.937
4096     36.319  37.589  54.768  54.167  39.362  39.229  54.683  54.112
16384   233.387 234.616 295.734 328.141 249.247 250.449 317.939 312.358

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.



Thank you for your answer. I changed the way I allocate memory.
I have a question.
I am now using cudaMallocHost to get PINNED memory.
The host-to-device transfer rate doesn’t change; it is still about 2.2 GB/s, which seems reasonable.
On the other hand, the device-to-host transfer changes a lot!

I handle an image of 7920*6004 in unsigned short, i.e. 90.7 MB.
Without PINNED memory, the time to download the image is about 90 ms, ~1 GB/s.
With PINNED memory, that time is about 0.3 ms, ~300 GB/s.

The final result is right, but I do not understand that speed.

cudaMemcpy(h_IMGF, d_IMGF, size2, cudaMemcpyDeviceToHost);

with size2=95103360 bytes.

I ran the bandwidth sample, and I got the following result:
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: NVIDIA Tegra X2
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 17453.5

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 17439.1

Thank you for your help !
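As a rough plausibility check, the bandwidth sample's figure can be turned into an expected duration for the 95103360-byte copy. This is a minimal sketch with a hypothetical helper, `expected_copy_ms`, and it assumes the sample reports MB as 2^20 bytes (with 10^6 bytes the result changes by about 5%).

```cpp
#include <cstddef>

// Time in milliseconds needed to copy `bytes` at `mb_per_s`.
// Assumption: 1 MB = 2^20 bytes in the reported bandwidth.
constexpr double expected_copy_ms(std::size_t bytes, double mb_per_s) {
    return static_cast<double>(bytes) / (mb_per_s * 1048576.0) * 1.0e3;
}
```

With the device-to-host figure of 17439.1 MB/s, `expected_copy_ms(95103360, 17439.1)` is about 5.2 ms. That is well above the 0.3 ms measured, which suggests the 0.3 ms timing may not cover the whole transfer.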


PINNED memory is one kind of zero-copy memory.
You don’t need to manually copy the buffer from the CPU to the GPU.
The buffer can be accessed directly by both the CPU and the GPU.

Just get the CPU/GPU buffer pointers and add a synchronize call before accessing the buffer.
Here is a document for your reference:


Hi AastaLLL,

Thank you for the pointer. I am now using it.

With that method, I have a problem copying elements. In the following code, I just try to make a copy of an image using the GPU, and I run into the following difficulty.

The image that I handle is an array of unsigned char, provided by OpenCV (see the code).
This is just a way to simulate the amount of data provided by a camera.
With the TX2, I should be able to handle all the data contained in RAM, since the CPU and GPU share it. For that, we have to allocate memory and handle the data through pointers.

int height = 6004 ;
int width = 7920 ;
int NumElement = height*width ;
unsigned char *img1 = NULL ;
unsigned short *imgf = NULL ;
unsigned char *img1_d = NULL ;
unsigned short *imgf_d = NULL ;

cudaHostAlloc((void **)&img1, NumElement * sizeof(unsigned char), cudaHostAllocMapped);
cudaHostAlloc((void **)&imgf, NumElement * sizeof(unsigned short), cudaHostAllocMapped);
Mat src1 = imread("/home/jetson/Downloads/006_Cylinderhead_ABS15_0_0001.tif", 0) ;
src1 = src1.reshape(0,1) ;
img1 = ;

If I use:
img1 =
the CUDA function doesn’t work.

But if I use:
for(int i=0 ; i<NumElement ; i++)
img1[i] = ([i] ;
the CUDA function works well, but it takes a long time to copy the elements.

Do you have any suggestions?

Thank you !!


Sorry for the late reply.

Could you try printing the pointer value of img1?
My guess is that it is not updated correctly.
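When printing, note that `&img1` is the address of the pointer variable itself and stays the same whatever the pointer is assigned to, while `img1` is the buffer address it holds. A minimal host-only sketch with hypothetical names, using `std::malloc` as a stand-in for the mapped allocation (no CUDA call is made here):

```cpp
#include <cstdlib>

// Result of reassigning a pointer: did the variable's own address change,
// and did the buffer address it holds change?
struct PointerCheck {
    bool variable_address_changed;  // compares &p before and after
    bool buffer_address_changed;    // compares p before and after
};

PointerCheck reassign_and_compare() {
    // Stand-in for the mapped allocation (assumption for illustration).
    unsigned char *p = static_cast<unsigned char *>(std::malloc(16));
    unsigned char *original = p;
    void *var_before = static_cast<void *>(&p);
    void *buf_before = static_cast<void *>(p);

    unsigned char *other = static_cast<unsigned char *>(std::malloc(16));
    p = other;  // pointer assignment: p now refers to a different buffer

    PointerCheck r;
    r.variable_address_changed = (static_cast<void *>(&p) != var_before);
    r.buffer_address_changed = (static_cast<void *>(p) != buf_before);

    std::free(other);
    std::free(original);  // would leak if no copy of the old pointer were kept
    return r;
}
```

To print the buffer address with `std::cout`, cast it first, for example `static_cast<void *>(img1)`; streaming an `unsigned char *` directly is treated as a C string.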



Thank you for your answer.
That method works perfectly with a CPU function. The problem is not from

Here, I have the same problem without using OpenCV:

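unsigned char *h_A = NULL ;
cudaHostAlloc((void **)&h_A, size_uchar, cudaHostAllocMapped) ;
// &h_A is the address of the pointer variable, not of the buffer it points to
std::cout << "&h_A v1 = " << &h_A << std::endl ;
unsigned char *test = new unsigned char[height*width] ;
for (int i = 0; i < numElements; ++i)
    test[i] = 10 ;
// h_A now points to the new[] buffer instead of the mapped allocation
h_A = test ;
std::cout << "&h_A v2 = " << &h_A << std::endl ;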

I obtain :

&h_A v1 = 0x7fede98ee8
&h_A v2 = 0x7fede98ee8 

Note that if I print any value of h_A, I obtain 10 here. But I can’t launch the kernel… As I said, I can only launch the kernel if I fill h_A this way:

for (int i = 0; i < numElements; ++i)
    h_A[i] = 10 ;

Thank you for your help !


We found a possible issue.

The src1 is allocated with cv::Mat, which is a general CPU buffer.

Once you assign the buffer address from img1 =
The pointer img1 no longer refers to the mapped memory but to a general CPU buffer that cannot be accessed by the GPU.
If you copy the buffer by value instead, only the contents change; img1 still refers to the mapped memory.

You can double-check whether this is the cause of the issue.
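A minimal host-only sketch of copying by value: the data are copied into the existing allocation with a single `std::memcpy` call instead of an element-by-element loop, so the destination pointer keeps referring to the same buffer. The function name `copy_into_buffer` is hypothetical, and in this sketch a plain host buffer stands in for the mapped allocation and a `std::vector` stands in for the image source.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copies `src` into the already-allocated buffer `dst` (capacity `dst_bytes`).
// Returns false without copying if `dst` is null or the source does not fit.
// `dst` itself is never reassigned, so it still refers to the original
// allocation after the call.
bool copy_into_buffer(unsigned char *dst, std::size_t dst_bytes,
                      const std::vector<unsigned char>& src) {
    if (dst == nullptr || src.size() > dst_bytes) return false;
    if (src.empty()) return true;  // nothing to copy
    std::memcpy(dst, src.data(), src.size());
    return true;
}
```

With the real mapped buffer, the same call shape would apply with the pointer returned by cudaHostAlloc as `dst`. Whether this is fast enough for the full 7920*6004 image would need to be measured on the TX2.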