GPU data speed

Hello,

I need to sum matrices and calculate their average.
The data come from a camera (6004*7920 pixels, 8-bit). Because of the amount of data, the GPU is the most efficient way to obtain the sum of 5 of those images.

For that, I started from the vectorAdd sample. I'm very happy with the computation time: about 0.08 ms, against 700 ms on the CPU (sequential).

My problem is the data transfer time.
I copy 5 vectors of 45 MB each from the CPU to the GPU, about 225 MB in total, and I calculate a transfer rate of 2 GB/s.
Then I copy 1 vector of 90 MB (unsigned short), and I calculate a transfer rate of 1 GB/s.
Can you confirm these figures?

Moreover, is there a way to increase the transfer rate, for example by using another kind of memory?

Thank you for your help !

Hi,

There are some memory copy / bandwidth samples in the CUDA samples:

/usr/local/cuda-10.0/samples/1_Utilities/ 

For example, if you use unified memory:

nvidia@tx2:/usr/local/cuda-10.0/samples/1_Utilities/UnifiedMemoryPerf$ ./UnifiedMemoryPerf
GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2

Running ........................................................

Overall Time For matrixMultiplyPerf

Printing Average of 20 measurements in (ms)
Size_KB  UMhint UMhntAs  UMeasy   0Copy MemCopy CpAsync CpHpglk CpPglAs
4         0.519   1.045   0.529   0.169   0.246   0.217   0.324   0.259
16        0.538   1.185   0.533   0.508   0.351   0.362   1.087   0.550
64        0.761   1.527   0.969   0.976   0.792   0.653   1.036   0.875
256       1.662   2.529   2.759   2.568   1.707   1.646   2.622   2.481
1024      6.429   7.329  11.007  10.797   7.621   7.209  11.212  10.937
4096     36.319  37.589  54.768  54.167  39.362  39.229  54.683  54.112
16384   233.387 234.616 295.734 328.141 249.247 250.449 317.939 312.358

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
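If you want to measure your own transfer times precisely, a common pattern is to time the copy with CUDA events. A minimal sketch (compile with nvcc on a CUDA device; error checking omitted):

```cpp
// Sketch: timing a host-to-device copy with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 6004UL * 7920UL * sizeof(unsigned short);
    unsigned short *h = NULL;
    unsigned short *d = NULL;
    cudaMallocHost((void **)&h, bytes);   // pinned host buffer
    cudaMalloc((void **)&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%zu bytes in %.3f ms -> %.2f GB/s\n",
           bytes, ms, (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```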

Thanks.

Hello,

Thank you for your answer; I changed the way I allocate memory.
I have a question.
I'm now using cudaMallocHost to get pinned memory.
The host-to-device transfer doesn't change: it is still about 2.2 GB/s, which seems reasonable.
On the other hand, the device-to-host transfer changes a lot!

I handle an image of 7920*6004 in unsigned short, i.e. 90.7 MB.
Without pinned memory, the time to download the image is about 90 ms (~1 GB/s).
With pinned memory, that time is about 0.3 ms (~300 GB/s).

The final result is correct, but I do not understand that speed.

cudaMemcpy(h_IMGF, d_IMGF, size2, cudaMemcpyDeviceToHost);

with size2 = 95103360 bytes.

I ran the bandwidth sample and got the following result:
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: NVIDIA Tegra X2
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 17453.5

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 17439.1

Thank you for your help !

Hi,

Mapped pinned memory is one kind of zero-copy memory.
You don't need to manually copy the buffer from CPU to GPU; the buffer can be accessed directly by both the CPU and the GPU.

Just get the CPU/GPU buffer pointers and add a synchronize call before accessing the data.
Here is a document for your reference:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#page-locked-host-memory
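Concretely, the mapped/zero-copy pattern from that section looks roughly like this; a sketch only (compile with nvcc, error checking omitted, the kernel is just a placeholder example):

```cpp
// Sketch: mapped (zero-copy) host memory -- CPU and GPU touch the
// same physical buffer, so no cudaMemcpy is needed.
#include <cuda_runtime.h>

__global__ void addOne(unsigned char *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // before any CUDA work

    const int n = 6004 * 7920;
    unsigned char *h = NULL;   // host-side pointer
    unsigned char *d = NULL;   // device-side alias of the same buffer
    cudaHostAlloc((void **)&h, n * sizeof(unsigned char), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d, h, 0);

    for (int i = 0; i < n; ++i) h[i] = 0;       // write on the CPU
    addOne<<<(n + 255) / 256, 256>>>(d, n);     // read/write on the GPU
    cudaDeviceSynchronize();                    // sync before the CPU reads

    cudaFreeHost(h);
    return 0;
}
```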

Thanks.

Hi AastaLLL,

Thank you for the indication; I am using that approach now.

With that method, I have a problem copying elements. In the following code, I just try to make a copy of an image using the GPU, and I run into the following difficulty.

The image I handle is an array of unsigned char provided by OpenCV (see the code).
Actually, this is just a way to simulate the amount of data provided by a camera.
On the TX2, I should be able to handle all the data contained in RAM, since the CPU and GPU share it. For that, we have to allocate the memory and access the information through pointers.

cudaSetDeviceFlags(cudaDeviceMapHost);
int height = 6004 ;
int width = 7920 ;
int NumElement = height*width ;
unsigned char *img1 = NULL ;
unsigned short *imgf = NULL ;
unsigned char *img1_d = NULL ;
unsigned short *imgf_d = NULL ;

cudaHostAlloc((void **)&img1, NumElement*sizeof(unsigned char), cudaHostAllocMapped);
cudaHostAlloc((void **)&imgf, NumElement*sizeof(unsigned short), cudaHostAllocMapped);
Mat src1 = imread("/home/jetson/Downloads/006_Cylinderhead_ABS15_0_0001.tif", 0) ;
src1 = src1.reshape(0,1) ;
cudaDeviceSynchronize();
img1 = src1.data ;

If I use:
img1 = src1.data;
the CUDA function doesn't work.

But if I use:
for (int i = 0; i < NumElement; i++)
    img1[i] = (src1.data)[i];
the CUDA function works well, but it takes a long time to copy the elements.

Do you have a suggestion?

Thank you !!

Hi,

Sorry for the late reply.

Could you try printing the value of the pointer img1?
We suspect it is not being updated to src1.data correctly.

Thanks.

Hi,

Thank you for your answer.
That method works perfectly with a CPU function, so the problem does not come from src1.data.

Here I have the same problem, without using OpenCV:

[...]
unsigned char *h_A = NULL ;
cudaHostAlloc((void **)&h_A,  size_uchar,  cudaHostAllocMapped) ;
std::cout << "&h_A v1 = " << &h_A << std::endl ;
unsigned char *test = new unsigned char[height*width] ;
for (int i = 0; i < numElements; ++i)
{
    	test[i] = 10 ;
}
h_A = test ;
std::cout << "&h_A v2 = " << &h_A << std::endl ;
[...]

I obtain :

&h_A v1 = 0x7fede98ee8
&h_A v2 = 0x7fede98ee8 

Note that if I print any value of h_A, I obtain 10 here. But I can't launch the kernel… As I said, I can only launch the kernel if I fill h_A using this method:

for (int i = 0; i < numElements; ++i)
{
    h_A[i] = 10 ;
}

Thank you for your help !

Hi,

We found a possible issue.

src1 is allocated by cv::Mat, which is a general CPU buffer.

Once you assign the buffer address with img1 = src1.data, img1 changes from pointing at mapped memory to pointing at a general CPU buffer that cannot be accessed by the GPU.
But if you copy the buffer by value, the buffer type doesn't change; only the values are copied.

You can double-check whether this is the cause of the issue.
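If that is the cause, here is a sketch of two ways to get the pixels into the mapped buffer without reassigning img1 (assuming the standard cv::Mat API: the external-data Mat constructor and copyTo; error checking omitted):

```cpp
// Sketch: bring the OpenCV pixels to the mapped buffer without
// reassigning the mapped pointer.
#include <cstring>
#include <opencv2/opencv.hpp>

// Option 1: one bulk copy into the pinned/mapped buffer --
// much faster than the element-by-element loop.
void copyIntoMapped(const cv::Mat &src, unsigned char *mapped) {
    CV_Assert(src.isContinuous() && src.type() == CV_8UC1);
    std::memcpy(mapped, src.data, src.total());
}

// Option 2: wrap the mapped buffer in a Mat header (no allocation),
// then let OpenCV copy into it; img1 itself is never reassigned:
//   cv::Mat wrapper(src1.rows, src1.cols, CV_8UC1, img1);
//   src1.copyTo(wrapper);
```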

Thanks.