Actually, I need to sum matrices and compute their average.
The data come from a camera (6004×7920 pixels, 8 bits per pixel). Because of the amount of data, the GPU is the most efficient way to compute the sum of 5 of those images.
For that, I use the vectorAdd example. I am very happy with the computation time: about 0.08 ms, against 700 ms on the CPU (sequential).
My problem is the time it takes to transfer the data.
I copy 5 vectors of 45 MB each from the CPU to the GPU, about 225 MB in total, and I measure a transfer speed of 2 GB/s.
Then I copy 1 vector of 90 MB (unsigned short), and I measure a transfer speed of 1 GB/s.
Can you confirm these numbers?
Moreover, is there a way to increase the transfer speed by using another kind of memory?
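As a minimal sketch (not from the original posts) of how pageable and pinned host-memory bandwidth can be compared with cudaMemcpy and CUDA events; the buffer size and variable names are illustrative:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t size = 6004UL * 7920UL;                 // one 8-bit image, ~45 MB
    unsigned char *h_pageable = new unsigned char[size]; // ordinary (pageable) host memory
    unsigned char *h_pinned = NULL;
    unsigned char *d_buf = NULL;
    cudaMallocHost((void **)&h_pinned, size);            // page-locked (pinned) host memory
    cudaMalloc((void **)&d_buf, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    cudaEventRecord(start);                              // host-to-device copy from pageable memory
    cudaMemcpy(d_buf, h_pageable, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable H2D: %.3f ms, %.2f GB/s\n", ms, size / ms / 1e6);

    cudaEventRecord(start);                              // host-to-device copy from pinned memory
    cudaMemcpy(d_buf, h_pinned, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned H2D:   %.3f ms, %.2f GB/s\n", ms, size / ms / 1e6);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    delete[] h_pageable;
    return 0;
}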
Thank you for your answer, I changed the way I allocate memory.
I have a question.
I am now using cudaMallocHost to use PINNED memory.
The transfer from host to device does not change, it is still about 2.2 GB/s. That seems reasonable.
On the other hand, the transfer from device to host changes a lot!
I handle an image of 7920×6004 in unsigned short, i.e. 90.7 MB.
Without PINNED memory, the time to download the image is about 90 ms, ~1 GB/s.
With PINNED memory, that time is about 0.3 ms, ~300 GB/s.
The final result is correct, but I do not understand that speed.
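A sketch (my assumption, not part of the thread) of how the device-to-host copy could be timed so that earlier asynchronous work does not distort the measurement; h_img, d_img and numBytes are hypothetical names:

cudaDeviceSynchronize();                          // make sure all prior GPU work has finished

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(h_img, d_img, numBytes, cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                       // wait until the copy is really complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("D2H: %.3f ms, %.2f GB/s\n", ms, numBytes / ms / 1e6);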
PINNED memory is one kind of zero-copy memory.
You don't need to manually copy the buffer from the CPU to the GPU.
The buffer can be accessed directly by both the CPU and the GPU.
Just get the CPU/GPU buffer pointers and add a synchronize call before accessing the data.
Here is a document for your reference:
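A minimal sketch of the mapped (zero-copy) pattern described above, under the assumption that cudaHostAllocMapped is used; the kernel and sizes are placeholders:

#include <cuda_runtime.h>

__global__ void scale(unsigned char *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2;                                  // placeholder work
}

int main()
{
    const int n = 6004 * 7920;
    cudaSetDeviceFlags(cudaDeviceMapHost);                   // must be set before any allocation

    unsigned char *h_buf = NULL;
    unsigned char *d_buf = NULL;
    cudaHostAlloc((void **)&h_buf, n, cudaHostAllocMapped);  // mapped pinned host memory
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);     // GPU view of the same buffer

    // ... fill h_buf on the CPU ...

    scale<<<(n + 255) / 256, 256>>>(d_buf, n);               // no cudaMemcpy needed
    cudaDeviceSynchronize();                                 // synchronize before the CPU reads h_buf
    cudaFreeHost(h_buf);
    return 0;
}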
With that method, I have a problem copying elements. In the following code, I just try to make a copy of an image using the GPU, and I run into the following difficulty.
The image that I handle is an array of unsigned char, provided by OpenCV (see the code).
Actually, that is just a way to simulate the amount of data provided by a camera.
With the TX2, I should be able to handle all the data contained in RAM, as the CPU and GPU share it. For that, we have to allocate memory and handle the information through pointers.
cudaSetDeviceFlags(cudaDeviceMapHost);
int height = 6004;
int width = 7920;
int NumElement = height * width;
unsigned char *img1 = NULL;     // host pointers
unsigned short *imgf = NULL;
unsigned char *img1_d = NULL;   // device pointers
unsigned short *imgf_d = NULL;
If I use: img1 = src1.data
the CUDA function does not work.
But if I use: for(int i=0 ; i<NumElement ; i++) img1[i] = (src1.data)[i] ;
the CUDA function works well, but it takes a long time to copy the elements.
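As a side note (my suggestion, not from the thread), the element-by-element loop can be replaced by a single bulk copy into the mapped buffer; this assumes src1 is a continuous 8-bit single-channel cv::Mat and requires <cstring>:

memcpy(img1, src1.data, NumElement * sizeof(unsigned char));   // copy by value into the mapped buffer in one call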
Thank you for your answer.
That method works perfectly with a CPU function. The problem does not come from src1.data.
Here I have the same problem without using OpenCV:
[...]
unsigned char *h_A = NULL;
cudaHostAlloc((void **)&h_A, size_uchar, cudaHostAllocMapped);
std::cout << "&h_A v1 = " << &h_A << std::endl;

test = new unsigned char[height*width];
for (int i = 0; i < numElements; ++i)
{
    test[i] = 10;
}

h_A = test;   // reassigns the pointer: h_A no longer points to the mapped buffer
std::cout << "&h_A v2 = " << &h_A << std::endl;
[...]
I obtain:
&h_A v1 = 0x7fede98ee8
&h_A v2 = 0x7fede98ee8
Note that if I print any value of h_A, I obtain 10 here. But I cannot launch the kernel… As I said, I can only launch the kernel if I fill h_A using this method:
for (int i = 0; i < numElements; ++i)
{
    h_A[i] = 10;
}
src1 is allocated with cv::Mat, which is a general CPU buffer.
Once you assign the buffer address with img1 = src1.data, the buffer img1 points to changes from mapped memory to a general CPU buffer that cannot be accessed by the GPU.
But if you copy the buffer by value, the buffer type does not change, only the values do.
You can double-check whether this is the cause of the issue.
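A short sketch (my illustration, following the variable names from the snippets above) of the difference between reassigning the pointer and copying by value; memcpy requires <cstring>:

// Pointer reassignment: h_A now points at the ordinary new[] allocation, and the
// mapped buffer returned by cudaHostAlloc is lost, so the GPU can no longer reach the data.
h_A = test;

// Copy by value into the buffer returned by cudaHostAlloc: the mapped buffer keeps
// its type, so both CPU and GPU can still access it.
memcpy(h_A, test, numElements * sizeof(unsigned char));

// Note that &h_A prints the address of the pointer variable itself, which is why
// "v1" and "v2" were identical above; printing (void *)h_A would show the buffer
// address actually changing after the reassignment.
std::cout << "h_A = " << (void *)h_A << std::endl;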