Hi everybody!
I'm an Italian student, new to CUDA and graphics boards. I'm doing some studies on GPUs, in particular on the latency of different memory types and ways of storing data.
I randomly generate a set of N numbers, then launch a simple kernel that computes a dot product.
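Using cudaEventRecord, I'm measuring the latency of storing the data in pageable memory (host malloc, copied into a cudaMalloc device buffer), page-locked or pinned memory (cudaMallocHost) and zero-copy memory (mapped host memory accessed through cudaHostGetDevicePointer).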
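For reference, here is a minimal sketch of how such a measurement could be set up. It is an illustration under assumptions, not the exact code behind the numbers in this post: the kernel `dot`, the helper `time_dot`, the `CHECK` macro, the block size of 256 and the choice to time the host-to-device copy together with the kernel are all hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical dot-product kernel: each thread adds one product to *sum.
// atomicAdd keeps the sketch short; a real kernel would use a reduction.
__global__ void dot(const float *a, const float *b, float *sum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, a[i] * b[i]);
}

// Returns the elapsed milliseconds for (optional copy) + kernel.
// copy_first = true : a, b are host pointers (pageable or pinned).
// copy_first = false: a, b are device pointers to mapped host memory.
float time_dot(const float *a, const float *b, int n, bool copy_first) {
    size_t bytes = (size_t)n * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_sum = NULL;
    CHECK(cudaMalloc((void **)&d_sum, sizeof(float)));
    CHECK(cudaMemset(d_sum, 0, sizeof(float)));
    if (copy_first) {
        CHECK(cudaMalloc((void **)&d_a, bytes));
        CHECK(cudaMalloc((void **)&d_b, bytes));
    }
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    if (copy_first) {
        CHECK(cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice));
        a = d_a;  // the kernel reads the device copies
        b = d_b;
    }
    dot<<<(n + 255) / 256, 256>>>(a, b, d_sum, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));  // wait until the kernel has finished

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_a));  // cudaFree(NULL) is a no-op
    CHECK(cudaFree(d_b));
    CHECK(cudaFree(d_sum));
    return ms;
}

int main() {
    const int n = 51200;
    const size_t bytes = (size_t)n * sizeof(float);
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));  // allow mapped host memory

    float *pg_a = (float *)malloc(bytes), *pg_b = (float *)malloc(bytes);
    if (!pg_a || !pg_b) return EXIT_FAILURE;
    float *pin_a, *pin_b, *zc_a, *zc_b, *zc_da, *zc_db;
    CHECK(cudaMallocHost((void **)&pin_a, bytes));
    CHECK(cudaMallocHost((void **)&pin_b, bytes));
    CHECK(cudaHostAlloc((void **)&zc_a, bytes, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void **)&zc_b, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&zc_da, zc_a, 0));
    CHECK(cudaHostGetDevicePointer((void **)&zc_db, zc_b, 0));

    for (int i = 0; i < n; ++i) {  // same random data in all three buffers
        pg_a[i] = pin_a[i] = zc_a[i] = rand() / (float)RAND_MAX;
        pg_b[i] = pin_b[i] = zc_b[i] = rand() / (float)RAND_MAX;
    }

    time_dot(pin_a, pin_b, n, true);  // warm-up run, result discarded
    printf("pageable : %.3f ms\n", time_dot(pg_a, pg_b, n, true));
    printf("pinned   : %.3f ms\n", time_dot(pin_a, pin_b, n, true));
    printf("zero copy: %.3f ms\n", time_dot(zc_da, zc_db, n, false));

    free(pg_a); free(pg_b);
    CHECK(cudaFreeHost(pin_a)); CHECK(cudaFreeHost(pin_b));
    CHECK(cudaFreeHost(zc_a));  CHECK(cudaFreeHost(zc_b));
    return 0;
}
```

In this sketch the pageable and pinned cases time the copy plus the kernel, while the zero-copy case times only the kernel, which reads host memory directly. Whether the events bracket the copies, the kernel or both changes what the three columns actually compare.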
I expected zero copy to be the fastest, but for N above roughly 50,000 the page-locked (not mapped) memory is faster.
Can page-locked memory really be faster than zero copy, and if so, why? Or have I made a mistake?
Example: times in milliseconds for the three memory types, for different numbers of randomly generated input data.
The mean and the error are taken over 300 different runs for each data size.
| N data | Pageable mean (ms) | Pageable err (ms) | Page-locked mean (ms) | Page-locked err (ms) | Zero-copy mean (ms) | Zero-copy err (ms) |
|--------|--------------------|-------------------|-----------------------|----------------------|---------------------|--------------------|
| 512    | 0.23               | 0.02              | 0.16                  | 0.01                 | 0.06                | 0.01               |
| 1024   | 0.23               | 0.02              | 0.16                  | 0.01                 | 0.07                | 0.01               |
| 2048   | 0.26               | 0.02              | 0.17                  | 0.01                 | 0.07                | 0.01               |
| 4096   | 0.32               | 0.02              | 0.19                  | 0.01                 | 0.09                | 0.01               |
| 10240  | 0.49               | 0.02              | 0.25                  | 0.01                 | 0.14                | 0.01               |
| 51200  | 1.58               | 0.06              | 0.64                  | 0.03                 | 0.65                | 0.03               |
| 76800  | 2.28               | 0.05              | 0.88                  | 0.04                 | 0.96                | 0.04               |
| 92160  | 2.70               | 0.06              | 1.03                  | 0.04                 | 1.15                | 0.05               |
| 102400 | 2.97               | 0.09              | 1.11                  | 0.06                 | 1.27                | 0.08               |
| 128000 | 3.60               | 0.17              | 1.34                  | 0.07                 | 1.58                | 0.10               |
| 307200 | 6.74               | 0.23              | 3.06                  | 0.20                 | 3.77                | 0.26               |
| 512000 | 10.13              | 0.48              | 5.01                  | 0.34                 | 6.23                | 0.42               |
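
The mean and error columns could be computed from the per-run timings along these lines. The post does not specify whether "error" means the sample standard deviation or the standard error of the mean, so this host-side sketch computes both. `RunStats`, `summarize` and the five sample timings are hypothetical placeholders, not measured values.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

// Summary of repeated timings for one (memory type, N) combination.
struct RunStats {
    double mean;    // arithmetic mean of the runs, in ms
    double stddev;  // sample standard deviation (n - 1 in the denominator)
    double sem;     // standard error of the mean = stddev / sqrt(n)
};

// Host-only helper: reduces a list of per-run timings to mean and spread.
RunStats summarize(const std::vector<float> &ms) {
    RunStats s = {0.0, 0.0, 0.0};
    if (ms.empty()) return s;

    const double n = static_cast<double>(ms.size());
    double sum = 0.0;
    for (size_t i = 0; i < ms.size(); ++i) sum += ms[i];
    s.mean = sum / n;

    if (ms.size() > 1) {
        double sq = 0.0;
        for (size_t i = 0; i < ms.size(); ++i) {
            const double d = ms[i] - s.mean;
            sq += d * d;
        }
        s.stddev = std::sqrt(sq / (n - 1.0));
        s.sem = s.stddev / std::sqrt(n);
    }
    return s;
}

int main() {
    // Hypothetical timings; in the real test there would be 300 per row.
    const float sample[] = {0.62f, 0.66f, 0.64f, 0.65f, 0.63f};
    std::vector<float> runs(sample, sample + 5);

    RunStats s = summarize(runs);
    std::printf("mean = %.3f ms, stddev = %.3f ms, sem = %.3f ms\n",
                s.mean, s.stddev, s.sem);
    return 0;
}
```

With 300 runs the standard error is about 17 times smaller than the standard deviation, so knowing which of the two is reported matters when judging whether the page-locked and zero-copy means really differ.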