(Since unified memory behaves differently on Pascal than on earlier architectures, and it has a bug when running on Windows, I am testing without unified memory.)
When just running the program for image computation (calculating disparity, etc.), the GTX 1060 runs slower than the GTX 960, and it is even slower than a notebook GPU (slower than a 940M).
(This does not include transfer time; it is just the kernel execution time.)
Is there something (e.g. warp usage?) that should be changed for running on Pascal?
I have checked some documentation pages, but could not find any useful information.
I don’t really know how you are compiling in each case.
I don’t really know how you are measuring timing in each case.
If you got the compile options wrong, you may be measuring JIT compilation time in one case.
There are lots of possibilities here which might explain your data. My guess is you are doing something wrong in your comparison, and that the GTX 1060 is not actually slower than the GTX 960.
I think the test case is too long to paste here, so I will try some other, simpler tests.
The only parameters passed to nvcc are -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_50,code=compute_50 and -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61.
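For reference, those -gencode pairs embed both SASS (code=sm_*) and PTX (code=compute_*) for each architecture, which should avoid the JIT scenario described above. A full build line would look something like this (stereo.cu is a placeholder file name, not from the original project):

```
# code=sm_* embeds SASS for that GPU (no JIT at kernel launch);
# code=compute_* embeds PTX for forward compatibility via JIT.
nvcc -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_50,code=compute_50 \
     -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61 \
     -o stereo stereo.cu
```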
I measured the kernel running time using the Visual Profiler on Windows: I ran the kernel 10 times and picked a time from the middle of the runs. (A cudaEvent-based alternative is sketched below.)
I am not familiar with this part. Could you please give an introductory link about it?
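As an aside on the timing methodology: kernel execution time can also be measured directly with CUDA events, which excludes transfer time just as the profiler's kernel timing does. A minimal sketch, with myKernel standing in for the real disparity kernel (not the original code):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only to illustrate the timing pattern.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time costs (e.g. JIT compilation, if the binary
    // only carries PTX for this GPU) are not counted in the measurement.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```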
I have done more tests today and finally found the reason:
I hadn’t replaced all of the unified memory with device memory; some managed pointers remained.
When I replaced all the cudaMallocManaged calls with cudaMalloc, the problem was resolved.
The 1060 really does run faster than the 960.
Using unified memory on Windows really is different on Pascal…
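For anyone hitting the same problem, the change amounts to swapping the allocation call and adding explicit copies. A minimal sketch with illustrative names (h_img, d_img, bytes are placeholders, not from the original code):

```
#include <cuda_runtime.h>

// Before: one managed allocation used from both host and device.
//   float *img;
//   cudaMallocManaged(&img, bytes);
//   kernel<<<grid, block>>>(img, n);   // demand-paged on Pascal

// After: explicit device allocations plus host<->device copies.
void runOnDevice(const float *h_img, float *h_out, size_t bytes) {
    float *d_img, *d_out;
    cudaMalloc(&d_img, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);
    // kernel<<<grid, block>>>(d_img, d_out, n);   // same kernel as before
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    cudaFree(d_out);
}
```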
Unified memory got super slow with CUDA 8.0 unless you can use cudaMemPrefetchAsync to prefetch the data you want to use to the GPU (it was great before CUDA 8.0…!). However, this API is broken under Windows for the time being - see:
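Where the API does work (i.e. not on Windows at the time of writing, per the above), the prefetch pattern is roughly the following; process and the launch configuration are placeholders:

```
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real work.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 0.0f;  // first touched on the host

    int device = 0;
    cudaGetDevice(&device);

    // Migrate the managed allocation to the GPU before the launch, so the
    // kernel does not pay per-page demand-migration costs on first touch.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);

    process<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```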