Kernel runs slowly on Pascal (CUDA 8)

Up2U · August 3, 2017, 2:56am

(Because unified memory behaves differently on Pascal than on earlier architectures,
and it has a bug when running on Windows,
I am testing without unified memory.)

When running only the image-processing kernels (disparity calculation, etc.),
the GTX 1060 is slower than the GTX 960,
and it is even slower than a notebook GPU (the 940M).

(This does not include transfer time, only the kernel run time.)

Is there something (e.g. warp usage) that should be changed to run well on Pascal?

I have checked some documentation pages but could not find any useful information.

Thank you.

Robert_Crovella · August 3, 2017, 3:05am

Are you building a debug project on Windows, or is it a release project? It sounds like you are comparing a debug build to a release build.

Up2U · August 3, 2017, 3:08am

Thanks for your reply.

I am just testing the same program (release build) on different PCs.

(Compiled using SM50, compute61)

Robert_Crovella · August 3, 2017, 3:09am

sm_50 with compute_61 is not a valid combination of compile switches.

Up2U · August 3, 2017, 4:20am

Sorry for not listing them clearly.

The comparison test used:
-gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_50,code=compute_50

and the standalone test for the 1060 used:
-gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61

The comparison test showed that the 1060 is slower than the 960 / 940M.
The standalone test showed almost the same speed for the compute_50 and compute_61 builds.

Robert_Crovella · August 3, 2017, 4:37am

So clearly you are compiling the project in two separate cases.

Maybe one is a debug project and the other is release, or there is some other difference between the two cases.

Up2U · August 3, 2017, 5:44am

I think I have done all of this in release mode.

In theory, the 1060 should not be slower than the 960 when running the existing code, should it?

Robert_Crovella · August 3, 2017, 2:03pm

In theory I would agree. But:

1. You haven’t provided a test case.
2. I don’t really know how you are compiling in each case.
3. I don’t really know how you are measuring timing in each case.
4. If you got the compile options wrong, you may be measuring JIT compilation time in one case.

There are lots of possibilities here which might explain your data. My guess is you are doing something wrong in your comparison, and that GTX 1060 is not actually slower than GTX960.
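
One way to rule out the measurement concerns raised above is to time the kernel with CUDA events after a warm-up launch, so that one-time startup costs (context creation, any JIT compilation of PTX) are not counted. The following is only a minimal sketch: `scaleKernel`, the buffer size, and the run count are hypothetical stand-ins, not the poster's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real image-processing kernel.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemset(d_data, 0, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch: absorbs one-time startup cost so it is not timed.
    scaleKernel<<<blocks, threads>>>(d_data, n, 1.0f);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int runs = 10;                   // assumed repeat count
    cudaEventRecord(start);
    for (int r = 0; r < runs; ++r) {
        scaleKernel<<<blocks, threads>>>(d_data, n, 1.0f);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // wait until all timed launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average kernel time: %.3f ms\n", ms / runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Building the same source with identical optimization settings for each target, and changing only the `-gencode` flags, keeps the comparison between GPUs like for like.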

Up2U · August 4, 2017, 1:27am

1. I think the test case is too long to paste here. I will try some simpler tests.
2. The only parameters passed to nvcc are -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_50,code=compute_50 and -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_61,code=compute_61
3. I measured the kernel run time using Visual Profiler on Windows. I ran the kernel 10 times and picked one run from the middle.
4. I am not familiar with this part (JIT compilation). Could you please point me to an introductory link about it?

Thank you.

Up2U · August 4, 2017, 3:14am

I did more testing today and finally found the reason:
I had not replaced all of the unified memory with device memory; some managed pointers remained.

Once I replaced every cudaMallocManaged with cudaMalloc, the problem was resolved.

The 1060 really does run faster than the 960.

Using unified memory on Windows is really different for Pascal…
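
As a rough illustration of the change described in this post, the sketch below uses an explicit device allocation with `cudaMalloc` and copies data with `cudaMemcpy` instead of relying on a managed pointer. The kernel `addOne` and the buffer size are hypothetical; the actual disparity code is not shown in the thread.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the image calculation.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                 // assumed element count
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_data(n, 1.0f);    // ordinary host buffer

    // Previously (managed): cudaMallocManaged(&ptr, bytes);
    // Now: explicit device memory plus explicit copies.
    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    addOne<<<blocks, threads>>>(d_data, n);

    // Blocking copy back; it also waits for the kernel to finish.
    cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[0] = %.1f\n", h_data[0]);

    cudaFree(d_data);
    return 0;
}
```

With this pattern the kernel only ever touches memory that already resides on the device, so no page migration happens while it runs.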

John_Smith_Lon · August 18, 2017, 11:05pm

Unified memory got super slow with CUDA 8.0 unless you can use cudaMemPrefetchAsync to prefetch the data you want to use to the GPU (it was great before CUDA 8.0…!). However, this API is broken under Windows for the time being. See:

https://devtalk.nvidia.com/default/topic/981147/cudamemprefetchasync-returns-cudaerrorinvaliddevice/?offset=9
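
For reference, a prefetch on a platform where it works might look like the sketch below. It assumes the CUDA 8-era signature of `cudaMemPrefetchAsync` (pointer, byte count, destination device, stream) and checks the return value, since the linked topic reports `cudaErrorInvalidDevice` on Windows. The kernel `addOne` and the sizes are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that reads and writes the managed buffer.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                 // assumed element count
    const size_t bytes = n * sizeof(float);
    int device = 0;
    cudaGetDevice(&device);

    float *data = nullptr;
    cudaMallocManaged(&data, bytes);
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // pages first touched on the host

    // Ask the driver to migrate the pages to the GPU before the kernel needs them.
    cudaError_t err = cudaMemPrefetchAsync(data, bytes, device, 0);
    if (err != cudaSuccess) {
        // Reported to fail on Windows in this thread; continue without the prefetch.
        printf("prefetch unavailable: %s\n", cudaGetErrorString(err));
        cudaGetLastError();                // reset the recorded error state
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    addOne<<<blocks, threads>>>(data, n);
    cudaDeviceSynchronize();               // required before the host reads managed data

    printf("data[0] = %.1f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

If the prefetch call fails, the kernel in this sketch still runs; the data is then migrated without the prefetch hint.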

Up2U · August 21, 2017, 1:24am

Thanks for the reply.

I have removed all of the unified memory code now
and am using traditional upload/download for Pascal.

I hope CUDA 9 fixes this problem.