The function gpuGetMaxGflopsDeviceId in the CUDA Toolkit 10.0 version of helper_cuda.h has a problem: it does NOT pick the fastest device the way the CUDA Toolkit 9.2 version does.
Tracing through the Toolkit 10.0 code shows the following:
inline int gpuGetMaxGflopsDeviceId()
DEVICE 0
deviceProp = {name="Quadro P2000" uuid={...} luid=...}
compute_perf = 1516032000
sm_per_multiproc = 128
deviceProp.multiProcessorCount = 8
deviceProp.clockRate = 1480500
DEVICE 1
deviceProp = {name="GeForce GTX 690" uuid={...} ...}
compute_perf = 1565952000
sm_per_multiproc = 192
deviceProp.multiProcessorCount = 8
deviceProp.clockRate = 1019500
DEVICE 2
deviceProp = {name="GeForce GTX 690" uuid={...} ...}
compute_perf = 1565952000
sm_per_multiproc = 192
deviceProp.multiProcessorCount = 8
deviceProp.clockRate = 1019500
max_perf_device = 1
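The compute_perf values above are consistent with a simple cores-times-clock estimate: 8 x 128 x 1,480,500 = 1,516,032,000 for the Quadro P2000 and 8 x 192 x 1,019,500 = 1,565,952,000 for each GeForce GTX 690, so the older Kepler cards come out marginally ahead on paper. The following is a minimal sketch that reproduces that per-device estimate for comparison; it is not the toolkit code, and smPerMultiproc() is a hypothetical stand-in for the cores-per-SM lookup done by _ConvertSMVer2Cores() in helper_cuda.h, covering only the two architectures present in this system.

// Sketch: reproduce the compute_perf estimate seen in the trace above.
// Assumes compute_perf = multiProcessorCount * cores-per-SM * clockRate,
// which matches the numbers printed for all three devices.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for _ConvertSMVer2Cores(), covering only
// the two architectures in this system.
static int smPerMultiproc(int major, int minor) {
    if (major == 3) return 192;                // Kepler (GeForce GTX 690)
    if (major == 6 && minor == 1) return 128;  // Pascal (Quadro P2000)
    return 128;                                // fallback assumption
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        unsigned long long perf =
            (unsigned long long)prop.multiProcessorCount *
            smPerMultiproc(prop.major, prop.minor) * prop.clockRate;
        printf("DEVICE %d %s compute_perf=%llu tcc=%d\n",
               dev, prop.name, perf, prop.tccDriver);
    }
    return 0;
}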
In CUDA Toolkit 9.2 it picked Device 0, which is correct.
In CUDA Toolkit 10.0 it picked Device 1, which is incorrect. Firstly, Device 0 is in TCC mode and Device 1 is NOT. Secondly, the actual compute performance of a GeForce GTX 690 is NOT the same as that of a Quadro P2000, even though the compute_perf estimate above ranks the GTX 690 slightly higher.
As a result, the samples will run on the wrong GPU by default.
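A possible workaround until helper_cuda.h is fixed is to bypass gpuGetMaxGflopsDeviceId() and choose the device explicitly, for example by preferring the highest compute capability and only using a raw multiProcessorCount * clockRate estimate to break ties. The sketch below is an assumed replacement policy of my own, not a patch to the toolkit:

// Sketch: explicit device selection preferring the newest architecture.
// The selection rule here is an assumption, not the toolkit's own logic.
#include <cstdio>
#include <cuda_runtime.h>

static int pickDevice() {
    int count = 0;
    cudaGetDeviceCount(&count);
    int best = 0, bestMajor = -1, bestMinor = -1;
    unsigned long long bestPerf = 0;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (prop.computeMode == cudaComputeModeProhibited)
            continue;  // skip devices that cannot run compute work
        unsigned long long perf =
            (unsigned long long)prop.multiProcessorCount * prop.clockRate;
        bool newerArch = (prop.major > bestMajor) ||
                         (prop.major == bestMajor && prop.minor > bestMinor);
        bool sameArchFaster = (prop.major == bestMajor &&
                               prop.minor == bestMinor && perf > bestPerf);
        if (newerArch || sameArchFaster) {
            best = dev;
            bestMajor = prop.major;
            bestMinor = prop.minor;
            bestPerf = perf;
        }
    }
    return best;
}

int main() {
    int dev = pickDevice();
    cudaSetDevice(dev);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("Selected device %d: %s (sm_%d%d)\n",
           dev, prop.name, prop.major, prop.minor);
    return 0;
}

On the system traced above this should select the Quadro P2000 (sm_61, Device 0) rather than the Kepler GTX 690s (sm_30), matching the 9.2 behaviour described above.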