When I call cudaMemPrefetchAsync it returns cudaErrorInvalidDevice. I’m compiling with CUDA 8.0 for a GTX 1080 GPU, using the compile flags compute_60,sm_60. Full sample code below:
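Roughly, the code follows this pattern (a minimal sketch only; the std::vector<int> named in, the increment kernel, and the buffer size are placeholders of mine, not the original listing):

#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

__global__ void increment(int *data, size_t n)   // placeholder kernel
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    std::vector<int> in(1 << 20, 1);             // placeholder host data
    int *data = nullptr;
    cudaError_t err;

    err = cudaSetDevice(0);
    err = cudaMallocManaged(&data, in.size() * sizeof(int));
    memcpy(data, in.data(), in.size() * sizeof(int));

    // This is the call that returns cudaErrorInvalidDevice (error 10) on my setup:
    err = cudaMemPrefetchAsync(data, in.size() * sizeof(int), 0, 0);
    printf("cudaMemPrefetchAsync: %s\n", cudaGetErrorString(err));

    increment<<<(unsigned)((in.size() + 255) / 256), 256>>>(data, in.size());
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}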
It’s an Alienware 17 laptop with a GTX 1080 and integrated Intel HD Graphics 530. The code runs fine without calling cudaMemPrefetchAsync, but with poor performance when using managed memory.
I tried compute_61,sm_61 but get the same error. Does the above code run OK for anyone else on a Pascal architecture, i.e. does err = cudaMemPrefetchAsync(data, in.size() * sizeof(int), 0, 0) return 0?
It’s frustrating, as I have a large object-oriented project that depends heavily on managed memory being fast, and the changes under CUDA 8.0 / Pascal have reduced performance back to CPU speed because the memory is not being pre-cached on the GPU. I’m facing a big redesign to avoid managed memory if this can’t be resolved, which feels like a big step backwards. It would have been nice if NVIDIA had made the new managed memory approach backward compatible in the sense of not breaking the efficiency of existing code; I shouldn’t have to rewrite code to use cudaMemPrefetchAsync everywhere just because of a CUDA version increase. Whether managed memory is pre-cached on the GPU should be a configurable option at the point of allocation.
Given that the error is not occurring on another Pascal setup, does anyone know how to proceed in this circumstance? I.e., is this likely a configuration problem with Windows 10 and the GTX 1080, or a bad driver (I’m using the latest drivers for the 1080)?
What happens if you leave out the cudaSetDevice(0) call?
Also, I notice that you never check the err variable. That seems to suggest you’ve been running a debug build in the debugger. What happens if you add error-checking code and run a release build without a debugger attached?
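For reference, one common error-checking pattern looks like this (just a sketch; the CHECK_CUDA macro name is arbitrary):

#include <cstdio>
#include <cuda_runtime.h>

// Check the result of a CUDA runtime call and report failures immediately.
#define CHECK_CUDA(call)                                                    \
    do {                                                                    \
        cudaError_t e = (call);                                             \
        if (e != cudaSuccess) {                                             \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",                \
                    (int)e, cudaGetErrorString(e), __FILE__, __LINE__);     \
        }                                                                   \
    } while (0)

// Usage:
//   CHECK_CUDA(cudaSetDevice(0));
//   CHECK_CUDA(cudaMemPrefetchAsync(data, bytes, 0, 0));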
The plot thickens… I’ve now got a Titan X Pascal running on an Alienware Amplifier with the laptop. The same error 10 occurs with managed memory, again using the latest NVIDIA drivers.
It is an NVIDIA bug on Windows systems which occurs with the Pascal architecture.
I have known this for a few days, but could not post it here because I was on vacation without an internet connection.
For details, see the comments on: https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/
where Mark Harris from NVIDIA confirms the bug. It should be corrected with CUDA 9. He also says that it should be communicated to Microsoft to help the cause, but I have not found a suitable Microsoft bug report page so far.
Thanks, NVIDIA confirmed this as well on my bug report. It is pretty frustrating that NVIDIA released CUDA 8.0 and the new managed memory framework without even bothering to test it on a Windows box. It’s especially annoying given that the CUDA blog is raving about how great the new managed memory framework is; it was great until CUDA 8.0 broke it.
Managed memory worked perfectly before CUDA 8.0, and because of this bug I had to abandon it and write tons of conversion routines to move C++ objects from CPU to GPU, costing me about two weeks of development time.
Please, NVIDIA, can you test CUDA 9.0 properly before you release it?
I have been affected by this bug for 7 months; I spent 2 months trying to fix it before I reported it as a bug.
But only with the Parallel Forall post by Mark Harris could I be sure it is a bug. Nobody had posted about it anywhere, only questions about whether I was sure I had only one graphics card in my system.
NVIDIA could at least offer a patch with a fallback for Pascal GPUs to the CUDA 7.5 managed memory behaviour without paging, and a clear statement in the documentation that this feature does not work on the Windows OS. Then at least I wouldn’t have bought expensive, useless Pascal cards. Going back to CUDA 7.5 is also not an option because of the many CUDA 8 matrix methods I am using.
In the meantime I have also been affected by additional bugs in CUDA 8.0.61 when using CUB instead of Thrust, so yes, they should test more and do less marketing for new features.
But I would like to have CUDA 9.0 as soon as possible, once the bug is fixed.
I am seeing this as well on Windows 10 version 1709 (OS Build 16299.64), using CUDA 9 (9.0.176) with a single Quadro M5000 card, driver version 385.54, when I run the same code posted above:
cudaMallocManaged() returns success, and kernels execute fine (in other code).
If I add cudaGetDevice(), it returns success with device 0; cudaSetDevice() also returns success.
In both the original case and with cudaSetDevice() added, the result is the same.
cudaMemPrefetchAsync requires a Pascal or newer device. You have a Maxwell generation GPU, so that is an invalid device for that function call.
Prefetching of memory is associated with the unified memory (UM) model applicable to Pascal and Volta generation GPUs (i.e., demand paging). The pre-Pascal UM regime does not have a concept of prefetching or paging.
If you query the pageableMemoryAccess property on your GPU, you’ll find that it is not set.
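For example, something along these lines will print the two properties relevant here (a minimal sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    // Both properties are 0 when the demand-paging unified memory regime
    // is not available on this device/OS combination.
    printf("%s: pageableMemoryAccess=%d concurrentManagedAccess=%d\n",
           prop.name, prop.pageableMemoryAccess, prop.concurrentManagedAccess);
    return 0;
}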
maxGridSize {2147483647, 65535, 65535} int[3]
clockRate 1860000 int
totalConstMem 65536 unsigned __int64
major 6 int
minor 1 int
textureAlignment 512 unsigned __int64
texturePitchAlignment 32 unsigned __int64
deviceOverlap 1 int
multiProcessorCount 20 int
kernelExecTimeoutEnabled 1 int
integrated 0 int
canMapHostMemory 1 int
computeMode 0 int
maxTexture1D 131072 int
maxTexture1DMipmap 16384 int
maxTexture1DLinear 134217728 int
maxSurface2DLayered {32768, 32768, 2048} int[3]
maxSurfaceCubemap 32768 int
maxSurfaceCubemapLayered {32768, 2046} int[2]
surfaceAlignment 512 unsigned __int64
concurrentKernels 1 int
ECCEnabled 0 int
pciBusID 8 int
pciDeviceID 0 int
pciDomainID 0 int
tccDriver 0 int
asyncEngineCount 2 int
unifiedAddressing 1 int
memoryClockRate 5005000 int
memoryBusWidth 256 int
l2CacheSize 2097152 int
maxThreadsPerMultiProcessor 2048 int
streamPrioritiesSupported 1 int
globalL1CacheSupported 1 int
localL1CacheSupported 1 int
sharedMemPerMultiprocessor 98304 unsigned __int64
regsPerMultiprocessor 65536 int
managedMemory 1 int
isMultiGpuBoard 0 int
multiGpuBoardGroupID 0 int
hostNativeAtomicSupported 0 int
singleToDoublePrecisionPerfRatio 32 int
pageableMemoryAccess 0 int
concurrentManagedAccess 0 int
computePreemptionSupported 0 int
canUseHostPointerForRegisteredMem 0 int
cooperativeLaunch 0 int
cooperativeMultiDeviceLaunch 0 int
sharedMemPerBlockOptin 0 unsigned __int64
Are you on Windows by any chance? If so, check the documentation as to whether demand paging is supported with the WDDM driver. The WDDM driver puts Windows in charge of managing GPU memory, which has all kinds of “interesting” ramifications, most of them not favorable to CUDA users.
“Note that currently these features are only supported on Linux operating systems. Applications running on Windows (whether in TCC or WDDM mode) or macOS will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.”
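Given that, one pragmatic way to keep a single code path working across both regimes is to guard the prefetch at runtime. A sketch, assuming it is acceptable to simply skip the prefetch where demand paging isn’t available (the prefetchIfSupported helper is my own, not a CUDA API):

#include <cstddef>
#include <cuda_runtime.h>

// Prefetch only if the device/OS combination supports concurrent managed
// access (the demand-paging regime); otherwise skip the call and fall back
// to the basic pre-6.x unified memory behavior.
cudaError_t prefetchIfSupported(void *ptr, size_t bytes, int device, cudaStream_t stream)
{
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    if (!concurrent)
        return cudaSuccess;   // prefetch not supported here; nothing to do
    return cudaMemPrefetchAsync(ptr, bytes, device, stream);
}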
I haven’t tried it out yet, but I got an email from NVIDIA about my original bug report to them. The status of the bug has changed from “Open - Fix being tested” to “Closed - Fixed”.