cudaMemPrefetchAsync returns cudaErrorInvalidDevice

Hi,

When I call cudaMemPrefetchAsync it returns cudaErrorInvalidDevice. I’m compiling on CUDA 8.0 with a GTX1080 GPU, and using compile flags compute_60,sm_60. Full sample code below:

Any ideas what’s wrong?

Thanks!

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <vector>

__global__ static void CUDAKernelAddOneToVector(int *data){
	const int x = blockIdx.x * blockDim.x + threadIdx.x;
	const int y = blockIdx.y * blockDim.y + threadIdx.y;
	const int mx = gridDim.x * blockDim.x;

	data[y * mx + x] = data[y * mx + x] + 1;
}

void AddOneToVector(std::vector<int> &in){
	int *data;
	cudaError_t err = cudaSetDevice(0);

	err = cudaMallocManaged(reinterpret_cast<void **>(&data),
		in.size() * sizeof(int),
		cudaMemAttachGlobal);

	err = cudaMemcpy(data, in.data(), in.size()*sizeof(int), cudaMemcpyHostToHost);

	dim3 blks(in.size() / (16 * 32), 1);
	dim3 threads(32, 16);

	err = cudaMemPrefetchAsync(data, in.size() * sizeof(int), 0, 0);

	CUDAKernelAddOneToVector << <blks, threads >> >(data);

	cudaDeviceSynchronize();

	cudaMemcpy(in.data(), data, in.size()*sizeof(int), cudaMemcpyHostToHost);

	cudaFree(data);
}

int main()
{
	std::vector<int>* in = new std::vector<int>( { 1, 2, 3, 4 } );
	AddOneToVector(*in);
}

Do you have more than 1 GPU in that system? If so, which GPUs are they? Are they all on the same P2P mesh?

Your device is actually a cc6.1 device, so I would recommend compiling for compute_61,sm_61, although I don’t think that is the crux of the issue.

It’s an Alienware 17 laptop with GTX 1080 and an integrated Intel HD Graphics 530. Code runs fine without calling cudaMemPrefetchAsync, but with poor performance when using Managed memory.

Tried compute_61,sm_61 but get the same error. Does the above code run ok for anyone else with a Pascal architecture, i.e. does err = cudaMemPrefetchAsync(data, in.size() * sizeof(int), 0, 0) return 0?

It runs fine for me on Ubuntu 14.04, CUDA 8.0, Titan X Pascal. That function call returns 0.

Thanks for checking.

It’s frustrating as I have a large object orientated project that depends heavily on managed memory being fast, and the changes under CUDA 8.0 / Pascal have reduced performance back to CPU speed because the memory is not being pre-cached on the GPU. I’m facing a big redesign to avoid managed memory if this can’t be resolved which fills like a big step backwards. It would have been nice if NVidia had made the new managed memory approach backward compatible in sense of not breaking existing code efficiency - I shouldn’t have to rewrite code to use cudaMemPrefetchAsync everywhere just by a version increase of CUDA. It should be a configurable option of ManagedMemory at the point of allocation of whether it is pre-cached on the GPU or not.

Given the error is not occurring on another Pascal set up. does anyone know how to proceed in this circumstance? i.e. is this likely a configuration problem with Windows 10 and the GTX 1080 or a bad driver (I’m using latest drivers for the 1080)?

What happens if you leave out the cudaSetDevice(0) call?

Also I notice that you never check the err variable. That seems to suggest you’ve been running a debug build in the debugger. What happens if you add error checking code and run a release build without a debugger attached?

Sadly doesn’t make any difference. Running release of below errors on the cudaMemPrefetchAsync call.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
int main()
{
	int* data;
	size_t len = 10;
	cudaError_t err = cudaMallocManaged(reinterpret_cast<void **>(&data), len, cudaMemAttachGlobal);
	if (err != 0) { printf("Error calling cudaMallocManaged"); }
	err = cudaMemPrefetchAsync(data, len, 0, 0);
	if (err != 0) { printf("Error calling cudaMemPrefetchAsync"); }
}

The plot thickens…I’ve now got a Titan X Pascal running on a Alienware Amplifier with the laptop. Same error 10 occurs with managed memory. Again using latest NVidia drivers.

Hi,

it is a BUG of NVIDIA on Windows Systems witch occurs with PASCAL architecture.

I know this since a few days, but could not write it here because i was on vacation without internet connection.

For details see the comments of: https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/
where Mark Harris from NVIDIA confirms the Bug. It should be corrected with CUDA 9. He also tells that it should be communicated to Microsoft to help the caus. But i don’t found a suitable Microsoft Bug Report Page till now.

1 Like

Thanks, Nvidia confirmed this as well on my bug report. It is pretty frustrating that NVidia released CUDA 8.0 and the new managed memory framework without even bothering to test it on a Windows box. Especially annoying given the CUDA blog is raving about how great the new managed memory framework is - it was great until CUDA 8.0 broke it.

Managed memory worked perfectly before CUDA 8.0 and because of this bug I had to abandon managed memory and write tons of conversion routines to take c++ objects from cpu to gpu costing me about 2 weeks of development time.

Please NVidia can you test CUDA 9.0 properly before you release??

Hi John_Smith_Lon,

i am impaired by this bug since 7 months. 2 months i tried to fix it before i reported it as a bug.
But first with the parrallel for all post of Mark Harris i could be sure it is a bug. Nobody posted it anywhere. Only stupid questions if i am sure to have only one graphic card in my system.

At least a patch with a fallback for Pascal GPUs to Cuda 7.5 behavior for ManagedMemory without Paging NVIDIA could offer and a clear statement in the documentation that this feature do not work on windows os. At least i wouldn’t buy expensive usless Pascal cards. Going back to CUDA 7.5 is also no option because of many matrix methods of CUDA 8 i am using.

In the mean while i also affected by additional bugs in CUDA 8.0.61 using CUB instead of thrust, so yes they should test more and do less marketing for new features.

But i would like to have CUDA 9.0 as soon as possible, when the bug is fixed.

I am seeing this as well on Windows 10 version 1709 (OS Build 16299.64) using CUDA 9 (9.0.176) using with 1 x M5000 Quadro card Driver Version 385.54, if I run the same code posted above:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
int main()
{
	int* data;
	size_t len = 10;
	cudaError_t err = cudaMallocManaged(reinterpret_cast<void **>(&data), len, cudaMemAttachGlobal);
	if (err != 0) { printf("Error calling cudaMallocManaged"); }
	err = cudaMemPrefetchAsync(data, len, 0, 0);
	if (err != 0) { printf("Error calling cudaMemPrefetchAsync"); }
}

cudaMallocManaged() returns success, kernels execute fine (in other code).
If I add cudaGetDevice() it returns success, with device 0, cudaSetDevice() also returns success.

In both the original case and adding cudaSetDevice(), the result is the same.

cudaMemPrefetchAsync() always returns 10 (cudaErrorInvalidDevice)

Thoughts?

cudaMemPrefetchAsync requires a Pascal or newer device. You have a maxwell generation GPU. So that is an invalid device for that function call.

prefetching of memory is associated with the unified memory (UM) model applicable to Pascal and Volta generation gpus (i.e. demand paging). The pre-pascal UM regime does not have a concept of prefetching or paging.

if you query the pageableMemoryAccess property on your GPU, you’ll find that it is not set.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-data-migration

While unfortunate, thank you kindly for the quick and detailed response!

Using a GTX1080 I also see the paging flag as false, which I would not expect.

if( cudaGetDeviceProperties( &deviceProp, deviceNum ) == cudaSuccess )
...

deviceProp shows pageableMemoryAccess as 0

  •   deviceProp	{name=0x0000002c844fef60 "GeForce GTX 1080" totalGlobalMem=8589934592 sharedMemPerBlock=49152 ...}	cudaDeviceProp
    
  •   name	0x0000002c844fef60 "GeForce GTX 1080"	char[256]
      totalGlobalMem	8589934592	unsigned __int64
      sharedMemPerBlock	49152	unsigned __int64
      regsPerBlock	65536	int
      warpSize	32	int
      memPitch	2147483647	unsigned __int64
      maxThreadsPerBlock	1024	int
    
  •   maxThreadsDim	0x0000002c844ff084 {1024, 1024, 64}	int[3]
    
  •   maxGridSize	0x0000002c844ff090 {2147483647, 65535, 65535}	int[3]
      clockRate	1860000	int
      totalConstMem	65536	unsigned __int64
      major	6	int
      minor	1	int
      textureAlignment	512	unsigned __int64
      texturePitchAlignment	32	unsigned __int64
      deviceOverlap	1	int
      multiProcessorCount	20	int
      kernelExecTimeoutEnabled	1	int
      integrated	0	int
      canMapHostMemory	1	int
      computeMode	0	int
      maxTexture1D	131072	int
      maxTexture1DMipmap	16384	int
      maxTexture1DLinear	134217728	int
    
  •   maxTexture2D	0x0000002c844ff0e4 {131072, 65536}	int[2]
    
  •   maxTexture2DMipmap	0x0000002c844ff0ec {32768, 32768}	int[2]
    
  •   maxTexture2DLinear	0x0000002c844ff0f4 {131072, 65000, 2097120}	int[3]
    
  •   maxTexture2DGather	0x0000002c844ff100 {32768, 32768}	int[2]
    
  •   maxTexture3D	0x0000002c844ff108 {16384, 16384, 16384}	int[3]
    
  •   maxTexture3DAlt	0x0000002c844ff114 {8192, 8192, 32768}	int[3]
      maxTextureCubemap	32768	int
    
  •   maxTexture1DLayered	0x0000002c844ff124 {32768, 2048}	int[2]
    
  •   maxTexture2DLayered	0x0000002c844ff12c {32768, 32768, 2048}	int[3]
    
  •   maxTextureCubemapLayered	0x0000002c844ff138 {32768, 2046}	int[2]
      maxSurface1D	32768	int
    
  •   maxSurface2D	0x0000002c844ff144 {131072, 65536}	int[2]
    
  •   maxSurface3D	0x0000002c844ff14c {16384, 16384, 16384}	int[3]
    
  •   maxSurface1DLayered	0x0000002c844ff158 {32768, 2048}	int[2]
    
  •   maxSurface2DLayered	0x0000002c844ff160 {32768, 32768, 2048}	int[3]
      maxSurfaceCubemap	32768	int
    
  •   maxSurfaceCubemapLayered	0x0000002c844ff170 {32768, 2046}	int[2]
      surfaceAlignment	512	unsigned __int64
      concurrentKernels	1	int
      ECCEnabled	0	int
      pciBusID	8	int
      pciDeviceID	0	int
      pciDomainID	0	int
      tccDriver	0	int
      asyncEngineCount	2	int
      unifiedAddressing	1	int
      memoryClockRate	5005000	int
      memoryBusWidth	256	int
      l2CacheSize	2097152	int
      maxThreadsPerMultiProcessor	2048	int
      streamPrioritiesSupported	1	int
      globalL1CacheSupported	1	int
      localL1CacheSupported	1	int
      sharedMemPerMultiprocessor	98304	unsigned __int64
      regsPerMultiprocessor	65536	int
      managedMemory	1	int
      isMultiGpuBoard	0	int
      multiGpuBoardGroupID	0	int
      hostNativeAtomicSupported	0	int
      singleToDoublePrecisionPerfRatio	32	int
      pageableMemoryAccess	0	int
      concurrentManagedAccess	0	int
      computePreemptionSupported	0	int
      canUseHostPointerForRegisteredMem	0	int
      cooperativeLaunch	0	int
      cooperativeMultiDeviceLaunch	0	int
      sharedMemPerBlockOptin	0	unsigned __int64
    

Perhaps I am still missing something?

Are you on Windows by any chance? If so, check the documentation whether demand paging is supported with the WDDM driver. The WDDM driver puts Windows in charge of managing GPU memory, which has all kind of “interesting” ramifications, most of them not favorable to CUDA users.

Once more unto the docs:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements

“Note that currently these features are only supported on Linux operating systems. Applications running on Windows (whether in TCC or WDDM mode) or macOS will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.”

Yup, Windows 10. Thanks for the info.

Does anyone know when unified memory will finally be supported on Windows? It seems pretty sad this great feature is denied to a common platform.

Posssibly it will work now with Cuda 9.1.

I didn’t try it out yet. But i get an email from nvidia to my original bug report to them. The status of the bug is changed from “Open - Fix being tested” to “Closed - Fixed”