cudaMemPrefetchAsync returns cudaErrorInvalidDevice

John_Smith_Lon · December 10, 2016, 8:53am

Hi,

When I call cudaMemPrefetchAsync it returns cudaErrorInvalidDevice. I’m compiling on CUDA 8.0 with a GTX1080 GPU, and using compile flags compute_60,sm_60. Full sample code below:

Any ideas what’s wrong?

Thanks!

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <vector>

__global__ static void CUDAKernelAddOneToVector(int *data){
	const int x = blockIdx.x * blockDim.x + threadIdx.x;
	const int y = blockIdx.y * blockDim.y + threadIdx.y;
	const int mx = gridDim.x * blockDim.x;

	data[y * mx + x] = data[y * mx + x] + 1;
}

void AddOneToVector(std::vector<int> &in){
	int *data;
	cudaError_t err = cudaSetDevice(0);

	err = cudaMallocManaged(reinterpret_cast<void **>(&data),
		in.size() * sizeof(int),
		cudaMemAttachGlobal);

	err = cudaMemcpy(data, in.data(), in.size()*sizeof(int), cudaMemcpyHostToHost);

	dim3 blks(in.size() / (16 * 32), 1);
	dim3 threads(32, 16);

	err = cudaMemPrefetchAsync(data, in.size() * sizeof(int), 0, 0);

	CUDAKernelAddOneToVector << <blks, threads >> >(data);

	cudaDeviceSynchronize();

	cudaMemcpy(in.data(), data, in.size()*sizeof(int), cudaMemcpyHostToHost);

	cudaFree(data);
}

int main()
{
	std::vector<int>* in = new std::vector<int>( { 1, 2, 3, 4 } );
	AddOneToVector(*in);
}

Robert_Crovella · December 10, 2016, 2:26pm

Do you have more than 1 GPU in that system? If so, which GPUs are they? Are they all on the same P2P mesh?

Your device is actually a cc6.1 device, so I would recommend compiling for compute_61,sm_61, although I don’t think that is the crux of the issue.

John_Smith_Lon · December 10, 2016, 11:21pm

It’s an Alienware 17 laptop with GTX 1080 and an integrated Intel HD Graphics 530. Code runs fine without calling cudaMemPrefetchAsync, but with poor performance when using Managed memory.

Tried compute_61,sm_61 but get the same error. Does the above code run ok for anyone else with a Pascal architecture, i.e. does err = cudaMemPrefetchAsync(data, in.size() * sizeof(int), 0, 0) return 0?

Robert_Crovella · December 11, 2016, 12:01am

It runs fine for me on Ubuntu 14.04, CUDA 8.0, Titan X Pascal. That function call returns 0.

John_Smith_Lon · December 11, 2016, 5:19pm

Thanks for checking.

It’s frustrating as I have a large object orientated project that depends heavily on managed memory being fast, and the changes under CUDA 8.0 / Pascal have reduced performance back to CPU speed because the memory is not being pre-cached on the GPU. I’m facing a big redesign to avoid managed memory if this can’t be resolved which fills like a big step backwards. It would have been nice if NVidia had made the new managed memory approach backward compatible in sense of not breaking existing code efficiency - I shouldn’t have to rewrite code to use cudaMemPrefetchAsync everywhere just by a version increase of CUDA. It should be a configurable option of ManagedMemory at the point of allocation of whether it is pre-cached on the GPU or not.

Given the error is not occurring on another Pascal set up. does anyone know how to proceed in this circumstance? i.e. is this likely a configuration problem with Windows 10 and the GTX 1080 or a bad driver (I’m using latest drivers for the 1080)?

tera · December 12, 2016, 12:29pm

What happens if you leave out the cudaSetDevice(0) call?

Also I notice that you never check the err variable. That seems to suggest you’ve been running a debug build in the debugger. What happens if you add error checking code and run a release build without a debugger attached?

John_Smith_Lon · December 12, 2016, 12:44pm

Sadly doesn’t make any difference. Running release of below errors on the cudaMemPrefetchAsync call.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
int main()
{
	int* data;
	size_t len = 10;
	cudaError_t err = cudaMallocManaged(reinterpret_cast<void **>(&data), len, cudaMemAttachGlobal);
	if (err != 0) { printf("Error calling cudaMallocManaged"); }
	err = cudaMemPrefetchAsync(data, len, 0, 0);
	if (err != 0) { printf("Error calling cudaMemPrefetchAsync"); }
}

John_Smith_Lon · December 15, 2016, 8:57pm

The plot thickens…I’ve now got a Titan X Pascal running on a Alienware Amplifier with the laptop. Same error 10 occurs with managed memory. Again using latest NVidia drivers.

19CS19 · July 10, 2017, 10:41am

Hi,

it is a BUG of NVIDIA on Windows Systems witch occurs with PASCAL architecture.

I know this since a few days, but could not write it here because i was on vacation without internet connection.

For details see the comments of: https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/
where Mark Harris from NVIDIA confirms the Bug. It should be corrected with CUDA 9. He also tells that it should be communicated to Microsoft to help the caus. But i don’t found a suitable Microsoft Bug Report Page till now.

John_Smith_Lon · July 10, 2017, 8:15pm

Thanks, Nvidia confirmed this as well on my bug report. It is pretty frustrating that NVidia released CUDA 8.0 and the new managed memory framework without even bothering to test it on a Windows box. Especially annoying given the CUDA blog is raving about how great the new managed memory framework is - it was great until CUDA 8.0 broke it.

Managed memory worked perfectly before CUDA 8.0 and because of this bug I had to abandon managed memory and write tons of conversion routines to take c++ objects from cpu to gpu costing me about 2 weeks of development time.

Please NVidia can you test CUDA 9.0 properly before you release??

19CS19 · July 11, 2017, 6:35am

Hi John_Smith_Lon,

i am impaired by this bug since 7 months. 2 months i tried to fix it before i reported it as a bug.
But first with the parrallel for all post of Mark Harris i could be sure it is a bug. Nobody posted it anywhere. Only stupid questions if i am sure to have only one graphic card in my system.

At least a patch with a fallback for Pascal GPUs to Cuda 7.5 behavior for ManagedMemory without Paging NVIDIA could offer and a clear statement in the documentation that this feature do not work on windows os. At least i wouldn’t buy expensive usless Pascal cards. Going back to CUDA 7.5 is also no option because of many matrix methods of CUDA 8 i am using.

In the mean while i also affected by additional bugs in CUDA 8.0.61 using CUB instead of thrust, so yes they should test more and do less marketing for new features.

But i would like to have CUDA 9.0 as soon as possible, when the bug is fixed.

RS64 · December 4, 2017, 4:47pm

I am seeing this as well on Windows 10 version 1709 (OS Build 16299.64) using CUDA 9 (9.0.176) using with 1 x M5000 Quadro card Driver Version 385.54, if I run the same code posted above:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
int main()
{
	int* data;
	size_t len = 10;
	cudaError_t err = cudaMallocManaged(reinterpret_cast<void **>(&data), len, cudaMemAttachGlobal);
	if (err != 0) { printf("Error calling cudaMallocManaged"); }
	err = cudaMemPrefetchAsync(data, len, 0, 0);
	if (err != 0) { printf("Error calling cudaMemPrefetchAsync"); }
}

cudaMallocManaged() returns success, kernels execute fine (in other code).
If I add cudaGetDevice() it returns success, with device 0, cudaSetDevice() also returns success.

In both the original case and adding cudaSetDevice(), the result is the same.

cudaMemPrefetchAsync() always returns 10 (cudaErrorInvalidDevice)

Thoughts?

Robert_Crovella · December 4, 2017, 5:15pm

cudaMemPrefetchAsync requires a Pascal or newer device. You have a maxwell generation GPU. So that is an invalid device for that function call.

prefetching of memory is associated with the unified memory (UM) model applicable to Pascal and Volta generation gpus (i.e. demand paging). The pre-pascal UM regime does not have a concept of prefetching or paging.

if you query the pageableMemoryAccess property on your GPU, you’ll find that it is not set.

[url]Programming Guide :: CUDA Toolkit Documentation

RS64 · December 4, 2017, 6:13pm

While unfortunate, thank you kindly for the quick and detailed response!

RS64 · December 6, 2017, 4:45pm

Using a GTX1080 I also see the paging flag as false, which I would not expect.

if( cudaGetDeviceProperties( &deviceProp, deviceNum ) == cudaSuccess )
...

deviceProp shows pageableMemoryAccess as 0

  deviceProp	{name=0x0000002c844fef60 "GeForce GTX 1080" totalGlobalMem=8589934592 sharedMemPerBlock=49152 ...}	cudaDeviceProp

  name	0x0000002c844fef60 "GeForce GTX 1080"	char[256]
  totalGlobalMem	8589934592	unsigned __int64
  sharedMemPerBlock	49152	unsigned __int64
  regsPerBlock	65536	int
  warpSize	32	int
  memPitch	2147483647	unsigned __int64
  maxThreadsPerBlock	1024	int

  maxThreadsDim	0x0000002c844ff084 {1024, 1024, 64}	int[3]

  maxGridSize	0x0000002c844ff090 {2147483647, 65535, 65535}	int[3]
  clockRate	1860000	int
  totalConstMem	65536	unsigned __int64
  major	6	int
  minor	1	int
  textureAlignment	512	unsigned __int64
  texturePitchAlignment	32	unsigned __int64
  deviceOverlap	1	int
  multiProcessorCount	20	int
  kernelExecTimeoutEnabled	1	int
  integrated	0	int
  canMapHostMemory	1	int
  computeMode	0	int
  maxTexture1D	131072	int
  maxTexture1DMipmap	16384	int
  maxTexture1DLinear	134217728	int

  maxTexture2D	0x0000002c844ff0e4 {131072, 65536}	int[2]

  maxTexture2DMipmap	0x0000002c844ff0ec {32768, 32768}	int[2]

  maxTexture2DLinear	0x0000002c844ff0f4 {131072, 65000, 2097120}	int[3]

  maxTexture2DGather	0x0000002c844ff100 {32768, 32768}	int[2]

  maxTexture3D	0x0000002c844ff108 {16384, 16384, 16384}	int[3]

  maxTexture3DAlt	0x0000002c844ff114 {8192, 8192, 32768}	int[3]
  maxTextureCubemap	32768	int

  maxTexture1DLayered	0x0000002c844ff124 {32768, 2048}	int[2]

  maxTexture2DLayered	0x0000002c844ff12c {32768, 32768, 2048}	int[3]

  maxTextureCubemapLayered	0x0000002c844ff138 {32768, 2046}	int[2]
  maxSurface1D	32768	int

  maxSurface2D	0x0000002c844ff144 {131072, 65536}	int[2]

  maxSurface3D	0x0000002c844ff14c {16384, 16384, 16384}	int[3]

  maxSurface1DLayered	0x0000002c844ff158 {32768, 2048}	int[2]

  maxSurface2DLayered	0x0000002c844ff160 {32768, 32768, 2048}	int[3]
  maxSurfaceCubemap	32768	int

  maxSurfaceCubemapLayered	0x0000002c844ff170 {32768, 2046}	int[2]
  surfaceAlignment	512	unsigned __int64
  concurrentKernels	1	int
  ECCEnabled	0	int
  pciBusID	8	int
  pciDeviceID	0	int
  pciDomainID	0	int
  tccDriver	0	int
  asyncEngineCount	2	int
  unifiedAddressing	1	int
  memoryClockRate	5005000	int
  memoryBusWidth	256	int
  l2CacheSize	2097152	int
  maxThreadsPerMultiProcessor	2048	int
  streamPrioritiesSupported	1	int
  globalL1CacheSupported	1	int
  localL1CacheSupported	1	int
  sharedMemPerMultiprocessor	98304	unsigned __int64
  regsPerMultiprocessor	65536	int
  managedMemory	1	int
  isMultiGpuBoard	0	int
  multiGpuBoardGroupID	0	int
  hostNativeAtomicSupported	0	int
  singleToDoublePrecisionPerfRatio	32	int
  pageableMemoryAccess	0	int
  concurrentManagedAccess	0	int
  computePreemptionSupported	0	int
  canUseHostPointerForRegisteredMem	0	int
  cooperativeLaunch	0	int
  cooperativeMultiDeviceLaunch	0	int
  sharedMemPerBlockOptin	0	unsigned __int64

Perhaps I am still missing something?

njuffa · December 6, 2017, 5:41pm

Are you on Windows by any chance? If so, check the documentation whether demand paging is supported with the WDDM driver. The WDDM driver puts Windows in charge of managing GPU memory, which has all kind of “interesting” ramifications, most of them not favorable to CUDA users.

Robert_Crovella · December 6, 2017, 5:52pm

Once more unto the docs:

“Note that currently these features are only supported on Linux operating systems. Applications running on Windows (whether in TCC or WDDM mode) or macOS will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.”

RS64 · December 6, 2017, 6:09pm

Yup, Windows 10. Thanks for the info.

John_Smith_Lon · January 4, 2018, 12:28am

Hi John_Smith_Lon,

i am impaired by this bug since 7 months. 2 months i tried to fix it before i reported it as a bug.
But first with the parrallel for all post of Mark Harris i could be sure it is a bug. Nobody posted it anywhere. Only stupid questions if i am sure to have only one graphic card in my system.

At least a patch with a fallback for Pascal GPUs to Cuda 7.5 behavior for ManagedMemory without Paging NVIDIA could offer and a clear statement in the documentation that this feature do not work on windows os. At least i wouldn’t buy expensive usless Pascal cards. Going back to CUDA 7.5 is also no option because of many matrix methods of CUDA 8 i am using.

In the mean while i also affected by additional bugs in CUDA 8.0.61 using CUB instead of thrust, so yes they should test more and do less marketing for new features.

But i would like to have CUDA 9.0 as soon as possible, when the bug is fixed.

Does anyone know when unified memory will finally be supported on Windows? It seems pretty sad this great feature is denied to a common platform.

19CS19 · January 4, 2018, 7:25am

Posssibly it will work now with Cuda 9.1.

I didn’t try it out yet. But i get an email from nvidia to my original bug report to them. The status of the bug is changed from “Open - Fix being tested” to “Closed - Fixed”

Topic		Replies	Views
Always got this warning when nvprof cuda file "This can happen if device ran out of memory or if a device kernel was stopped due to an assertion" on just HellowWorld GPU CUDA Programming and Performance	9	2557	January 31, 2019
Unified Memory for CUDA Beginners Technical Blog	46	2564	December 1, 2023
CUDA very slow performance CUDA Programming and Performance	21	16740	March 6, 2020
CUDA 8 Features Revealed Technical Blog	51	863	November 8, 2018
why "all CUDA-capable devices are busy or unavailable" ? CUDA Programming and Performance	34	64341	April 20, 2011
CUDA 2.1 discussion CUDA Programming and Performance	71	63941	February 17, 2009
why cudaGetDeviceProperties and cudaMallocPitch consume a lot of time CUDA Programming and Performance	18	2366	January 9, 2017
deviceQuery passes and then fails CUDA Setup and Installation	4	2150	July 6, 2016
Install Problem CUDA Programming and Performance	32	12706	December 17, 2009
CUDA Toolkit 3.0 released CUDA Programming and Performance	62	26031	September 21, 2010

cudaMemPrefetchAsync returns cudaErrorInvalidDevice

Related topics