cudaHostGetDevicePointer() and Zero-Copy

Hi there,

I’ve just started working with the 2.2 beta toolkit, and have a question about the new zero-copy access. How can I detect at run-time whether it’s available for the current device, so that I can either use it or fall back to the more traditional cudaMemcpyAsync()? I’ve tried performing a cudaHostAlloc() followed by a cudaHostGetDevicePointer(), but both always return cudaSuccess.

Thanks,
Peter

Check the canMapHostMemory member of the cudaDeviceProp structure - it will be nonzero if the device can map pinned system memory.

Driver API apps may call cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY.
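For reference, the run-time check might look roughly like this (a minimal sketch against the 2.2 runtime API; error handling elided, device 0 assumed):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* query device 0 */

    if (prop.canMapHostMemory) {
        printf("Device can map pinned host memory - zero-copy is available\n");
    } else {
        printf("No zero-copy support - fall back to cudaMemcpyAsync()\n");
    }

    /* Driver API equivalent:
       int canMap = 0;
       cuDeviceGetAttribute(&canMap,
                            CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY, dev);
    */
    return 0;
}
```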

I’m afraid that the 32-bit 2.2 beta toolkit version of cudaDeviceProp does not have a canMapHostMemory member (or anything close to it). Is this something in the 64-bit toolkit? Also, the CUdevice_attribute type does not have a CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY member. Is this a feature that will be coming out with the final version of the toolkit? If so I can wait for it.

Finally, is the zero-copy feature pinned to a specific CUDA device version like 1.3? I can also check the version, but I did not see it in the features list.

Thanks,

Peter

Very weird. This flag is definitely in the Linux 64-bit toolkit: line 268 in cuda/include/driver_types.h. I haven’t actually tried whether it works, though.

On normal discrete GPUs, only G200 supports zero-copy. Some of the integrated chips do support it, too: see http://forums.nvidia.com/index.php?showtopic=92290

In the 32-bit Windows driver_types.h, line 268 is in the __cudaReserved[38] line:

/**
 * CUDA device properties
 */
/*DEVICE_BUILTIN*/
struct cudaDeviceProp
{
    char   name[256];                 ///< ASCII string identifying device
    size_t totalGlobalMem;            ///< Global memory available on device in bytes
    size_t sharedMemPerBlock;         ///< Shared memory available per block in bytes
    int    regsPerBlock;              ///< 32-bit registers available per block
    int    warpSize;                  ///< Warp size in threads
    size_t memPitch;                  ///< Maximum pitch in bytes allowed by memory copies
    int    maxThreadsPerBlock;        ///< Maximum number of threads per block
    int    maxThreadsDim[3];          ///< Maximum size of each dimension of a block
    int    maxGridSize[3];            ///< Maximum size of each dimension of a grid
    int    clockRate;                 ///< Clock frequency in kilohertz
    size_t totalConstMem;             ///< Constant memory available on device in bytes
    int    major;                     ///< Major compute capability
    int    minor;                     ///< Minor compute capability
    size_t textureAlignment;          ///< Alignment requirement for textures
    int    deviceOverlap;             ///< Device can concurrently copy memory and execute a kernel
    int    multiProcessorCount;       ///< Number of multiprocessors on device
    int    kernelExecTimeoutEnabled;  ///< Specified whether there is a run time limit on kernels
    int    integrated;                ///< Device is integrated as opposed to discrete
    int    __cudaReserved[38];
};

Hopefully, NVIDIA has fixed the problem in 2.2 final. If not, hopefully Tim is reading this and will double-check it before release :)

Yes, it is fixed in final.

In 2.2 beta, you only have the integrated device property (e.g., “is this MCP79 and can therefore do copy elimination as part of zero-copy”). See the big zero-copy thread in the other forum if copy elimination doesn’t mean anything to you.

We noticed the glaring oversight too late for 2.2 beta, but in final there is also canMapHostMemory, which will be true on MCP79 + Compute 1.2 or greater.
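Putting the thread together, once 2.2 final is available the full zero-copy sequence would look roughly like this (a sketch, assuming a device that reports canMapHostMemory; error checks elided):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {
        fprintf(stderr, "Device cannot map host memory; "
                        "fall back to cudaMemcpyAsync()\n");
        return 1;
    }

    /* Must be called before any CUDA context is created on the device. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Allocate mapped pinned host memory ... */
    float *h_ptr = NULL, *d_ptr = NULL;
    cudaHostAlloc((void **)&h_ptr, 1024 * sizeof(float), cudaHostAllocMapped);

    /* ... and get the device-side alias for it. */
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);

    /* d_ptr can now be passed to kernels; accesses travel over the bus
       (or are eliminated entirely on integrated MCP79 parts). */

    cudaFreeHost(h_ptr);
    return 0;
}
```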