cudaHostAlloc not working: returning CUDA_ERROR_INVALID_IMAGE

Hello,

First, apologies if this should be in the forum for CUDA on Macs; I’m not sure whether this is a programming error or a Mac problem.

I’m attempting to use cudaHostAlloc to allocate host memory. Specifically, I’m trying to use the flag cudaHostAllocMapped to map host memory to the GPU - but I get the same error no matter what flags I pass to cudaHostAlloc.

The error I get when trying to use cudaHostAlloc is

cudaError_t CUDA_ERROR_INVALID_IMAGE

The error message string says ‘device kernel image is invalid’. Does anyone have a clue what this means? Googling for the error reveals no useful information at all.

Some details:

  • Using the C runtime API, not the driver API

  • Using a MacBook Pro: Mac OS X 10.6.5, 2.66 GHz Core 2 Duo, 4 GB 1067 MHz DDR RAM, GeForce 320M.

  • NVIDIA CUDA 3.2 [latest version at time of writing]

  • CUDA Driver Version: 3.2.17, GPU Driver Version: 1.6.24.17 (256.00.15f04) [latest at time of writing]

  • All the SDK examples compile and work correctly.

  • Of particular note: the bandwidthTest example (which uses cudaHostAllocMapped) works both with and without the pinned memory option.

  • I’m using MATLAB to run it (I have to — MATLAB precomputes the input data), so my CUDA code is a C MEX file.

  • Using the dynamic CUDA runtime library (I don’t know the syntax to compile/link a static one; tips appreciated)

  • Output from deviceQuery:

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce 320M"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    1.2
  Total amount of global memory:                 265027584 bytes
  Multiprocessors x Cores/MP = Cores:            6 (MP) x 8 (Cores/MP) = 48 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.95 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    Yes
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce 320M

This could be a whole load of things, but I think there are four possible sources of the problem:

  1. Bug in CUDA which only occurs on this device

  2. Launching this from within another process (MATLAB) interferes with cudaHostAlloc somehow

  3. Bug in my code

  4. Incorrect CUDA libraries being used

Does anyone have any insight into which of these is most likely to be the problem? Or whether it’s something else entirely?

I’ll include my code. Sorry, it’s possibly a bit confusing as there’s the MATLAB MEX interface too…

Thanks for any help you can offer

Tom Clark
/*
 * This is a mex file to perform batch 3D FFTs in MATLAB
 */

#include "math.h"
#include "mex.h"
#include "cufft.h"
#include "cuda_runtime.h"
#include "matrix.h"
#include "stdlib.h"
#include "cutil_inline.h"

// Quick function to transfer data from MATLAB to host memory
void transferMxToHost(float *output_float, float *input_float, int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++) {
        output_float[i] = input_float[i];
    }
}

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    /* Declare variables */
    int M, N, P, batch, nDims;
    int *batchPtr;
    int fftSizeArray[3];
    float *inputPtr, *outputPtr;   // Pointers to MATLAB mxArray contents
    float *h_input;                // Host pointer for mapped memory
    cufftReal *d_input;            // Device pointer for mapped memory
    cufftComplex *d_inter;         // Device pointer for intermediate array
    cufftHandle planForward;
    cufftHandle planInverse;
    const mwSize *dimArray;
    cudaDeviceProp properties;

    /* Declare error flags */
    cufftResult fft_err = CUFFT_SUCCESS;
    cudaError_t cuda_err = cudaSuccess;

    // Check inputs OMITTED FOR CLARITY

    // Get array size in 3 dimensions
    dimArray = mxGetDimensions(prhs[0]);
    M = dimArray[0]; // number of rows in the input mxArray
    N = dimArray[1]; // number of cols in the input mxArray
    P = dimArray[2]; // number of pages in the input mxArray

    // Get number of batches
    batchPtr = (int *)mxGetData(prhs[1]);
    batch = batchPtr[0];

    // Get third dimension of transform size
    if ((P % batch) == 0) {
        P = P / batch; // N.B. integer division
    } else {
        mexErrMsgTxt("Third dimension of input array size must be an integer multiple of the number of batches.");
    }

    // Check, set and print device properties
    int device = 0;
    cudaGetDeviceProperties(&properties, device);
    if (!properties.canMapHostMemory) {
        mexPrintf("    mex_batchfft3: Cannot run: GPU incapable of mapping host memory. Check NVIDIA website for details.\n");
        mexErrMsgTxt("Unable to map host memory to device");
    } else {
        cuda_err = cudaSetDeviceFlags(cudaDeviceMapHost);
        if (cuda_err != cudaSuccess) {
            mexPrintf("        CUDA Runtime API Error reported : %s\n", cudaGetErrorString(cuda_err));
            mexErrMsgTxt("Failed to set device flag 'cudaDeviceMapHost'");
        }
    }

    // NB: Set the device AFTER setting the cudaDeviceMapHost flag
    cuda_err = cudaSetDevice(device);
    if (cuda_err != cudaSuccess) {
        mexPrintf("        CUDA Runtime API Error reported : %s\n", cudaGetErrorString(cuda_err));
        mexErrMsgTxt("Failed to set active device");
    }

    // Allocate local (host) memory to hold real input data,
    // mapped for transfer between host and device
    // **** THIS IS WHERE THE ERROR OCCURS ****
    cuda_err = cudaHostAlloc((void **)&h_input, M * N * P * batch * sizeof(float), cudaHostAllocMapped);
    if (cuda_err != cudaSuccess) {
        mexPrintf("        CUDA Runtime API Error reported : %s\n", cudaGetErrorString(cuda_err));
        mexErrMsgTxt("Failed to allocate host memory");
    }

    // Pointer to the real, single precision input data array
    inputPtr = (float *)mxGetData(prhs[0]);
    mexPrintf("    mex_batchfft3: Retrieved input pointer\n");

    // Copy data from the MATLAB mxArray to local (host) memory-mapped array
    transferMxToHost((float *)h_input, (float *)inputPtr, M * N * P * batch);
    mexPrintf("    mex_batchfft3: Copied data from MATLAB mxArray to local host device-mapped memory\n");

    // REST OF CODE OMITTED FOR CLARITY

    // Return control to MATLAB
    return;
}

Edit: changed the code a little to more clearly reflect the problem.

The problem is not related to Mac OS X; I don’t think it is possible to call cudaMallocHost or cudaHostAlloc inside a MEX file.

Inside a MEX file, you need to use mxMalloc.

Ah-ha!

Thank you, mfatica. After googling around the mxMalloc vs cudaHostAlloc issue, I found this very good thread:

http://forums.nvidia.com/index.php?showtopic=70731

It seems that you’re right. MATLAB won’t allow a different function to allocate host memory within a MEX file; the cited reason is the possibility of memory leaks, unsafe deallocation, etc.

It’s annoying, though, as surely you should be allowed to take the risk if you want to, and do the deallocation and ‘good housekeeping’ yourself. In this case it imposes a severe penalty on me, as I now need to run this on a GPU with 4 GB of memory (>£1000) instead of a less expensive one. That’s before I even mention the extra week or two of compute time it’ll take to process all my data…
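For anyone who hits the same wall: a minimal sketch of the workaround, assuming the variable names from the code above. Instead of mapping pinned host memory with cudaHostAlloc, read straight from the MATLAB-owned buffer returned by mxGetData and stage the data through ordinary device memory with cudaMalloc/cudaMemcpy. This is untested against this exact setup and loses the bandwidth advantage of pinned transfers, but it avoids allocating host memory behind MATLAB’s back:

```c
// Hypothetical replacement for the failing cudaHostAlloc call.
// Error handling trimmed for brevity; names match the code above.
cufftReal *d_input = NULL;
size_t nBytes = (size_t)M * N * P * batch * sizeof(float);

// mxGetData returns memory owned by MATLAB's allocator, which is
// safe to read from inside a MEX file.
float *inputPtr = (float *)mxGetData(prhs[0]);

// Pageable host-to-device copy: slower than a pinned transfer, but
// no cudaHostAlloc call is needed inside the MEX environment.
cuda_err = cudaMalloc((void **)&d_input, nBytes);
if (cuda_err != cudaSuccess)
    mexErrMsgTxt("Failed to allocate device memory");

cuda_err = cudaMemcpy(d_input, inputPtr, nBytes, cudaMemcpyHostToDevice);
if (cuda_err != cudaSuccess)
    mexErrMsgTxt("Failed to copy input data to device");

// ... run the cuFFT plans on d_input, copy results back with
// cudaMemcpy(..., cudaMemcpyDeviceToHost), then cudaFree(d_input).
```

The extra device allocation means the whole working set must fit in GPU memory at once, which is exactly the cost described above.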

Thanks for your quick response, and kind regards

Tom Clark