Hello,
First, apologies if this should be in the forum for CUDA on Macs; I'm not sure whether this is a programming error or a Mac problem.
I’m attempting to use cudaHostAlloc to allocate host memory. Specifically, I’m trying to use the flag cudaHostAllocMapped to map host memory to the GPU - but I get the same error no matter what flags I pass to cudaHostAlloc.
The error I get from cudaHostAlloc is cudaErrorInvalidImage (CUDA_ERROR_INVALID_IMAGE in driver API terms). The error string says 'device kernel image is invalid'. Does anyone have a clue what this means? Googling the error turns up nothing useful.
Some details:
- Using the C runtime API, not the driver API.
- MacBook Pro: OS X 10.6.5, 2.66 GHz Core 2 Duo, 4 GB 1067 MHz DDR RAM, GeForce 320M.
- NVIDIA CUDA 3.2 (latest version at time of writing).
- CUDA Driver Version: 3.2.17, GPU Driver Version: 1.6.24.17 (256.00.15f04) (latest at time of writing).
- All the SDK examples compile and work correctly.
- Of particular note, bandwidthTest (which uses cudaHostAllocMapped) works both with and without the pinned memory option.
- I'm using MATLAB to run it (I have to: MATLAB precomputes the input data), so my CUDA code is a C MEX file.
- Using the dynamic CUDA runtime library (I don't know the syntax to compile/link a static one; tips appreciated).
- Output from deviceQuery:
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: “GeForce 320M”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 265027584 bytes
Multiprocessors x Cores/MP = Cores: 6 (MP) x 8 (Cores/MP) = 48 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 0.95 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce 320M
This could be a whole load of things, but I think there are four possible sources of the problem:

- A bug in CUDA which only occurs on this device
- Launching this from within another process (MATLAB) interfering with cudaHostAlloc somehow
- A bug in my code
- Incorrect CUDA libraries being picked up at link or load time
Does anyone have any insight into which of these is most likely to be the problem? Or is it something else entirely?
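For what it's worth, one way to separate the MATLAB hypothesis from the others would be to run the same call sequence as a standalone binary outside MATLAB. Below is my untested sketch of such a repro (it also prints the driver/runtime versions the process actually loaded, which speaks to the library-mismatch hypothesis):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0;
    float *h_input = NULL;
    cudaError_t err;

    // Which driver/runtime did this process actually load?
    cudaDriverGetVersion(&driverVersion);   // e.g. 3020 for CUDA 3.2
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("driver %d, runtime %d\n", driverVersion, runtimeVersion);

    // Same call order as the MEX file: set flags before the device is set.
    err = cudaSetDeviceFlags(cudaDeviceMapHost);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        printf("cudaSetDevice: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The call that fails inside MATLAB.
    err = cudaHostAlloc((void **)&h_input, 1024 * sizeof(float),
                        cudaHostAllocMapped);
    printf("cudaHostAlloc: %s\n", cudaGetErrorString(err));

    if (err == cudaSuccess)
        cudaFreeHost(h_input);
    return (err == cudaSuccess) ? 0 : 1;
}
```

If this succeeds outside MATLAB but the MEX file still fails, that would point at the MATLAB process (e.g. a different libcudart on its library path) rather than the device or driver.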
I’ll include my code. Sorry, it’s possibly a bit confusing as there’s the MATLAB MEX interface too…
Thanks for any help you can offer
- Tom Clark
/*
* This is a mex file to perform batch 3d ffts in MATLAB
*
*/
#include <math.h>
#include <stdlib.h>
#include "mex.h"
#include "matrix.h"
#include "cufft.h"
#include "cuda_runtime.h"
#include "cutil_inline.h"
// Quick function to transfer data from MATLAB to host memory
void transferMxToHost( float *output_float,
                       float *input_float,
                       int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++)
    {
        output_float[i] = input_float[i];
    }
}
void mexFunction( int nlhs, mxArray *plhs[],
                  int nrhs, const mxArray *prhs[])
{
    /* Declare variables */
    int M, N, P, batch, nDims;
    int *batchPtr;
    int fftSizeArray[3];
    float *inputPtr, *outputPtr;  // Pointers to MATLAB mxArray contents
    float *h_input;               // Host pointer for mapped memory
    cufftReal *d_input;           // Device pointer for mapped memory
    cufftComplex *d_inter;        // Device pointer for intermediate array
    cufftHandle planForward;
    cufftHandle planInverse;
    const mwSize *dimArray;
    cudaDeviceProp properties;

    /* Declare error flags */
    cufftResult fft_err = CUFFT_SUCCESS;
    cudaError_t cuda_err = cudaSuccess;

    // Check inputs OMITTED FOR CLARITY

    // Get array size in 3 dimensions
    dimArray = mxGetDimensions(prhs[0]);
    M = dimArray[0]; // number of rows in the input mxArray
    N = dimArray[1]; // number of columns in the input mxArray
    P = dimArray[2]; // number of pages in the input mxArray

    // Get number of batches
    batchPtr = (int *)mxGetData(prhs[1]);
    batch = batchPtr[0];

    // Get third dimension of transform size
    if ((P % batch) == 0){
        P = P / batch; // N.B. integer division
    } else {
        mexErrMsgTxt("Third dimension of input array size must be an integer multiple of the number of batches.");
    }

    // Check, set and print device properties
    int device = 0;
    cudaGetDeviceProperties(&properties, device);
    if (!properties.canMapHostMemory){
        mexPrintf(" mex_batchfft3: Cannot run: GPU incapable of mapping host memory. Check NVIDIA website for details\n");
        mexErrMsgTxt("Unable to map host memory to device");
    } else {
        cuda_err = cudaSetDeviceFlags(cudaDeviceMapHost);
        if (cuda_err != cudaSuccess){
            mexPrintf(" CUDA Runtime API Error reported : %s\n", cudaGetErrorString(cuda_err));
            mexErrMsgTxt("Failed to set device flag 'cudaDeviceMapHost'");
        }
    }

    // N.B. Set the device AFTER setting the cudaDeviceMapHost flag
    cuda_err = cudaSetDevice(device);
    if (cuda_err != cudaSuccess){
        mexPrintf(" CUDA Runtime API Error reported : %s\n", cudaGetErrorString(cuda_err));
        mexErrMsgTxt("Failed to set active device");
    }

    // Allocate local (host) memory to hold real input data, mapped for transfer between host and device
    // **** THIS IS WHERE THE ERROR OCCURS ****
    cuda_err = cudaHostAlloc((void **)&h_input, N*M*P*sizeof(float)*batch, cudaHostAllocMapped);
    if (cuda_err != cudaSuccess){
        mexPrintf(" CUDA Runtime API Error reported : %s\n", cudaGetErrorString(cuda_err));
        mexErrMsgTxt("Failed to allocate host memory");
    }

    // Pointer to the real, single-precision input data array
    inputPtr = (float *)mxGetData(prhs[0]);
    mexPrintf(" mex_batchfft3: Retrieved input pointer\n");

    // Copy data from MATLAB mxArray to local (host) device-mapped memory
    transferMxToHost(h_input, inputPtr, N*M*P*batch);
    mexPrintf(" mex_batchfft3: Copied data from MATLAB mxArray to local host device-mapped memory\n");

    // REST OF CODE OMITTED FOR CLARITY

    // Return control to MATLAB
    return;
}
edit: changed code a little to more clearly reflect the problem.