Compiling error in Ubuntu 14.04, CUDA Toolkit v7.5

I recently updated my CUDA Toolkit from v5.0 to v7.5 so that I could take advantage of a new CUBLAS library function, cublasDgelsBatched(). My code is written in C with the MEX gateway function for MATLAB command line use. However, compilation appears to fail at the linking step. I wrote a simple program, included here, to try to isolate the problem. Here’s what I did at the Ubuntu 14.04 command prompt:

First I made sure to include relevant directories in my PATH. Then:

nvcc -c -I$MATLAB_ROOT/extern/include -DMATLAB_MEX_FILE -gencode=arch=compute_20,code=sm_20 --compiler-options=-ansi,-D_GNU_SOURCE,-fPIC,-fno-omit-frame-pointer,-pthread -DMX_COMPAT_32 -O -DNDEBUG ""

g++ -O -pthread -shared -Wl,--version-script,/usr/local/matlab/extern/lib/glnxa64/ -Wl,--no-undefined -o "MatMul.mexa64" MatMul.o -Wl,-rpath-link,/usr/local/matlab/bin/glnxa64 -L/usr/local/matlab/bin/glnxa64 -lmx -lmex -L/usr/local/cuda/lib64 -lcudart

This works, but when I call “MatMul” at the MATLAB command line, I get this error:
Invalid MEX-file ‘/mnt/data0/elliot/MatMul.mexa64’: cannot open shared object file: No such file or directory
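
For anyone hitting the same message: one way to see which shared library the loader cannot resolve is to run ldd on the built MEX file from the same shell that starts MATLAB. The sketch below assumes the file is named MatMul.mexa64 as in the g++ command above and sits in the current directory; missing_libs is a hypothetical helper name, not part of any tool.

```shell
# List the dependencies of a binary that the dynamic loader cannot resolve.
# MEX_FILE is the output name used in the g++ command in this thread.
MEX_FILE="MatMul.mexa64"

# missing_libs is a hypothetical helper for illustration.
missing_libs() {
    # ldd prints "<library> => not found" for every unresolved dependency
    ldd "$1" 2>/dev/null | grep "not found"
}

if [ -f "$MEX_FILE" ]; then
    missing_libs "$MEX_FILE" || echo "all dependencies resolved"
else
    echo "$MEX_FILE not found in $(pwd)"
fi
```

Any line reported as "not found" names a library whose directory is not on the loader's search path in that environment.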

On the other hand, if I change my path to include library files from CUDA Toolkit v5.0 (i.e. nvcc &, then I am able to compile AND RUN MatMul properly. These are the commands to do it - notice that since my version of MATLAB was compiled against CUDA Toolkit 5.0, the file is in a matlab directory.

nvcc -c -I/usr/local/matlab/extern/include -DMATLAB_MEX_FILE -gencode=arch=compute_20,code=sm_20 --compiler-options=-ansi,-D_GNU_SOURCE,-fPIC,-fno-omit-frame-pointer,-pthread -DMX_COMPAT_32 -O -DNDEBUG ""

g++ -O -pthread -shared -Wl,--version-script,/usr/local/matlab/extern/lib/glnxa64/ -Wl,--no-undefined -o "MatMul.mexa64" MatMul.o -Wl,-rpath-link,/usr/local/matlab/bin/glnxa64 -L/usr/local/matlab/bin/glnxa64 -lmx -lmex /usr/local/matlab/bin/glnxa64/

Question 1) why does the first approach not work, while the second does?

Question 2) in approach 2, why do I not have to place a -l in front of the full path /usr/local/matlab/bin/glnxa64/ to instruct the linker to link it? In fact, prepending -l to that path causes an ld error.

I forgot to include my code. Here is…

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "mex.h"

/* Thread block size*/
#define BLOCK_SIZE 16

/* Matrices are stored in row-major order:*/
/* M(row, col) = *(M.elements + row * M.stride + col)*/
typedef struct {
int width;
int height;
int stride;
double *elements;
} Matrix;

/* Forward declaration of MatMul function */
void MatMul(const Matrix A, const Matrix B, Matrix C);

/* Forward declaration of the matrix multiplication kernel*/
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

/* mex interface function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize mA, mB, nA, nB;
    int Stride;
    Matrix A, B, C;

    /* Throw errors if input does not have correct format*/
    /* (checked before prhs is dereferenced)*/
    if (nrhs != 2)
        mexErrMsgTxt("Two inputs (A and B) are required.");
    else if (nlhs > 1)
        mexErrMsgTxt("Too many output arguments.");

    if (!mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) ||
        !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1]))
        mexErrMsgTxt("Inputs must be noncomplex double.");

    mA = mxGetM(prhs[0]);
    mB = mxGetM(prhs[1]);
    nA = mxGetN(prhs[0]);
    nB = mxGetN(prhs[1]);
    Stride = nA/BLOCK_SIZE;

    if (nA != mB)
        mexErrMsgTxt("Number of columns in A must equal number of rows in B.");

    plhs[0] = mxCreateDoubleMatrix(mA, nB, mxREAL);

    A.width = nA;  A.height = mA;  A.stride = Stride;  A.elements = mxGetPr(prhs[0]);
    B.width = nB;  B.height = mB;  B.stride = Stride;  B.elements = mxGetPr(prhs[1]);
    C.width = nB;  C.height = mA;  C.stride = Stride;  C.elements = mxGetPr(plhs[0]);

    MatMul(A, B, C);
}


/* Get a matrix element*/
__device__ double GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

/* Set a matrix element*/
__device__ void SetElement(Matrix A, int row, int col,
                           double value)
{
    A.elements[row * A.stride + col] = value;
}

/* Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is*/
/* located col sub-matrices to the right and row sub-matrices down*/
/* from the upper-left corner of A*/
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                + BLOCK_SIZE * col];
    return Asub;
}

/* Matrix multiplication - Host code*/
/* Matrix dimensions are assumed to be multiples of BLOCK_SIZE*/
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    /* Load A and B to device memory*/
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(double);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(double);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    /* Allocate C in device memory*/
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(double);
    cudaMalloc(&d_C.elements, size);

    /* Invoke kernel*/
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    /* Read C from device memory*/
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    /* Free device memory*/
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

/* Matrix multiplication kernel called by MatMul()*/
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    /* Block row and column*/
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;

    /* Each thread block computes one sub-matrix Csub of C*/
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);

    /* Each thread computes one element of Csub*/
    /* by accumulating results into Cvalue*/
    double Cvalue = 0;

    /* Thread row and column within Csub*/
    int row = threadIdx.y;
    int col = threadIdx.x;

    /* Loop over all the sub-matrices of A and B that are*/
    /* required to compute Csub*/
    /* Multiply each pair of sub-matrices together*/
    /* and accumulate the results*/
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {

        /* Get sub-matrix Asub of A*/
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        /* Get sub-matrix Bsub of B*/
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        /* Shared memory used to store Asub and Bsub respectively*/
        __shared__ double As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ double Bs[BLOCK_SIZE][BLOCK_SIZE];

        /* Load Asub and Bsub from device memory to shared memory*/
        /* Each thread loads one element of each sub-matrix*/
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        /* Synchronize to make sure the sub-matrices are loaded*/
        /* before starting the computation*/
        __syncthreads();

        /* Multiply Asub and Bsub together*/
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        /* Synchronize to make sure that the preceding*/
        /* computation is done before loading two new*/
        /* sub-matrices of A and B in the next iteration*/
        __syncthreads();
    }

    /* Write Csub to device memory*/
    /* Each thread writes one element*/
    SetElement(Csub, row, col, Cvalue);
}


I think you answered question one yourself. MATLAB was compiled against CUDA 5.0, so the LD_LIBRARY_PATH variable probably points there. You have to update this variable either in Ubuntu or in MATLAB.

Question 2 I’m not so sure about.
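
For what it's worth, the usual GNU ld behavior may explain question 2: -lNAME is not a file name but a request to search the -L directories for libNAME.so and then libNAME.a, while a full path given without -l is simply opened as an ordinary input file. Prepending -l to a full path would therefore make ld search for a library whose name starts with "lib/usr/...", which does not exist. The sketch below only models that name expansion in plain shell; lib_candidates is a hypothetical helper, not an ld feature.

```shell
# Sketch of how GNU ld expands "-lNAME": in each -L directory it tries
# libNAME.so first, then libNAME.a.
# lib_candidates is a hypothetical helper for illustration only.
lib_candidates() {
    name="$1"; shift
    for dir in "$@"; do
        printf '%s/lib%s.so\n' "$dir" "$name"
        printf '%s/lib%s.a\n' "$dir" "$name"
    done
}

# -lcudart combined with -L/usr/local/cuda/lib64, as in the first g++ command:
lib_candidates cudart /usr/local/cuda/lib64

# A full path passed without -l is not expanded at all;
# the linker opens that file directly.
```

The first candidate that exists on disk is the one the linker would use.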

Thanks inJeans. You are correct. I actually solved this problem in the interim, and it was a problem with setting LD_LIBRARY_PATH. The solution is to set this variable with the export command from the bash command line. It was not working before because I was using the MATLAB setenv() command to do this, and the MEX script will overwrite LD_LIBRARY_PATH with ‘null’ if it’s done this way.
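
A minimal sketch of that fix, assuming the CUDA 7.5 libraries live in /usr/local/cuda/lib64 (the directory passed to -L in the first g++ command; adjust it if your install differs). CUDA_LIB_DIR is just an illustrative variable name.

```shell
# Prepend the CUDA library directory to LD_LIBRARY_PATH in the shell
# that will launch MATLAB.
# Assumption: /usr/local/cuda/lib64 holds the CUDA runtime libraries.
CUDA_LIB_DIR="/usr/local/cuda/lib64"

# ${VAR:+...} adds the ":" separator only if LD_LIBRARY_PATH is already
# set and non-empty, so no empty path entry is created.
export LD_LIBRARY_PATH="${CUDA_LIB_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"

# Start MATLAB from this same shell afterwards so that it inherits
# the variable.
```

The variable only applies to this shell and the processes started from it, so MATLAB has to be launched from the same terminal session.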

Hi, I have a question about this.

I ran into the “Invalid MEX-file” issue, too.

But I do not know where I can set the LD_LIBRARY_PATH variable. I am not familiar with this variable. Can you explain more about it?

Thank you so much.

LD_LIBRARY_PATH is what is known as an environment variable. How you set one of those will very much depend on your environment. For example, on a Linux system with bash shell you would use export to set it, in the command shell of Windows you would use set to set it. There may also be permanent ways to set it, e.g. by adding it to the startup script for your Linux system or in Windows under Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables