Compute column means and variances in a large 2D array using CUDA/MATLAB -- PLEASE HELP!

Dear CUDA gurus,

I am a novice at CUDA and GPU parallel programming (sorry in advance for any silly questions).
I want to parallelize several of my MEX files that compute various univariate statistics on large 2D arrays, separately for each column. To that end, I wrote a little test program that should calculate the column means of a 2D array. However, it doesn't work:

#include "mex.h"
#include "cuda.h"

/* ************* kernel ************* */

__global__ void means(float* in, float* out, int n, int m)
{
    int jdx;
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    out[idx] = 0;
    if (idx < n)
    {
        for (jdx = 0; jdx < m; ++jdx)
            out[idx] += in[idx*m + jdx];
        out[idx] /= m;
    }
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int dims0[2];
    float *in, *out;
    float *gin, *gout;
    int m = mxGetM(prhs[0]);
    int n = mxGetN(prhs[0]);
    int sizeMN = m*n*sizeof(float);
    int sizeN = n*sizeof(float);

    dims0[0] = n; dims0[1] = 1;
    plhs[0] = mxCreateNumericArray(2, dims0, mxSINGLE_CLASS, mxREAL);
    out = (float*) mxGetData(plhs[0]);
    in = (float*) mxGetData(prhs[0]);

    // Allocate space on the GPU
    // Input array
    cudaMalloc((void**)&gin, sizeMN);
    // Output array
    cudaMalloc((void**)&gout, sizeN);

    // Copy array to GPU
    cudaMemcpy(gin, in, sizeMN, cudaMemcpyHostToDevice);

    dim3 dimBlock(256);
    dim3 dimGrid((m*n)/dimBlock.x);
    if ((n*m) % 256 != 0) dimGrid.x += 1;

    // Compute kernel
    means<<<dimGrid,dimBlock>>>(gin, gout, n, m);
    cudaMemcpy(out, gout, sizeN, cudaMemcpyDeviceToHost);

    // Free allocated memory
    cudaFree(gin);
    cudaFree(gout);
}

…obviously, this doesn't work, but I have no idea how to write the kernel so that the computation is done correctly. Do I have to distribute the N columns of the array across N blocks of M threads each?

Any help would be greatly appreciated!!!

Best,
NikosK

I think the most obvious solution is to use one thread for each column (I'm not sure whether N or M is the number of columns here), i.e. dimBlock.x = N (or M), dimBlock.y = 1, dimBlock.z = 1.
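
For concreteness, a minimal sketch of that launch, assuming n is the number of columns and is small enough to fit in a single block (the 512-threads-per-block hardware limit applies); for larger n you would spread the columns over several blocks:

// Hypothetical host-side launch: one thread per column, one block.
// Assumes n <= 512, the per-block thread limit; gin, gout, n and m
// are the same variables as in your mexFunction.
dim3 dimBlock(n, 1, 1);
dim3 dimGrid(1, 1, 1);
means<<<dimGrid, dimBlock>>>(gin, gout, n, m);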

In the kernel you should use shared memory, and avoid writing to global memory on every iteration. Even without shared memory, the code below should be faster than yours; I don't think it results in coalesced reads from memory, though (see the shared-memory sketch after the kernel).

__global__ void means(float* in, float* out, int n, int m)
{
    int jdx;
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    float mean = 0;

    if (idx < n)
    {
        for (jdx = 0; jdx < m; ++jdx)
            mean += in[idx*m + jdx];
        // accumulate in a register, write to global memory only once,
        // and only for the threads that actually map to a column
        out[idx] = mean/m;
    }
}
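
If you want to try the shared-memory route with coalesced reads, here is a rough, untested sketch of one way to do it (the name means_shared and the fixed block size of 256 are my own choices, not anything from your code). Since MATLAB stores arrays column-major, column idx occupies in[idx*m] through in[idx*m + m - 1], so letting the threads of one block stride down a single column makes consecutive threads read consecutive addresses:

__global__ void means_shared(float* in, float* out, int n, int m)
{
    __shared__ float partial[256];   // one slot per thread in the block

    int col = blockIdx.x;            // one block per column
    int tid = threadIdx.x;

    // Strided partial sum: consecutive threads read consecutive
    // addresses of the column-major array, so the reads coalesce.
    float sum = 0.0f;
    for (int row = tid; row < m; row += blockDim.x)
        sum += in[col*m + row];
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory
    // (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x/2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[col] = partial[0] / m;
}

You would launch it with one block per column, e.g. means_shared<<<n, 256>>>(gin, gout, n, m).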

Dear wanderine,

thank you very much! You really helped me!

I would like to test the shared memory strategy you mentioned, but I don’t know how to properly set up my variables for shared memory access on the GPU.

Do I have to say …

__shared__ float gin[m][n];

__shared__ float gout[1][n];

… on the host side of my code?

… and what do you mean by "coalesced reading from memory"?

Sorry for bothering you with more questions!

Best,

NikosK

Chapter 5 of the CUDA programming guide contains the answers to all these questions, plus a few other things you will need to know.
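
In the meantime, a rough illustration of what coalescing means here (the variable names col and j are made up): with MATLAB's column-major layout, reads coalesce when consecutive threads touch consecutive addresses.

// Coalesced: thread t reads row t of one fixed column; the
// addresses col*m + 0, col*m + 1, ... are contiguous.
float a = in[col*m + threadIdx.x];

// Not coalesced: thread t reads a fixed row j of column t;
// consecutive threads are m floats apart, so the hardware issues
// one memory transaction per thread instead of one for the group.
float b = in[threadIdx.x*m + j];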