Weird behavior of CUDA

Let tab_gpu be a float* array of N*M elements that I copy to the device memory of my 8800 GTX (768 MB GDDR3).
All my CUDA function does is set each element to 1.0, for example.
With M=N=512, there is no problem.
But with M=N=1024, tab_gpu comes back empty… I don’t understand why.
Thanks for your help!

Hi Vince, you might want to check your block configuration. The maximum number of threads per block is 512, and the maximum block dimensions are 512 in x and y and 64 in z. As far as I know (someone please correct me if I’m wrong), exceeding the maximum number of threads per block can cause the kernel not to launch at all. Check page 88 of the CUDA Programming Guide for more specifications!
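One way to check this from code rather than from the guide is sketched below. It is only a sketch: `launchConfigFits` and `lastLaunchOk` are hypothetical helper names, not part of CUDA. The runtime calls they rely on (`cudaGetLastError`, `cudaGetErrorString`) and the `cudaDeviceProp` fields (`maxThreadsPerBlock`, `maxThreadsDim`, `maxGridSize`) do exist, and the limits they report depend on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the launch configuration respects the limits
// stored in `p` (normally filled in by cudaGetDeviceProperties).
static bool launchConfigFits(const cudaDeviceProp& p, dim3 grid, dim3 block)
{
    // Total threads in one block must not exceed the per-block limit.
    unsigned long long threadsPerBlock =
        (unsigned long long)block.x * block.y * block.z;
    if (threadsPerBlock > (unsigned long long)p.maxThreadsPerBlock) return false;

    // Each block and grid dimension also has its own per-axis limit.
    const unsigned int b[3] = { block.x, block.y, block.z };
    const unsigned int g[3] = { grid.x, grid.y, grid.z };
    for (int i = 0; i < 3; ++i) {
        if (b[i] == 0 || g[i] == 0) return false;  // empty launches are invalid
        if (b[i] > (unsigned int)p.maxThreadsDim[i]) return false;
        if (g[i] > (unsigned int)p.maxGridSize[i]) return false;
    }
    return true;
}

// Hypothetical helper: call right after a kernel launch. A rejected launch
// configuration shows up here, not on the <<<...>>> line itself.
static bool lastLaunchOk(const char* kernelName)
{
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("%s launch failed: %s\n", kernelName, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

In a host function, one could fill a `cudaDeviceProp` once with `cudaGetDeviceProperties(&p, 0)`, test `launchConfigFits(p, grid, threads)` before the launch, and call `lastLaunchOk("myKernel")` right after it, so that a kernel that never ran does not go unnoticed.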

I think you’re exceeding the block size limit of 512 threads, but without the source code I can’t confirm this.

OK, let me be more precise:

#define BLOCK_DIM 256

//Setup execution parameters
dim3 grid(M*N / BLOCK_DIM, 1, 1);
dim3 threads(BLOCK_DIM, 1, 1);

So each block has 256 threads, with dimensions 256x1x1.

In this configuration, there is no problem with M=N=2048, but the problem appears when M=N=4096.

(Before, when it didn’t work with M=N=1024, BLOCK_DIM was 16.)

Actually, the code is a MATLAB MEX file. I’ve pasted it below:

#include <stdio.h>
#include "mex.h"
#include "cuda.h"

#define BLOCK_DIM 256

// CUDA function: each thread sets one element to 1.0
__global__ void computeDistance(float* tab)
{
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    tab[xIndex] = 1.0f;
}

// MEX function
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int M;
    int N;
    float* output_dev;   // host buffer, owned by the MATLAB output array
    float* output_gpu;   // device buffer
    int mem_size;
    int er;

    M = (int)mxGetScalar(prhs[0]);
    N = (int)mxGetScalar(prhs[1]);
    mem_size = M*N*sizeof(float);

    // Allocation of output array
    output_dev = (float *) mxGetPr(plhs[0]=mxCreateNumericMatrix(M*N,1,mxSINGLE_CLASS,mxREAL));
    er = cudaMalloc((void **) &output_gpu, mem_size);
    printf("Error code : %d\n",er);
    // Copy host to device
    er = cudaMemcpy(output_gpu, output_dev, mem_size, cudaMemcpyHostToDevice);
    printf("Error code : %d\n",er);

    // Setup execution parameters
    dim3 grid(M*N / BLOCK_DIM, 1, 1);
    dim3 threads(BLOCK_DIM, 1, 1);

    // Call GPU function
    computeDistance<<<grid, threads>>>(output_gpu);

    // Copy results back to host
    er = cudaMemcpy(output_dev, output_gpu, mem_size, cudaMemcpyDeviceToHost);
    printf("Error code : %d\n",er);
    // Clean-up memory on host and device
    // (the host array is returned to MATLAB, so only the device buffer is freed)
    cudaFree(output_gpu);
}

That’s simple to explain :)

4096*4096/256 = 65536, while the maximum grid size in each dimension is 65535. Your earlier case hits the same limit: 1024*1024/16 is also 65536.

You need to make the grid 2D to handle this.
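A minimal sketch of what a 2D grid could look like here, under a few assumptions: `makeGrid2D` and `MAX_GRID_DIM` are hypothetical names, the extra parameter `n` on `computeDistance` is an addition that is not in the posted code, and 65535 is the per-axis limit quoted above. The blocks are spread over x and y, the kernel flattens the 2D block index back into a linear one, and a bounds check covers the surplus blocks in the last row.

```cuda
#include <cuda_runtime.h>

#define BLOCK_DIM 256
#define MAX_GRID_DIM 65535u  // per-axis grid limit quoted in this thread (assumption)

// Hypothetical helper: spread the required blocks over x and y so that
// neither grid dimension exceeds MAX_GRID_DIM.
static dim3 makeGrid2D(unsigned int numElements)
{
    // Number of blocks needed, rounded up.
    unsigned int numBlocks =
        numElements / BLOCK_DIM + (numElements % BLOCK_DIM != 0 ? 1u : 0u);
    if (numBlocks == 0) return dim3(1, 1, 1);  // a grid dimension of 0 is invalid

    unsigned int gx = numBlocks < MAX_GRID_DIM ? numBlocks : MAX_GRID_DIM;
    unsigned int gy = (numBlocks + gx - 1) / gx;  // rows needed, rounded up
    return dim3(gx, gy, 1);
}

// Same kernel as in the post, plus an element count `n` (an assumption) so
// that threads in surplus blocks do not write out of bounds.
__global__ void computeDistance(float* tab, unsigned int n)
{
    unsigned int blockId = blockIdx.y * gridDim.x + blockIdx.x;  // flatten 2D block index
    unsigned int xIndex  = blockId * BLOCK_DIM + threadIdx.x;
    if (xIndex < n)
        tab[xIndex] = 1.0f;
}
```

With this, the launch would become something like `computeDistance<<<makeGrid2D(M*N), threads>>>(output_gpu, M*N);`. For M=N=4096 the grid comes out as 65535x2, so nearly half of the blocks do nothing; a more balanced split is possible, but this keeps the arithmetic simple.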

Thanks for your help!
I’m going to fix this right now.
Once again, thank you…