I recently ported some of my code from CUDA 1.1 on Linux to CUDA 2.0 on a Mac, and I am now having a problem when running it.

I do not get any compile-time errors, but I get wrong run-time output because of a problem with cudaMemcpy. To simplify the problem, consider the code snippet below (note that I am not invoking any kernel at all):

float *a, *b;

a = (float *) malloc(sizeof(float) * (N+1) * (N+1));
for (int ii = 0; ii <= N; ii++) {
    for (int jj = 0; jj <= N; jj++) {
        a[ii + jj*(N+1)] = 1.0;
    }
}

cudaMalloc((void**)&b, sizeof(float)*(N+1)*(N+1));
cudaMemcpy(b, a, sizeof(float)*(N+1)*(N+1), cudaMemcpyHostToDevice);

float *c;
c = (float *) malloc(sizeof(float) * (N+1) * (N+1));
for (int ii = 0; ii <= N; ii++) {
    for (int jj = 0; jj <= N; jj++) {
        c[ii + jj*(N+1)] = 2.0;
    }
}

cudaMemcpy(c, b, sizeof(float)*(N+1)*(N+1), cudaMemcpyDeviceToHost);

int i, j;
printf(" after is...\n ");
for (i = 0; i <= N; i++) {
    for (j = 0; j <= N; j++) {
        printf(" %f ", c[(i*(N+1)) + j]);
    }
    printf("\n");
}

The expected output is a matrix with all entries 1.0, since b should hold a copy of a after the host-to-device copy. Instead I get the original c matrix with all entries 2.0, as if the device-to-host copy never happened. Can someone help me with this?
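I have not been checking the return codes of the CUDA calls, so the copies may be failing silently. A minimal error-checking version of the same round trip might look like the sketch below (the checkCuda helper and its messages are my own, not part of the CUDA API; only cudaError_t and cudaGetErrorString come from the runtime):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Sketch of a helper: abort with a message if a CUDA runtime call fails.
    static void checkCuda(cudaError_t err, const char *what) {
        if (err != cudaSuccess) {
            fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    int main() {
        const int N = 3;
        const size_t bytes = sizeof(float) * (N+1) * (N+1);

        float *a = (float *) malloc(bytes);
        float *c = (float *) malloc(bytes);
        for (int i = 0; i < (N+1)*(N+1); i++) { a[i] = 1.0f; c[i] = 2.0f; }

        float *b;
        checkCuda(cudaMalloc((void**)&b, bytes), "cudaMalloc");
        checkCuda(cudaMemcpy(b, a, bytes, cudaMemcpyHostToDevice), "H2D copy");
        checkCuda(cudaMemcpy(c, b, bytes, cudaMemcpyDeviceToHost), "D2H copy");

        // If both copies succeed, every entry of c should now be 1.0.
        printf("c[0] = %f\n", c[0]);

        cudaFree(b);
        free(a);
        free(c);
        return 0;
    }

If one of the copies is failing on the Mac/CUDA 2.0 setup, this should at least print which call fails and why, instead of silently leaving c untouched.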

Thanks.