Problem with cudaMemcpy on Mac

I recently ported some of my code from Linux CUDA 1.1 to Mac CUDA 2.0, and I am having a problem with it at run time.

I do not get any compile-time errors, but I get wrong run-time output because of a problem with cudaMemcpy. To simplify the problem, consider the code snippet below (note that I am not invoking a kernel at all):

float *a, *b;
a = (float *) malloc( sizeof(float) * (N+1) * (N+1) );

for (int ii = 0; ii <= N; ii++) {
    for (int jj = 0; jj <= N; jj++) {
        a[ii+jj*(N+1)] = 1.0;
    }
}

cudaMalloc((void**)&b, sizeof(float) * (N+1) * (N+1));

cudaMemcpy(b, a, sizeof(float) * (N+1) * (N+1), cudaMemcpyHostToDevice);

float *c;
c = (float *) malloc( sizeof(float) * (N+1) * (N+1) );
for (int ii = 0; ii <= N; ii++) {
    for (int jj = 0; jj <= N; jj++) {
        c[ii+jj*(N+1)] = 2.0;
    }
}

cudaMemcpy(c, b, sizeof(float) * (N+1) * (N+1), cudaMemcpyDeviceToHost);

int i, j;
printf(" after is...\n ");
for (i = 0; i <= N; i++) {
    for (j = 0; j <= N; j++) {
        printf(" %f ", c[(i*(N+1))+j]);
    }
    printf("\n");
}


The expected output is a matrix with all entries 1.0, but instead I get the original matrix with all entries 2.0. Can someone help me with this?

Thanks.

By the way, it works fine in emulation mode and I get the proper output (a matrix with all entries 1.0).
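Note that the snippet never checks the return codes of the CUDA calls, so a failure (for example, no usable device) passes silently and the host buffer simply keeps its 2.0 values. Below is a minimal sketch of how the same calls could be wrapped with error checking; the CHECK_CUDA macro is a hypothetical helper, not part of the original code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message when a CUDA call fails.
#define CHECK_CUDA(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

int main(void)
{
    const int N = 16;
    float *a = (float *) malloc(sizeof(float) * (N+1) * (N+1));
    float *b = NULL;

    for (int i = 0; i < (N+1)*(N+1); i++)
        a[i] = 1.0f;

    // With no CUDA-capable device these calls fail loudly instead of
    // silently leaving the destination buffer untouched.
    CHECK_CUDA(cudaMalloc((void**)&b, sizeof(float) * (N+1) * (N+1)));
    CHECK_CUDA(cudaMemcpy(b, a, sizeof(float) * (N+1) * (N+1),
                          cudaMemcpyHostToDevice));

    CHECK_CUDA(cudaFree(b));
    free(a);
    printf("all CUDA calls succeeded\n");
    return 0;
}

If the runtime cannot find a device, cudaMalloc fails and the program reports it immediately instead of printing a seemingly untouched matrix.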

Please provide a complete test app which reproduces the problem.

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cutil.h>

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, char** argv)
{
    int N = 16;

    float *a, *b;
    a = (float *) malloc( sizeof(float) * (N+1) * (N+1) );
    for (int ii = 0; ii <= N; ii++) {
        for (int jj = 0; jj <= N; jj++) {
            a[ii+jj*(N+1)] = 1.0;
        }
    }

    cudaMalloc((void**)&b, sizeof(float) * (N+1) * (N+1));
    cudaMemcpy(b, a, sizeof(float) * (N+1) * (N+1), cudaMemcpyHostToDevice);

    float *c;
    c = (float *) malloc( sizeof(float) * (N+1) * (N+1) );
    for (int ii = 0; ii <= N; ii++) {
        for (int jj = 0; jj <= N; jj++) {
            c[ii+jj*(N+1)] = 2.0;
        }
    }

    cudaMemcpy(c, b, sizeof(float) * (N+1) * (N+1), cudaMemcpyDeviceToHost);

#ifdef DEBUG
    int i, j;
    printf(" after is...\n ");
    for (i = 0; i <= N; i++) {
        for (j = 0; j <= N; j++) {
            printf("  %f  ", c[(i*(N+1))+j]);
        }
        printf("\n");
    }
#endif
}


This is it. Let me know if you need anything else.

Is anyone able to reproduce this problem?

Thanks.

All 1.0s for me here on a MacBook Pro. What hardware are you using?

I am using an NVIDIA GeForce 8800 GS with Mac OS X 10.5.2 and CUDA 2.0.

Your code worked OK on my Mac Pro with an 8800 GT.
What is the output of deviceQuery?
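For reference, here is a minimal sketch of the check deviceQuery performs: with the CUDA 2.x runtime, a placeholder device with compute capability 9999.9999 is reported when no real GPU is usable. This is not the original sample code, just an illustration:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // The CUDA 2.x runtime reports a single placeholder device with
        // compute capability 9999.9999 when no CUDA-capable GPU is found.
        if (prop.major == 9999 && prop.minor == 9999)
            printf("Device %d: \"%s\" is emulation only; no CUDA GPU.\n",
                   dev, prop.name);
        else
            printf("Device %d: \"%s\" (compute capability %d.%d)\n",
                   dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}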

There is no device supporting CUDA.

Device 0: "Device Emulation (CPU)"
  Major revision number:                         9999
  Minor revision number:                         9999
  Total amount of global memory:                 4294967295 bytes
  Number of multiprocessors:                     16
  Number of cores:                               128
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     1
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.35 GHz
  Concurrent copy and execution:                 No

Test PASSED

Press ENTER to exit...

Ouch… Why does this happen? My System Profiler shows an 8800 GS.

Try reinstalling the toolkit.
It should ask you to reboot (if not, check in the custom settings that you are loading the CUDA kernel module).

That worked! Thanks a lot, mfatica, for your reply. :)