cudaMemcpy sometimes doesn't work

I have a program that processes an image, and currently there is a very strange problem: the program works for some images but gives wrong output for others.

When I try to trace the program, it seems that cudaMemcpy() doesn’t copy the data correctly. If the program runs correctly, there is a line like this in the file “cuda_profile.log”:

method=[ memcopy ] gputime=[ 519.360 ]

The time is for copying an array with size 250000, which is the number of pixels in the image.

But if the output is wrong, the line looks like this:

method=[ memcopy ] gputime=[ 97.856 ]

That is the time for copying an array with size 300000, which again is the number of pixels in the image.

Does anyone know what could cause the problem?

Thank you in advance.

You are probably measuring not the cudaMemcpy() performance but your kernel’s performance. Kernel launches are asynchronous, so you must call cudaThreadSynchronize() before starting your timer, or else the synchronization between the CPU and GPU happens during the memcpy and takes up most of the time.
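To make that concrete, here is a minimal sketch of timing only the copy. The kernel processPixels and the float buffer are hypothetical stand-ins, not the poster's code. It calls cudaDeviceSynchronize(), which newer toolkits provide as the replacement for cudaThreadSynchronize(); on CUDA 2.0 the older name would be used instead.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the image-processing kernel.
__global__ void processPixels(float *pixels, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pixels[i] *= 0.5f;
}

int main()
{
    const int n = 250000;                 // pixel count from the question
    const size_t bytes = n * sizeof(float);
    float *hostBuf = new float[n]();      // zero-initialized host image
    float *devBuf = 0;

    cudaMalloc((void **)&devBuf, bytes);
    cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);

    // The launch returns immediately; the kernel may still be running.
    processPixels<<<(n + 255) / 256, 256>>>(devBuf, n);

    // Wait for the kernel to finish BEFORE starting the timer, so its
    // run time is not counted as part of the copy.
    // (On old toolkits such as CUDA 2.0: cudaThreadSynchronize().)
    cudaDeviceSynchronize();

    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost);
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("device-to-host copy: %.1f us\n", us);

    cudaFree(devBuf);
    delete[] hostBuf;
    return 0;
}
```

Without the synchronize call, the blocking cudaMemcpy() would first wait for the kernel, and the measured interval would include the kernel's run time.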

Your kernel is probably ending early because it is accessing an illegal address. Look at an SDK sample and see the calls that are made to check for errors immediately after a kernel execution.
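One common source of illegal addresses is a grid that is rounded up to whole blocks while the kernel has no bounds check. The sketch below shows the usual guard; the kernel name invertPixels, the helper launchInvert, and the per-pixel operation are placeholders, since the poster's kernel is not shown.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-pixel kernel; the operation itself is a placeholder.
__global__ void invertPixels(unsigned char *pixels, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // The grid is rounded up to a whole number of blocks, so the last
    // block can contain threads whose index is past the end of the image.
    if (i < numPixels)
        pixels[i] = 255 - pixels[i];
}

// Hypothetical launch helper: one thread per pixel, grid size rounded up.
void launchInvert(unsigned char *devPixels, int numPixels)
{
    const int threads = 256;
    const int blocks = (numPixels + threads - 1) / threads;
    invertPixels<<<blocks, threads>>>(devPixels, numPixels);
}
```

Passing the pixel count as a kernel argument, rather than hard-coding it, means the same guard holds for images of any size.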

Thank you for your reply.

I tried using CUT_DEVICE_INIT(argc, argv); in my main function, and then CUDA_SAFE_CALL whenever I call cudaMalloc(), cudaMemcpy(), and cudaFree(). There is no error message, but the program still doesn’t work.

The problem is strange. My program processes one image at a time. If I use an image with size X as input, it works. If I then use another image with size Y, the program fails. Once the program has failed, it won’t work even if I process the image with size X again. I thought maybe I didn’t clean up the memory somewhere, but I checked my program several times and everything is freed. So I don’t know what could cause the problem.

What I was thinking of is the macro CUT_CHECK_ERROR(). Use it after every kernel call (but keep in mind that the real CUT_CHECK_ERROR() only works in debug builds).
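If you want a check that also works in release builds, something along these lines should do. This is only a sketch: CHECK_CUDA and dummyKernel are hypothetical names, not part of the SDK, and cudaDeviceSynchronize() is the newer name for cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the SDK): report a failed runtime
// call with file and line, then exit. Active in debug and release builds.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: CUDA error: %s\n",                \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel used only to demonstrate the checks.
__global__ void dummyKernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

int main()
{
    int *devOut = 0;
    CHECK_CUDA(cudaMalloc((void **)&devOut, 32 * sizeof(int)));

    dummyKernel<<<1, 32>>>(devOut);

    // Catches launch failures such as an invalid configuration.
    CHECK_CUDA(cudaGetLastError());

    // Waits for the kernel; errors raised while it was running
    // (for example an out-of-bounds access) are reported here.
    CHECK_CUDA(cudaDeviceSynchronize());

    CHECK_CUDA(cudaFree(devOut));
    return 0;
}
```

Checking only cudaMalloc(), cudaMemcpy(), and cudaFree() can miss the failing kernel itself, because a launch returns before the kernel has run.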

I think your kernel is accessing memory out-of-bounds and causing an exception. Before it crashes it seems to overwrite critical cuda-related portions of GPU memory. Or something. Anyway, it’s not unusual for a crashed kernel to need a reboot to clean up, at least on some platforms. Out of curiosity, which OS are you using?

I am using Windows XP with CUDA 2.0.

I found two posts online,

It seems that some people recommend using one GPU for the desktop display and another GPU for running the CUDA program. I am not sure if that’s the problem in my case, but it is worth a try.

Just one question: how do I specify which GPU I want to use for my CUDA program?

Thank you.

cudaSetDevice(). Check out the CUDA Reference Manual, section Device Management RT (the very beginning).
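A rough sketch of how that might look: list the devices, then pick one by index. Taking the index from the command line is just an assumption for illustration, and which index belongs to the display GPU depends on your system.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    // Print every device so the right index can be identified by name.
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) == cudaSuccess)
            printf("device %d: %s\n", d, prop.name);
    }

    // Assumption for this sketch: the index is the first command-line
    // argument, defaulting to device 0.
    int chosen = (argc > 1) ? atoi(argv[1]) : 0;
    if (chosen < 0 || chosen >= count) {
        fprintf(stderr, "device index %d out of range\n", chosen);
        return 1;
    }

    // Select the device before any allocation or kernel launch.
    cudaError_t err = cudaSetDevice(chosen);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("using device %d\n", chosen);
    return 0;
}
```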