cudaMalloc bug?

OK so I’ve got a CUDA program that resamples some input data. You give it a ratio U/D and it resamples the input data for you.

I noticed when I changed the code that calculates my output buffer size from:

out_buff_size = (int)round(res_ratio * (float)PTS_PER_ITER+.5);

to

out_buff_size = (int)round(res_ratio * (float)PTS_PER_ITER);

I started getting unspecified launch errors. (Note that the above changed the size of my output buffer from 180225 to 180224 elements, no biggie).

A little investigating led me back to some buffers that didn’t seem to be allocated correctly, so I changed these two lines:

    

CUDA_SAFE_CALL(cudaMalloc((void**)&dev_out_buff_si, sizeof(short) * out_buff_size));

CUDA_SAFE_CALL(cudaMalloc((void**)&dev_out_buff_sf, sizeof(float) * out_buff_size));

to

CUDA_SAFE_CALL(cudaMalloc((void**)&dev_out_buff_si, sizeof(short) * (out_buff_size+1)));

CUDA_SAFE_CALL(cudaMalloc((void**)&dev_out_buff_sf, sizeof(float) * (out_buff_size+1)));

And the error resolved itself (note that the CUDA_SAFE_CALL never reported an error from cudaMalloc).

Is this a bug with cudaMalloc where it doesn’t properly allocate buffers of certain sizes? I’m using SDK 2.0 on Linux. Note also that I tried cutting my input buffer size in half, and still got the same behavior…

The traditional answer is to compile the code that fails on the device with -deviceemu and run it through valgrind. I would be very, very surprised if there’s an error in cudaMalloc and not in your kernel.

The kernels actually fail to launch when the bug manifests. I double-checked that my grid and block sizes were on the up and up, and they appeared to be. In fact, since I’m dividing the input signal up into blocks, the end of the buffer is in a block that only partially executes, so the grid and block sizes for the working and non-working cases come out the same…
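For reference, the usual thing to check with a partially filled last block is the per-thread bounds guard. Below is a rough host-side C sketch that emulates a 1D launch, in the spirit of device emulation. It is not the kernel from this thread: resample_thread, emulate_launch and threads_in_grid are hypothetical names, and the "kernel body" just writes its own index.

```c
#include <stddef.h>

/* Hypothetical stand-in for a kernel body: one "thread" writes one output
 * element. Returns 1 if it wrote, 0 if the guard skipped it. */
static int resample_thread(float *out, size_t n,
                           size_t block, size_t block_dim, size_t thread)
{
    /* Same index math as blockIdx.x * blockDim.x + threadIdx.x. */
    size_t i = block * block_dim + thread;

    /* Guard: threads in the partial last block that fall past the end of
     * the buffer must not touch memory. */
    if (i >= n)
        return 0;

    out[i] = (float)i;
    return 1;
}

/* Total threads in a grid of ceil(n / block_dim) blocks, including the
 * idle ones past index n - 1. */
size_t threads_in_grid(size_t n, size_t block_dim)
{
    return ((n + block_dim - 1) / block_dim) * block_dim;
}

/* Emulates the launch on the host and returns how many elements were
 * actually written. */
size_t emulate_launch(float *out, size_t n, size_t block_dim)
{
    size_t grid_dim = (n + block_dim - 1) / block_dim;
    size_t written = 0;

    for (size_t b = 0; b < grid_dim; ++b)
        for (size_t t = 0; t < block_dim; ++t)
            written += (size_t)resample_thread(out, n, b, block_dim, t);

    return written;
}
```

For a 1000-element buffer and 256 threads per block, the grid holds 1024 threads but only 1000 writes happen. If the guard were missing, or written as i > n instead of i >= n, the extra threads would write past the end of the allocation, which is the kind of bug an extra element of padding can hide.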

They fail to launch? What’s the error returned?

(An unspecified launch error generally means the kernel did launch and then crashed, i.e. a segfault in the kernel.)