Runtime API error 4: unspecified launch failure on cudaMalloc

Hello,

My code crashes with the error in the title on a cudaMalloc call.

This is strange to me because 1) cudaMalloc is not a kernel launch, and 2) the amount of memory requested is far less than the free memory on the device. I check the free memory right before this call, and that query succeeds, which is how I know there is plenty of space.
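
For reference, the sequence described above looks roughly like the sketch below. It is not the real code: it assumes the free-memory query is done with cudaMemGetInfo, and the `report` helper and the 64 MB request size are placeholders made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the real code): print a labelled message
// if a runtime API call failed. Returns true when the call succeeded.
static bool report(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: error %d (%s)\n",
                     what, (int)err, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    // Query free and total device memory right before the allocation.
    size_t free_bytes = 0, total_bytes = 0;
    if (report(cudaMemGetInfo(&free_bytes, &total_bytes), "cudaMemGetInfo")) {
        std::printf("free: %lu bytes, total: %lu bytes\n",
                    (unsigned long)free_bytes, (unsigned long)total_bytes);
    }

    // Placeholder request size (assumption), far below the reported free memory.
    const size_t request = 64u << 20;  // 64 MB

    // The allocation that reports the error in the real program.
    void *buf = NULL;
    if (report(cudaMalloc(&buf, request), "cudaMalloc")) {
        cudaFree(buf);
    }
    return 0;
}
```

In the failing program, the query step succeeds and the cudaMalloc step is the one that returns the error.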

This only started happening after upgrading to CUDA 4.0 earlier today.

Does anyone have any suggestions on what to look for?