Differences between cudaMalloc() and cuMemAlloc()

Does anyone know if there’s a difference between what cudaMalloc() does in the runtime API and what cuMemAlloc() does in the driver API?

I ask because I added some CUDA code using the runtime API to an existing application. This application has its own custom memory management and threading, and I was getting segfaults when calling cudaMalloc(). However, when I switched to the driver API, I was able to allocate memory, copy data to the device, and launch a kernel without any crashes.

I’m not clear on the differences between the two APIs, so can anyone tell me why cudaMalloc() would crash when cuMemAlloc() doesn’t? Thanks!