“The device memory heap has a fixed size that must be specified before any program using malloc() or free() is loaded into the context.”
“Heap size cannot be changed once a module load has occurred”
So a call to cudaDeviceSetLimit with cudaLimitMallocHeapSize will succeed if you make it before any kernel call. Once you have run a kernel that does a malloc operation, the heap size is fixed, and any further attempt to change it with cudaDeviceSetLimit will return an error. Try a simple test case and you will see that the description in the documentation is accurate.
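Here is a minimal sketch of such a test case. The kernel name useHeap, the helper report, and the 64 MiB / 128 MiB sizes are just placeholders I picked for illustration. It sets the heap size, runs a kernel that calls malloc(), then tries to set the heap size again and prints the status of each step:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread allocates from the device heap and frees it.
__global__ void useHeap()
{
    int *p = static_cast<int *>(malloc(sizeof(int)));
    if (p != nullptr) {
        *p = threadIdx.x;
        free(p);
    }
}

// Print the status string for one step of the test.
static void report(const char *label, cudaError_t err)
{
    printf("%s: %s\n", label, cudaGetErrorString(err));
}

int main()
{
    // First call, made before any kernel launch.
    report("set before kernel",
           cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u * 1024 * 1024));

    // Run a kernel that uses device-side malloc()/free().
    useHeap<<<1, 32>>>();
    report("kernel", cudaDeviceSynchronize());

    // Second call, made after the kernel has run.
    report("set after kernel",
           cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128u * 1024 * 1024));

    // Query the heap size that is actually in effect.
    size_t size = 0;
    cudaDeviceGetLimit(&size, cudaLimitMallocHeapSize);
    printf("heap size in effect: %zu bytes\n", size);
    return 0;
}
```

Build it with nvcc and compare the status of the two cudaDeviceSetLimit calls. The exact error string may differ between CUDA versions, so I have not listed expected output here.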
So decide what size you want the heap to be, taking into account the needs of all the kernels in your program. Then set it once, at the beginning of the program, before any kernel calls, because it cannot be changed afterwards.
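One way to pick the size up front is sketched below. Everything in it is an assumption for illustration: the HeapNeed struct, the per-kernel numbers, and the 2x slack factor are hypothetical. It also assumes the kernels run one after another and free everything they allocate, so only the largest demand matters. If kernels run concurrently or keep allocations alive, you would need to add the demands together instead.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical description of one kernel's worst-case heap demand.
struct HeapNeed {
    size_t threads;         // total threads in the largest launch
    size_t bytesPerThread;  // most that one thread holds from malloc() at once
};

int main()
{
    // Illustrative numbers only; replace with estimates for your own kernels.
    const HeapNeed needs[] = {
        {1u << 16, 256},
        {1u << 14, 4096},
    };

    // Largest single-kernel demand (assumes kernels do not overlap
    // and each one frees what it allocates).
    size_t peak = 0;
    for (const HeapNeed &n : needs)
        peak = std::max(peak, n.threads * n.bytesPerThread);

    // Assumed slack for allocator overhead and fragmentation.
    const size_t heapBytes = peak * 2;

    // Set the heap size once, before any kernel launch.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to confirm what the runtime will use.
    size_t actual = 0;
    cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
    printf("requested %zu bytes, limit is %zu bytes\n", heapBytes, actual);

    // ... launch kernels from here on ...
    return 0;
}
```

Inside the kernels it is still worth checking the pointer returned by malloc() for NULL, since an estimate like this can turn out too small.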