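memset() sets memory one byte at a time. Although the second argument is an int, it gets converted to an unsigned char, so only the low byte is used and that byte is written to every position. This is the behavior of the C Standard Library, and cudaMemset() in CUDA behaves the same way.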
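To make the byte-wise behavior concrete, here is a minimal sketch. The helper name `memset_one_int` is made up for illustration, it assumes a 4-byte int, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: cudaMemset one int on the device, then read it back.
// Assumes sizeof(int) == 4. Error checking omitted for brevity.
int memset_one_int(int value) {
    int *d = nullptr;
    int h = 0;
    cudaMalloc(&d, sizeof(int));
    // Writes (unsigned char)value into each of the 4 bytes of the int.
    cudaMemset(d, value, sizeof(int));
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}

int main() {
    // Passing 1 does not give the int 1: every byte becomes 0x01.
    printf("0x%08x\n", (unsigned)memset_one_int(1));       // 0x01010101
    // Only the low byte (0x34) of 0x1234 survives.
    printf("0x%08x\n", (unsigned)memset_one_int(0x1234));  // 0x34343434
    return 0;
}
```

So the only int values you can reliably produce this way are ones whose bytes are all identical, such as 0 or -1.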
Additionally, the last time I checked, cudaMemset() was unreasonably slow. I recommend just writing a very simple kernel that fills your memory with the full-width values you need.
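A fill kernel along those lines could look like the sketch below. The names `fill_kernel` and `fill_ints`, the block size of 256, and the cap of 1024 blocks are arbitrary choices for illustration, not tuned values, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread writes `value` to elements i, i + stride, i + 2*stride, ...
__global__ void fill_kernel(int *data, int value, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) {
        data[i] = value;
    }
}

// Hypothetical host wrapper: fills n ints at device pointer d with `value`.
void fill_ints(int *d, int value, size_t n) {
    if (n == 0) return;
    const int threads = 256;                       // arbitrary block size
    size_t blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    if (blocks > 1024) blocks = 1024;              // grid-stride loop covers the rest
    fill_kernel<<<(unsigned)blocks, threads>>>(d, value, n);
}

int main() {
    const size_t n = 1000;
    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    fill_ints(d, 42, n);

    std::vector<int> h(n);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d %d\n", h[0], h[n - 1]);             // 42 42
    cudaFree(d);
    return 0;
}
```

Unlike cudaMemset(), this writes the whole int, so any value works, and the same pattern carries over to float or other element types by changing the pointer and value types.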
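cuMemsetD32() might do what you need, since it fills memory with a 32-bit value, but it is a Driver API function (those that start with cu*), not a Runtime API function (cuda*). Older CUDA versions did not let you mix the two; as far as I know, since CUDA 3.0 they can interoperate, but you then have to link against the driver library as well.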