I didn’t understand what you mean by swapping allocated memory to system RAM. What you may have observed is cudaMalloc allocating a “staging area” in system RAM: the driver needs a non-pageable (pinned) buffer to copy data between host and device. Search for “staging area cuda” or “pinned memory” for an explanation.
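If you want to see pinned memory explicitly, here is a minimal sketch (the buffer size of 1024 floats is just an illustration): you can page-lock your own host buffer with cudaMallocHost, so transfers don’t go through the driver’s internal staging copy:

```cuda
#include <cuda_runtime.h>

float *host_buf = NULL;
// Allocate 1024 floats of pinned (page-locked) host memory.
cudaMallocHost((void **) &host_buf, 1024 * sizeof(float));
// ... fill host_buf, then cudaMemcpy to the device at full bandwidth ...
cudaFreeHost(host_buf); // pinned memory has its own free call
```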
Simulating a smaller amount of device memory is easy. Suppose you want to allocate a certain amount of memory on the video card, but the amount depends on how much is free; that is, you can’t use absolute values here:
size_t free = 0, total = 0, alloc_size = 0;
float *dev_data = NULL;
cudaMemGetInfo(&free, &total); // query actual free/total device memory, in bytes
free *= 0.5;                   // pretend the card has only half its free memory
alloc_size = free * 0.1;       // take 10% of the simulated free space, in bytes
cudaMalloc((void **) &dev_data, alloc_size);
cudaMemset(dev_data, 0, alloc_size);
cudaMemcpy(dev_data, input_data, alloc_size, cudaMemcpyHostToDevice);
// Do something with the data on the device, call your kernels...
What this code does: it queries how much free memory you have (cudaMemGetInfo), then takes just half of it to simulate a card with less capacity. It then allocates 10% of this pseudo-free space, not 10% of what is really available, copies whatever data is in “input_data” (on the host) into that allocation, and calls your kernel(s) accordingly. Because the allocation is relative to the free space, it simulates a card with less capacity regardless of the actual hardware.
Notice that the snippet doesn’t do any error checking, which you must add (every CUDA runtime call returns a cudaError_t), and it could be simplified further with Thrust’s host_vector/device_vector, getting rid of the manual memory management.
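For comparison, a hedged sketch of the same allocation-and-copy with Thrust, assuming input_data is a host array with at least n floats:

```cuda
#include <thrust/device_vector.h>
#include <cuda_runtime.h>

size_t free_bytes = 0, total_bytes = 0;
cudaMemGetInfo(&free_bytes, &total_bytes);
// 10% of half the free space, expressed as a number of floats.
size_t n = (free_bytes / 2) / 10 / sizeof(float);
// Allocates on the device and copies from the host in one step.
thrust::device_vector<float> dev_data(input_data, input_data + n);
// dev_data frees itself when it goes out of scope -- no cudaFree needed.
```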