Memory on the Nvidia device between kernel calls tends to retain state

It seems memory on the Nvidia device is not necessarily cleared or reset between kernel calls: …orial-CUDA.html

The question is whether this is consistent behavior.

Because if it is, we can store the data in CUDA memory, perform calculations in kernel1, get back data1,
perform calculations in kernel2, get back data2. That would definitely be a speedup.

What are you even asking? Do you have to do a PCIe transfer after every kernel call or can you just leave memory resident on the GPU? Of course you can do the latter…

The question is not obvious.

cudaMalloc and cudaFree are made at the kernel level. After cudaFree the memory is released and the kernel finishes.

In the link above there are two runs: 1) with the kernel, 2) with the kernel commented out.

When the kernel was commented out, a dump of memory shows it retains its values.

Also I would add - what I want is persistent data that remains in memory, is not overwritten, and is available
in subsequent kernel calls.

Now, what will happen when:

  1. another cudaMalloc and cudaFree are made in the next kernel? In this case I am not sure the first data will remain in device memory.

  2. As far as I know, in Windows 7 DirectX 11 is using the CUDA device. What will happen if DirectX 11 allocates its own memory on the
    device between my kernel calls?

I still have no idea what you’re really asking.

cudaMalloc and cudaFree are identical to normal malloc/free, which means that whatever values happened to be in that chunk of physical memory are readable after allocation. If you allocate N bytes, write something to it, free that region, and then immediately allocate N bytes, there’s certainly a good chance that you’ll get the original N byte region back (just like with normal malloc), but this is absolutely not something to be dependent on (just like normal malloc). If thread 2 allocates N bytes between thread 1’s free and second allocation of N, who knows what will happen?

(also, just like on the CPU, if you malloc/free around every function even when the sizes are the same between functions you’re doing it wrong)

I think some of the confusion here is assuming that the card has some kind of unprotected memory space that could be overwritten by another process using the card. There is virtual memory translation going on in the device, so other GPU contexts cannot see your memory. Calls to cudaMalloc() and cudaFree() by other processes cannot affect your memory space. (Although driver bugs in the past have resulted in crashed system when accessing random memory locations.) Now, if someone else allocates all the remaining memory on the card, then your process won’t be able to allocate anymore memory.

I want to have persistent data between kernel calls to eliminate the need to upload data to the device in every kernel call, because that is slow. But it seems I have to upload data to the device in every kernel call and perform all calculations in just one kernel call. Or is CUDA designed to make only one kernel call per application?

Again, what? You can

  • allocate memory
  • copy memory from the CPU
  • run 50,000,000,000 kernels on any regions you have allocated
  • copy memory back to the CPU
  • free memory
  • repeat if you want

There’s no pairing of cudaMalloc to a specific kernel call or anything like that. It’s like malloc on the CPU.
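The pattern in the list above can be sketched roughly as follows - a minimal, hedged example (the `scale` kernel, buffer size, and launch configuration are invented for illustration, not from this thread):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: scales each element in place.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 20;
    float *h = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, N * sizeof(float));                           // allocate once
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice); // copy from CPU once

    // Run as many kernels as you like on the same allocation;
    // the data stays resident on the device between launches.
    for (int k = 0; k < 1000; ++k)
        scale<<<(N + 255) / 256, 256>>>(d, N, 1.0f);

    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost); // copy back once
    cudaFree(d);                                                 // free once
    printf("h[0] = %f\n", h[0]);
    free(h);
    return 0;
}
```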

I will try to explain better:

  1. kernel1 - allocate memory1

  2. kernel2 - allocate memory2, calculations on memory1, results in memory2, copy memory2 back to CPU, free memory2

  3. kernel3 - allocate memory3, calculations on memory1, results in memory3, copy memory3 back to CPU, free memory3

  4. kernel100 - allocate memory100, calculations on memory1, results in memory100, copy memory100 back to CPU, free memory100

  5. kernel101 - free memory1

If 5) is possible then the question is answered.
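The steps above can be sketched in host code roughly like this - a hedged sketch in which the `compute` kernel body, sizes, and `run` helper are invented for illustration; the point is that memory1 persists while each step allocates and frees its own output buffer:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: reads the persistent input, writes a fresh output.
__global__ void compute(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void run(const float *h_in, float *h_out, int n, int steps) {
    size_t bytes = n * sizeof(float);

    float *memory1;
    cudaMalloc(&memory1, bytes);                        // step 1: allocate memory1
    cudaMemcpy(memory1, h_in, bytes, cudaMemcpyHostToDevice);

    for (int s = 0; s < steps; ++s) {                   // steps 2..4: kernel2..kernel100
        float *memoryX;
        cudaMalloc(&memoryX, bytes);                    // allocate memoryX
        compute<<<(n + 255) / 256, 256>>>(memory1, memoryX, n);
        cudaMemcpy(h_out, memoryX, bytes, cudaMemcpyDeviceToHost);
        cudaFree(memoryX);                              // free memoryX; memory1 is untouched
    }

    cudaFree(memory1);                                  // step 5: free memory1 at the end
}
```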

That was my question - how stable is it? Because Windows 7 already uses CUDA too.

Why wouldn’t it be possible? Of course it is… that’s what tmurray said - it’s just like malloc/free or new/delete in regular C/C++ code.

The question, and I think no one really understands what you are really asking, is: why do you ask this? What are you trying to achieve?

With every “copy memoryX back to CPU” you’re PCI-e bound, and this might be your bottleneck.

The question is why you need this and whether it can be avoided… otherwise you’ll have to pay this device->host penalty.

edit - oh, one more potential misconception. You don’t allocate/free memory from within the kernel code. You do it outside the kernel, on the host side, and then pass the device pointer to the kernel to operate on it.
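A minimal sketch of that division of labor (kernel body, sizes, and names are invented for illustration):

```cuda
#include <cuda_runtime.h>

// The kernel only receives a device pointer; it never allocates.
__global__ void fill(int *d_data) { d_data[threadIdx.x] = threadIdx.x; }

void host_side() {
    int *d_data;
    cudaMalloc(&d_data, 64 * sizeof(int));   // host code, not kernel code
    fill<<<1, 64>>>(d_data);                 // pass the device pointer in
    cudaFree(d_data);                        // host code again
}
```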


cudaMalloc and cudaFree - are these outside the kernel?

oh you see, I guessed right… :)

Yes, you do all your memory allocation/freeing outside the kernels…

The question is how stable it is. If the CUDA application is only my own application, I suppose it is stable. If Windows 7 via DirectX 11 is another running CUDA application and there is context switching, then it is not obvious it will work.

If it were for internal use it would be one thing, but my application is installed on many different machines, so I believe it is a good question to know how stable it is in different environments.

I believe seibert answered this in post #6. Give some credit to nVidia that such an obvious thing would work…

I believe the question is answered if context switching between CUDA applications is stable ;)

e.g. what happens this case:

  1. cudaMalloc

  2. kernel

  3. context switch (DirectX?)

  4. cudaFree

seibert says:

Calls to cudaMalloc() and cudaFree() by other processes cannot affect your memory space. (Although driver bugs in the past have resulted in crashed system when accessing random memory locations.)

I am sure nVidia made it right, but this is software, and I am doing work to be delivered to many different computers with different configurations.

It is not possible to test all permutations, and a possible driver bug can crash the system, as you see.

Context switches don’t matter, they’re not going to magically explode the data you have. WDDM actually does paging anyway, so if you have a 1GB card and App 1 requests 800MB and then App 2 requests 800MB, WDDM will actually page things in and out.

Thanks a lot, this explains to me how it works.

I have one more question. My card is Quadro FX 1800.

It is reported from deviceQuery.exe:

Total amount of global memory: 805,306,368 bytes.

What amount is actually free for use by my CUDA application? I made several tests and came close to 80% of the reported 800 MB. Is there a reliable formula to calculate this?

Thanks in advance

Not particularly; it depends on the amount of framebuffer used (or, in Vista, how much you’re willing to page).
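Rather than a formula, one option is to ask the driver at runtime how much global memory is currently free, using `cudaMemGetInfo()` from the runtime API - a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    // Reports free and total device memory at this moment; the free
    // figure changes as the display driver and other apps allocate.
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("free: %zu MB of %zu MB total\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```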