Help with memory management

Hi. I am a little new to CUDA. I am writing a program for a problem in number theory. My problem is that the results I get can be numerous (potentially 6*10^15 longs). What I need to do is store results in a “buffer” on the device and periodically have the host read from that buffer (while the kernel is still executing). I know this is possible with mapped memory, but my hardware does not support that. It was suggested that I use texture memory for this purpose, along with events. So my questions are:

  1. Is it possible to use events as a way to send an “interrupt” to the host (to have it trigger a copy from the “buffer”)?
  2. How would texture memory be useful in this context?
    As a side note, my kernel will be running for hours (maybe days) and the reason I need this is to record intermediate results.

Unfortunately, no. For that sort of arrangement, the only practical solution is to have the host run the outer “coordinating” loop with a re-entrant kernel, and have the host periodically sample the state of the solution to capture intermediate values or check for convergence.
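A hedged sketch of that host-side coordinating loop (the kernel, `d_state`, and `NCHUNKS` are illustrative placeholders, not from the original posts): the long computation is split into chunks, and the host samples the device state between launches.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void solver_step(long *state, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += 1;   // stand-in for one chunk of real work
}

int main(void) {
    const int N = 1 << 20, NCHUNKS = 100;
    long *d_state, *h_state = (long *)malloc(N * sizeof(long));
    cudaMalloc(&d_state, N * sizeof(long));
    cudaMemset(d_state, 0, N * sizeof(long));

    for (int c = 0; c < NCHUNKS; ++c) {
        solver_step<<<(N + 255) / 256, 256>>>(d_state, N);
        cudaDeviceSynchronize();               // wait for this chunk
        cudaMemcpy(h_state, d_state, N * sizeof(long),
                   cudaMemcpyDeviceToHost);    // sample intermediate results
        // ...write h_state to disk, check for convergence, etc.
    }
    cudaFree(d_state);
    free(h_state);
    return 0;
}
```

The per-launch overhead is tiny compared to a kernel that runs for minutes, so chunking costs almost nothing.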

I don’t think it will be helpful at all (at least in the context of host–device data exchange). Texture memory is read-only within the context of a single kernel launch, so if your kernel is running for the lifetime of your application, there is no coherence, and nothing the host can do to sample its contents during execution.

So you are suggesting something as follows:

If I have 100*1000 combinations

In that case I guess texture memory might help me get the results quicker (since I will be doing a lot of copying)?


I have a rather silly question; it’s connected to memory management issues, so I thought I’d post it here.
The problem is the following: I know it’s a very bad idea to increment a variable from different threads running concurrently, because of the memory access conflict.
My question is, will it serialize the operation, or just produce unreliable data in the counter? By my experiments it seems the latter is true: even with 512 blocks of 512 threads/block, the difference in computing time is 5–8 ms, compared to ~100 ms of total computing time. The unreliable counter data is fine with me, because I only want to know whether there is a single element in the array which satisfies the condition, and it seems the first increment is always successful, or at least one per block. (I know about warp vote functions, but they cannot be used here.)

Another question: as far as I know, there is some kind of broadcast if multiple threads read the same shared memory location. Does this apply to a variable stored in global memory too?

If you want a counter in global memory, use the atomic increment function. Then it will be correct even with multiple threads.
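A minimal sketch of that (kernel and variable names are illustrative, not from the thread):

```cuda
#include <cuda_runtime.h>

// count the elements satisfying some condition with a global counter;
// atomicAdd makes the increment correct under concurrent access
__global__ void count_nonzero(const int *data, int n, int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] != 0)
        atomicAdd(counter, 1);   // correct even when many threads hit it
}
```

Initialize the counter to 0 with `cudaMemset` before the launch and copy it back afterwards.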

Compute capability 1.2 and later devices have a memory controller which is smart enough to do this. I believe compute capability 1.1 and earlier devices would handle broadcast from global memory very inefficiently.

Oh, I didn’t know this. So if I want to write to a global array from across thread blocks, could I do that with an atomic increment? I.e., when a thread writes to the global array, it increments the array pointer using an atomic increment. Does that work? I think not, because the write itself is not atomic and multiple threads could read the pointer value and try to write at the same time. I am not sure how much sense I am making! :confused:

If you need more than just an increment, then there’s enough stuff in the atomics to handle locks as well. However, this is generally frowned upon, since having a few thousand threads compete for the same lock is unlikely to give optimum performance. Perhaps if you could describe what you’re trying to achieve, people could suggest alternative approaches which would avoid the need for a lock.
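For completeness, a lock can be built from `atomicCAS`/`atomicExch`, though as said above it is rarely a good idea. A hedged sketch (and note that naive per-thread spinning like this can deadlock when threads of the same warp compete for one lock, so it is usually done by one thread per block):

```cuda
// spin until we flip the lock from 0 to 1
__device__ void acquire(int *lock) {
    while (atomicCAS(lock, 0, 1) != 0) { }
}

__device__ void release(int *lock) {
    atomicExch(lock, 0);   // hand the lock back
}
```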

But as far as I know, atomic functions DO serialize computation, which is what I want to avoid.

How many threads are going to increment this counter?

I have a huge array, spread across multiple blocks, and I want to check whether there is any non-zero element in a given section, and preferably store the result in one element of another matrix. I know the obvious choice would be a warp vote, but the sections are not contiguous, and which one needs to be checked is also conditional.

Varying; all which satisfy a certain condition.

Well, the direction I’m going with that question is this: If the fraction of threads which will access the counter is low, then the serialization penalty should be negligible.

Your other option is to have every thread satisfying the condition write a “1” to the same memory location. One of them will win, and you’ll have a success flag.
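A sketch of that success-flag idea (names are illustrative): every thread that satisfies the condition writes 1 to the same location, one of the writes wins, and the flag ends up set.

```cuda
// *flag must be zeroed (e.g. with cudaMemset) before the launch
__global__ void any_satisfies(const int *data, int n, int *flag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] != 0)
        *flag = 1;   // plain, non-atomic write; any "winner" leaves 1
}
```

This works because all the competing writes store the same value, so it does not matter which one lands last.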

I’m actually using it as a success flag anyway, so thanks for the idea; I will use =1 instead of an increment. You’re saying that way is perfectly legitimate and won’t hurt the parallel computation?

Well, again, if lots of threads have to write a “1” to one location, there will still be warp divergence and serialization of writes. But if there is substantial computation before this point, it won’t be a large effect overall. Writing a success flag should be slightly faster than an atomic increment, but it may not matter in the end.

I think a few of us have similar intentions. I have a lot of threads (maybe 10^9) that will run in a single kernel call. Each of them may potentially want to write out a result, which is a set of 6 longs. Typically, though, not all the threads will need to write out a result. So I need a global array to which all these threads can write in unique locations. (I cannot allocate space for each thread, as I would run out of memory, so I want to allocate just enough for an estimated number of results.) So will atomic operations with 10^9 threads do the trick by incrementing my array pointer atomically? I guess that means the memory writes will slow down my kernel.
I have another topic (occupancy and memory) discussing this problem. The reply there is also to use atomic instructions. I am posting here seeing that a few others are thinking along similar lines.

Right, the only reason to hesitate with atomics is that they are not as efficient for getting data into memory as a coalesced write. If nearly every thread is going to write something, you might find it faster to allocate an array of output slots for every thread, have every thread write something, then go back and filter out the results you don’t care about. This will maximize your output bandwidth and be better than forcing threads to write out piecemeal in an uncoordinated way.

However, if the output stage of your kernels is negligible, then doing whatever is most convenient should be fine.

How would you filter out the results? Don’t you come back to the same problem – you will need to increment an atomic counter. You could filter on the CPU, but (in my case) I have follow-on GPU calculations, so it is better to do the filtering on the GPU. I have been reading about Thrust (something like copy_if or partition), but I don’t want to pull in another toolkit at this point. Is there a simple, straightforward example somewhere of how to filter results?

This is called “stream compaction” and can be done without atomics with the help of a prefix scan pass to compute the target indices of all the elements. I’m not sure where to find a simple example of it, but there is an implementation of it in CUDPP. Not sure if reading that source is any easier than Thrust, though.
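For reference, the Thrust version mentioned above is only a few lines. A hedged sketch, where `d_in` holding one output slot per thread (mostly zeros) is an assumed setup:

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <cstdio>

// predicate: keep only the "real" results (non-zero entries here)
struct is_nonzero {
    __host__ __device__ bool operator()(long x) const { return x != 0; }
};

int main() {
    // one output slot per thread, mostly empty
    thrust::device_vector<long> d_in(1000, 0);
    d_in[17]  = 42;
    d_in[998] = 7;

    thrust::device_vector<long> d_out(d_in.size());
    // copy_if does the stream compaction (prefix scan internally)
    auto end = thrust::copy_if(d_in.begin(), d_in.end(),
                               d_out.begin(), is_nonzero());
    d_out.resize(end - d_out.begin());   // only the kept results remain
    printf("%zu results kept\n", d_out.size());
    return 0;
}
```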

Stream compaction is a non-trivial amount of code, so if it is easy to try the atomic approach, I’d do that first and see if the performance is acceptable.
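The atomic approach is short. The key point (which also answers the earlier worry about non-atomic writes) is that `atomicAdd` returns the *old* value, so each writing thread reserves a unique slot before it writes. A hedged sketch with illustrative names:

```cuda
// d_results has room for `capacity` results of 6 longs each;
// d_count is a single int zeroed before the launch
__global__ void produce(const long *input, int n,
                        long *d_results, int *d_count, int capacity) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (input[i] % 2 == 0) {                 // stand-in for the real test
        int slot = atomicAdd(d_count, 1);    // reserve a unique slot
        if (slot < capacity) {               // don't overrun the estimate
            for (int k = 0; k < 6; ++k)      // write the 6-long result
                d_results[slot * 6 + k] = input[i] + k;
        }
    }
}
```

After the kernel, copy `d_count` back; if it exceeded `capacity`, the buffer was too small and some results were dropped.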

Can you tell me what the magic is in 10+ blocks? How many blocks can really run in parallel on a GTX 260 (216 SPs)? My algorithm keeps failing when I try to stretch it to 10 or more blocks. Up to that limit everything is perfect, but when I reach 10 blocks the results become unreasonable; they wouldn’t even satisfy the exit condition, yet the kernel still finishes, and the data looks as if the inner loop never happened at all. (The algorithm works like this: every thread compares 4 variables and stores the result in the corresponding block of a huge matrix; then comes a loop which processes the whole matrix a few times, until a certain exit condition is satisfied. Everything happens in global memory.)

Okay, it’s actually not the 10 blocks as it seems, because if I manually set 10 blocks with 10 threads each to compute a 10×10 matrix, the result is flawless, but if I use 20 blocks with 500 threads each to compute a 100×100 matrix, the result is garbage.

And I finally got it right; it was only a matter of restricting some instructions to a single thread :D
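For anyone hitting the same bug, the usual idiom for “restricting some instructions to a single thread” looks like this (sketch; the flag variable is illustrative):

```cuda
__global__ void step(int *exit_flag) {
    // ...per-thread work on the matrix...
    __syncthreads();                 // make sure all threads are done first
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *exit_flag = 0;              // e.g. reset a shared flag exactly once
    }
}
```

Without the guard, every thread performs the reset concurrently, which can race with threads that are still reading or updating the flag.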

Another question: is there a way to call a kernel from another one? If it’s possible, is it quicker than launching a new kernel from host code? (The two kernels work with the same parameters; the new one is needed only for synchronization purposes.)