Is Fermi memory management different?

As far as I know, it is currently possible to dynamically allocate memory in kernel or device functions on Fermi (compute 2.x) but not on Tesla (compute 1.x). Is this really because of a hardware difference? If so, what is it? Thanks.
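For reference, a minimal sketch of the compute 2.x path the question refers to (the heap-size value and kernel name are illustrative, and this will not build for compute 1.x targets):

#include <cuda_runtime.h>

// Minimal sketch of what compute 2.x adds: per-thread heap allocation inside a kernel.
__global__ void per_thread_alloc(int n)
{
    int *buf = (int *)malloc(n * sizeof(int));  // device-side heap allocation
    if (buf == NULL) return;                    // the device heap can run out
    for (int i = 0; i < n; ++i)
        buf[i] = threadIdx.x + i;               // stand-in for real work
    free(buf);                                  // device-side free
}

int main()
{
    // The device heap is small by default; enlarge it before the first launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64u * 1024 * 1024);
    per_thread_alloc<<<1, 32>>>(16);
    cudaDeviceSynchronize();
    return 0;
}

This in-kernel malloc/free is exactly what is missing on compute 1.x devices.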

I have a workstation with some Tesla C1060 cards (they are really big blocks) and I want to do some CUDA programming. I need lots of threads (usually >1e+7), and each thread needs a variable number, usually tens, of malloc/realloc/free operations in kernel/device functions. After some surveying, I still found no good answer. It seems the only options left are to start a project to program SIMT-oriented malloc functions, or to allocate a fixed-size (although sizes differ among threads) memory region for each thread (which will reduce some computation accuracy); otherwise the blocks (and the bucks!) seem to become a waste :-( But if I could do this project, why didn’t NVIDIA? That’s why I started this topic. Hope anyone could provide any hints or solutions…

Even an authority telling me “no, it is impossible” would give me relief. This problem has tortured me for a long time.

No, it is impossible

– nay, I’m just joking. I have no idea either.

It may help if you use 1 thread to do the work of 64+ threads. With 1e7 threads you’ll have at least 1e4 blocks; that sounds scary, and I’m not sure whether Fermi could handle that efficiently.

If you do as I advised, it will be pretty easy to do your own in-device allocation.

  1. Just carve out a huge chunk of memory using cudaMalloc() and store the pointer in device memory (say, at address A).
  2. Then let the threads in a block determine the total amount of memory needed by the block (using a reduction sum). Let’s say the number of bytes needed is N.
  3. Then use atomicAdd to increase the value at A by N.
  4. Now, for each block, the base address of the allocated memory will be the result returned by the atomicAdd.
  5. The block will just have to determine the offset for each thread according to the amount of memory each one asked for.

It’s better to always make sure N is a multiple of 128 to help with alignment. Still, coalesced memory access is key to your performance when accessing that in-device-allocated memory. Depending on the way you use such memory, you’ll probably need more tricks than I have provided; a sketch of the basic scheme follows.
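Here is a hedged CUDA sketch of that scheme. The names g_pool, g_pool_offset, and block_alloc are illustrative, not a real API: the host cudaMalloc()s the big chunk, copies its pointer into g_pool with cudaMemcpyToSymbol, and zeroes g_pool_offset; the kernel is assumed to be launched with blockDim.x * sizeof(unsigned int) bytes of dynamic shared memory, and the pool is assumed to be under 4 GB so a 32-bit offset suffices:

__device__ unsigned char *g_pool;        // base of the big cudaMalloc'd chunk
__device__ unsigned int   g_pool_offset; // bytes already handed out (the "value at A")

__device__ void *block_alloc(unsigned int bytes)
{
    extern __shared__ unsigned int s_req[];  // per-thread byte requests
    __shared__ unsigned int s_base;          // this block's base offset in the pool

    // Round each request up to a multiple of 128 bytes for alignment.
    unsigned int req = (bytes + 127) & ~127u;
    s_req[threadIdx.x] = req;
    __syncthreads();

    // Naive inclusive scan over the block (illustration only; a real kernel
    // would use a work-efficient scan). Afterwards s_req[i] = sum of requests 0..i.
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        unsigned int add = (threadIdx.x >= stride) ? s_req[threadIdx.x - stride] : 0;
        __syncthreads();
        s_req[threadIdx.x] += add;
        __syncthreads();
    }

    // The last thread holds N, the block's total; reserve it with one atomicAdd.
    if (threadIdx.x == blockDim.x - 1)
        s_base = atomicAdd(&g_pool_offset, s_req[threadIdx.x]);
    __syncthreads();

    // Each thread's region starts at the block base plus its exclusive prefix sum.
    return g_pool + s_base + (s_req[threadIdx.x] - req);
}

A prefix sum is used rather than a plain reduction so that each thread gets its own offset in the same pass that produces the block total N.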

Thanks for the reply. The method you suggest works for fixed-size memory allocation (for each thread).

However, in my case each thread needs to call many malloc/realloc/free functions, and before each thread runs we don’t know how many such calls will be made or how much memory will be allocated by each call; that is decided by random number generators. Despite the high variability, doing this on 2.x devices is straightforward since malloc is available in kernels, but it is not currently available on 1.x devices. So I tried to find an alternative way to do it.

A slightly compromised way is available, however. I could allocate a big memory region for each thread, but no matter how big the region is, the size demanded by the thread could be bigger. The good thing is that the parameters must be given before the program runs, so based on the parameter settings we roughly know the range of memory sizes each thread will need. For each thread, we can therefore estimate a certain (big) size which will cover most of the memory demands. In the cases where a size bigger than the estimate is demanded, we could force the thread to use less memory, which will reduce some computation accuracy. The bigger the estimated size, the more accurate the results will be (since it will cover more demands); but considering the huge number of threads, we have to find a balance between accuracy and the memory limit. Although this could work, it does not sound like a very good solution.
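A hedged sketch of what this fallback might look like, assuming a per-thread budget of SLOTS_PER_THREAD ints (an illustrative constant; the demand below stands in for the RNG-driven value):

#define SLOTS_PER_THREAD 256

// scratch holds gridDim.x * blockDim.x * SLOTS_PER_THREAD ints, allocated on the host.
__global__ void fixed_budget_kernel(int *scratch)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *my_region = scratch + (size_t)tid * SLOTS_PER_THREAD;

    int demanded = 100 + (tid % 400);                 // stand-in for the RNG-driven demand
    int usable   = min(demanded, SLOTS_PER_THREAD);   // clamp: accuracy is traded away here

    for (int i = 0; i < usable; ++i)
        my_region[i] = i;                             // stand-in for the real per-thread work
}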

So before we resort to this compromised way, I hope someone here could tell us that the accurate way is actually out there, … or not.

So you are saying that each thread would do a different number of mallocs? My estimate is that your code is unlikely to run faster on the GPU than on the CPU.

The reason is that, from what I understand, you will have a lot of branching within the same warp, which will cause the different execution paths to be serialized. In that case the MPs’ parallel computing power simply cannot be effectively utilized. The slow-down factor will pretty much equal the number of different paths taken by threads within the same warp, assuming all the different paths have comparable lengths.

Another reason is that your memory accesses seem to be subject to too many variables (random?), so you can hardly get any coalesced access even with the best optimization tricks applied. In the worst case, the total slow-down factor could be the product of the factors from these two causes.

That being said, I still believe that your algorithm can be executed on GPU without the compromises that you are expecting, if the total memory needed by your threads is not, at any time, larger than the available device memory.

The accurate way can be done. However, if you intend to use free() effectively, you will have to maintain an allocation table on your own and do compaction when necessary. I suppose this could slow things down a lot if the numbers of malloc() and free() you call are large.
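To make the allocation-table idea concrete, here is a hedged sketch with compaction deliberately left out; BlockAllocator, table_alloc, and table_free are illustrative names, and one thread per block (e.g. threadIdx.x == 0) is assumed to serialize the bookkeeping for its block:

#define MAX_ENTRIES 64

struct AllocEntry {
    unsigned int offset;   // offset into the block's memory chunk
    unsigned int size;     // size in bytes; 0 means the slot is unused
};

struct BlockAllocator {
    AllocEntry   entries[MAX_ENTRIES];
    unsigned int bump;     // next free offset in the chunk
};

// Returns a handle (table index) or -1 if the table is full.
__device__ int table_alloc(BlockAllocator *a, unsigned int size)
{
    for (int i = 0; i < MAX_ENTRIES; ++i) {
        if (a->entries[i].size == 0) {
            a->entries[i].offset = a->bump;
            a->entries[i].size   = size;
            a->bump += (size + 127) & ~127u;   // keep 128-byte alignment
            return i;
        }
    }
    return -1;
}

__device__ void table_free(BlockAllocator *a, int handle)
{
    a->entries[handle].size = 0;   // mark free; real code would compact the chunk
}

The table itself could live in shared memory or at the start of the block’s chunk; the compaction pass, which is what actually reclaims freed space, is the part that gets expensive when malloc/free counts are large.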

If you insist on not increasing the amount of computation that each thread does, things could still work out if there is no dependency between threads. Even if there is, if you can bear with device-wide synchronization, there are still ways to make it work.

I’m making very general statements here because I have no idea what your threads will actually do. Though I guess you are implementing some stochastic algorithm. Is it some large genetic algorithm?

Yes.

That’s also what I worried about. I will run a test. But if the number of MPs (say M) is big enough, the GPU version could still be faster than a single-CPU version. Assuming the worst case, where effectively only one thread runs on each MP, we still have M threads running in parallel. Even a few-times speed-up is meaningful for our project.

Yes, this way is accurate, and it would actually produce in-device malloc functions for 1.x devices. It’s not easy, though.

There is no dependency between threads. What are the ways?

Yes, it’s stochastic but it’s not GA. It’s for biological purposes.

Actually, the allocation table is not hard to implement. It just slows things down a lot when there are too many malloc and free.

If there’s no dependency between threads, things could be better. There will be no need to maintain an allocation table if your blocks do not demand too much memory (meaning free is not needed). Let’s say you have 512MB of device memory available and 16 MPs, which means you can effectively have 16 blocks running at one time. As long as those 16*1024 threads do not demand more than 256MB of memory (half of the total available, or more if not all of it needs to be sent back to the host), you can just use the allocation method I mentioned in my first reply; the other half of the 512MB will be used for concurrent memcpy to the host in a different stream.

Of course, atomic operations coming from 16 MPs will also slow the MPs down quite a lot, so instead you can carve out 16 chunks of memory, each sized at 16MB. As long as one block (at most 1024 threads) does not demand more than 16MB of memory, this will work. You could always reduce the number of threads per block to ensure that one block does not need too much memory; in that case you’ll want good ILP to hide the latencies better.

You will divide your entire kernel execution into a lot of launches, each time launching only 16 blocks. If you are familiar with streams, I think you would already know what I’m talking about.

To sum up, you’ll have 16 big memory spaces, each used by a different block for dynamic allocation. Each kernel launch will launch 16 blocks, or more if each block actually does not demand as much memory as I have imagined. Then you will move the data produced by those 16 blocks back to the host and launch another grid of 16 blocks. Do this over and over until all blocks have been launched.
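A hedged host-side sketch of that multi-launch loop; NUM_CHUNKS, CHUNK_BYTES, the thread count, and the stand-in kernel are illustrative assumptions rather than code from this thread:

#include <cuda_runtime.h>

#define NUM_CHUNKS  16                     // one big chunk per resident block
#define CHUNK_BYTES (16u * 1024 * 1024)    // 16 MB each, as discussed
#define THREADS     256                    // threads per block (adjust to fit the budget)

// Stand-in for the real kernel: block i uses chunk i for its in-device
// allocations and writes one result per thread.
__global__ void my_kernel(unsigned char *chunks, size_t chunk_bytes,
                          float *results, int first_block)
{
    unsigned char *my_chunk = chunks + (size_t)blockIdx.x * chunk_bytes;
    (void)my_chunk;  // ... bump/table allocation out of my_chunk goes here ...
    results[blockIdx.x * blockDim.x + threadIdx.x] = (float)(first_block + blockIdx.x);
}

// h_results holds total_blocks * THREADS floats on the host.
void run_all_blocks(int total_blocks, float *h_results)
{
    unsigned char *d_chunks;
    float *d_results;
    cudaMalloc(&d_chunks, (size_t)NUM_CHUNKS * CHUNK_BYTES);
    cudaMalloc(&d_results, (size_t)NUM_CHUNKS * THREADS * sizeof(float));

    for (int first = 0; first < total_blocks; first += NUM_CHUNKS) {
        int blocks = (total_blocks - first < NUM_CHUNKS) ? total_blocks - first : NUM_CHUNKS;

        // Launch only as many blocks as there are chunks; each block owns one chunk.
        my_kernel<<<blocks, THREADS>>>(d_chunks, CHUNK_BYTES, d_results, first);

        // Move this batch's results home before the chunks are reused.
        cudaMemcpy(h_results + (size_t)first * THREADS, d_results,
                   (size_t)blocks * THREADS * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(d_results);
    cudaFree(d_chunks);
}

Here the blocking cudaMemcpy also serves as the synchronization point between launches; the stream-overlapped variant mentioned above would use a second results buffer and cudaMemcpyAsync so the copy can overlap the next launch.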

I’m new to CUDA programming and not very sure of the coding complexity yet. Source code for a CPU-side malloc() is easily available; in our case, I think we would need to make some modifications to avoid deadlocks created by resource competition between threads. I may try this later and may consult you (perhaps privately, if you don’t mind).

Considering what I said earlier (“no matter how big the estimated memory size, there could be very rare cases where some memory demands surpass it”), your method could still be considered a compromised one (that is, it loses some accuracy). But your multi-kernel-launch strategy did inspire me (thank you), and it should give higher accuracy than my compromised method since each thread could use more memory. I may try your method too, although I’m not sure of all the technical details you mentioned yet (so I may consult you about this method as well).

No problem, though I’m actually not really an expert in CUDA. I also only started one or two months ago.

It would certainly be of great help to gain a substantial amount of experience with CUDA before you embark on a huge project. By then you may well have realized that many of the things we have discussed here are not really necessary after all.

I believe you have already read the Programming Guide and Best Practices Guide many times. Next you might want to do more hands-on experiments; the book CUDA by Example could be helpful. As you progress, you might then want to use cuobjdump/decuda more often. Reading the ptx/cubin code is very useful for understanding how certain things work.

Thanks for your suggestions and recommendations. I’ll be glad to follow your advice. Indeed, more practice and reading is necessary for me.