No, not currently.
Thanks
Hi Mark,
Solved! There was no bug in the CUDA runtime after all. It is quite embarrassing for me: in a heavily multithreaded C++ program I used the strtok function instead of strtok_r. Because of this mistake, in very rare situations the buffer was not filled correctly!
However, since my algorithm does not really need the Managed Memory feature, I switched to the cudaMalloc() function. I limit the maximum number of GPU threads to the warp size of the GPU [32].
Now, after extensive testing, I can say it runs perfectly.
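For anyone hitting the same issue, here is a minimal sketch of the thread-safe variant, assuming a POSIX environment (illustrative only): strtok keeps its parsing state in a hidden static variable shared by all threads, while strtok_r uses an explicit, caller-owned save pointer.

#include <cstring>
#include <cstdio>

// strtok stores its parsing position in hidden static state, so two
// threads tokenizing different buffers can corrupt each other's state.
// strtok_r keeps that state in an explicit, per-call save pointer.
void tokenize(char *line) {
    char *saveptr = nullptr;                      // per-call parsing state
    for (char *tok = strtok_r(line, " ", &saveptr);
         tok != nullptr;
         tok = strtok_r(nullptr, " ", &saveptr)) {
        printf("%s\n", tok);
    }
}

int main() {
    char buf[] = "one two three";
    tokenize(buf);                                // safe to call from many threads
    return 0;
}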
Mark, I admire your support here and that you find the time to answer all these questions!
All the best to you and your family in 2015!
Hi Mike,
what do you think about: https://www.bu.edu/caadlab/... ?
I have a project where I'd like to use UMA if it really pays off. From that analysis, I see that the best improvements, when they happen, are modest compared to non-UMA. In my particular case, I have a large set of data, but each thread only needs to access a small piece of it and solve small (<20x20) linear problems, though very many of them (> 1e9 in total). I had thought of using dynamic parallelism so that, for those problems whose condition number is better, I can go ahead with the next batch. So dynamic parallelism and UMA seemed to me the way to go. But ... I am a bit confused now.
Thanks,
Sergi
Not sure who Mike is, and it's Unified Memory, not UMA (UMA is something specific -- related, but different). Anyway, it does sound like your problem may be amenable to both Unified Memory and Dynamic Parallelism.
Hi _Mark_,
sorry about the name. I have been trying to find the answer by myself, reading quite a few documents and posts over the last few days, and for some reason I failed in my first post here.
By UMA, I am sure we both understand Unified Memory Access. I had understood that you need UMA to make use of Unified Memory. The abstract of that paper seems quite clear about its relationship to the Unified Memory in this post and in CUDA 6 and later. The rest of the paper seems quite respectable too. What I would like to know is whether you agree with them about the apparent slowdown effects, or whether they can be mitigated. (NB: I have gone ahead with Unified Memory in my project on a K20m.)
Thanks
The problem is that UMA is already an existing acronym, which has a different meaning from the way the authors of that paper use it. UMA stands for *UNIFORM* Memory Access, not Unified Memory Access. The Uniform refers to the performance of memory accesses -- it means that access time is independent of which processor makes the access. See http://en.wikipedia.org/wik...
This is definitely not the same as Unified Memory.
The paper is otherwise reasonable. As we have said in the past, the initial versions of Unified Memory are aimed at simplifying heterogeneous programming, making it easier to port applications especially with complex data structures. Future releases (and hardware) will continue to improve this ease of use and improve performance.
Hello Mark
Is this shared memory provided by the hardware (Kepler), or is it only a software abstraction? Would this not work on Tesla or Fermi?
Shared memory is something specific and different in CUDA (it is a small on-chip memory in the GPU SM in which the threads of a block can share data). On Kepler, Unified Memory is a software abstraction with hardware support. It is not supported on earlier architectures.
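A minimal sketch of the distinction (illustrative only, not from the post): __shared__ declares the small per-block on-chip memory, while cudaMallocManaged allocates Unified Memory that both host and device reach through a single pointer.

#include <cstdio>

__global__ void sumBlock(const float *in, float *out, int n) {
    // __shared__: small on-chip memory, visible only to threads of this block.
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int t = 0; t < blockDim.x; ++t) s += tile[t];
        out[blockIdx.x] = s;
    }
}

int main() {
    const int n = 1024, block = 256;
    float *in, *out;
    // Unified (managed) memory: one pointer, usable from host and device.
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, (n / block) * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    sumBlock<<<n / block, block>>>(in, out, n);
    cudaDeviceSynchronize();            // required before touching managed data on the CPU

    printf("first block sum = %f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}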
By shared I meant shared between the GPU and the CPU. I'm aware of the normal shared memory on the GPU, which is shared within a block.
In this paper http://dedis.cs.yale.edu/20... they talk about CPU-GPU shared memory in CUDA 5.0, which I didn't find anywhere in the CUDA documentation.
Is this Unified Memory accessible from both the CPU and the GPU directly?
Hello Mark,
in cuda-7.0/samples/0_Simple/UnifiedMemoryStreams/ the example uses cudaDeviceSynchronize() in the construction and destruction. The documentation also mentions when it is necessary, and your full example on GitHub at https://github.com/parallel... follows it: cudaDeviceSynchronize() is used exclusively right after kernel invocation.
My question is whether there would be any good reason to include cudaDeviceSynchronize() in your Managed base class right after cudaMallocManaged and before cudaFree. A cudaDeviceSynchronize call takes much less time (<1%) than cudaMallocManaged, so there is no real performance penalty. But, probably, that's not the main argument for doing it or not.
Can't wait to see this implemented in Blender's Cycles rendering.
Basically, the programming model assumes that any kernel launched can be accessing any managed memory attached to the “global” stream, even if that memory was allocated _after_ the kernel was launched. This means that if you want to allocate managed memory and access on the CPU right away, you have to either make sure that all kernels have been synchronized OR you have to attach to the “host” stream when you allocate (i.e. do cudaMallocManaged(&ptr, size, cudaMemAttachHost)). The latter choice then requires that the data be attached to “global” or a specific stream if it needs to be accessed from the GPU.
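Roughly, in code (an illustrative sketch of the rule described above, not taken from the samples):

#include <cstdio>

__global__ void doubleAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *data;
    // Attach to the "host" stream at allocation time, so the CPU may touch
    // the memory immediately even if other kernels are still running.
    cudaMallocManaged(&data, n * sizeof(float), cudaMemAttachHost);
    for (int i = 0; i < n; ++i) data[i] = 1.0f;       // safe: host-attached

    // Before the GPU uses it, attach it to a specific stream (or back to
    // "global") and launch the kernel in that stream.
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
    doubleAll<<<(n + 255) / 256, 256, 0, stream>>>(data, n);

    cudaStreamSynchronize(stream);                    // then the CPU may read it again
    printf("data[0] = %f\n", data[0]);

    cudaFree(data);
    cudaStreamDestroy(stream);
    return 0;
}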
What about the unified memory that Maxwell first had on the roadmap and that Pascal then got? Is it something else? Hardware-based unified memory? Will Pascal get it, or has it been delayed again so that Volta now gets it?
Hi,
I wonder whether multi-level pointers are supported in Unified Memory?
Do you mean pointers to pointers? In CUDA, pointers are just pointers, so yes, this works.
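For example (a minimal sketch, not from the post), an array of row pointers works as long as every level of the structure lives in managed memory:

#include <cstdio>

__global__ void fill(float **rows, int ncols) {
    // The pointer-to-pointer is dereferenced directly on the device.
    int r = blockIdx.x, c = threadIdx.x;
    if (c < ncols) rows[r][c] = r * 100.0f + c;
}

int main() {
    const int nrows = 4, ncols = 8;

    float **rows;
    cudaMallocManaged(&rows, nrows * sizeof(float *));       // outer level
    for (int r = 0; r < nrows; ++r)
        cudaMallocManaged(&rows[r], ncols * sizeof(float));  // inner level

    fill<<<nrows, ncols>>>(rows, ncols);
    cudaDeviceSynchronize();

    printf("rows[2][3] = %f\n", rows[2][3]);                 // also valid on the host

    for (int r = 0; r < nrows; ++r) cudaFree(rows[r]);
    cudaFree(rows);
    return 0;
}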
Very helpful post -- thank you. One question: in the linked list section, you said “The program can maintain a single linked list, and list elements can be added and removed from either the host or the device.” I am writing a program to do precisely this: initialize a linked list on the host and manipulate it on the device (by allocating/rearranging elements on the device). I am having my data elements inherit from the “Managed” class as you have done. This approach works fine on the host; however, when trying to allocate elements from the device, I get the error that calling a host function (Managed::operator new) from a global function is not allowed.
How can I allocate elements from the device and add them to the list? I thought about making a device version of the Managed constructor, but I don’t know how/if this is even possible.
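For reference, the pattern in question looks roughly like this (a sketch in the spirit of the Managed class from the post); note that cudaMallocManaged is a host-only API call, which is why the overloaded operator new cannot be invoked from device code:

#include <cstddef>

// Base class whose instances are allocated in Unified Memory.
// cudaMallocManaged can only be called from host code, so this
// operator new is host-only; allocating new list nodes inside a
// kernel would need a different, device-side allocation strategy.
class Managed {
public:
    void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaFree(ptr);
    }
};

// Example element type: created with `new` on the host, the node lands
// in managed memory and can be linked and read from host and device.
struct ListElem : public Managed {
    int value;
    ListElem *next;
};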
I am using a Tesla K40. I find that when I simply change the device memory to unified memory, the kernel execution is 100 times slower. The kernel needs to be executed multiple times. I am confused because I thought this technique should implicitly copy data from host to device, but the performance seems to indicate that data is being read from host to device repeatedly, every time the kernel is launched.
Before Pascal GPUs (e.g. Tesla P100) that was true, but Pascal and later GPUs have hardware page faulting capability so that the memory does not have to be synchronized before each kernel launch.
Usually this is an indication that what you are measuring is different between the two cases. With device memory and explicit memory copies, you may not be including the copy run time in the timing. But when you switch to Unified Memory, you may now be including the transfer from host to device in the kernel run time.
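For instance (an illustrative sketch, not from the thread), if the CUDA events bracket only the kernel launch, the explicit-copy version excludes the cudaMemcpy cost from the measurement, while the managed version charges the host-to-device migration to that same interval:

#include <cstdio>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // data starts resident on the host

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // On a pre-Pascal GPU like the K40, the managed allocation is migrated
    // to the device as part of the launch, so that cost lands inside this
    // interval. With cudaMalloc plus an explicit cudaMemcpy outside the
    // events, the same interval would measure only the kernel itself.
    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel (+ migration): %.3f ms\n", ms);

    cudaFree(data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}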