Degenerate tetrahedral meshing...

Well, I did it. I created CUDA code that creates a degenerate tetrahedral mesh of a set of points. Here “degenerate” means that the mesh will be successfully built even if points are on the surface of a tetrahedron. In most algorithms, points like these are perturbed into no longer being on a tetrahedron so that’s why my algorithm is special.

My work is largely based on gDel3d. Here’s a link for a lot of good reading :

Here’s my code. The main file is

I’m getting massive performance drops with my calculate_point_info() kernel which is called from my redistribute_point() kernel using dynamic parallelism.

Here’s regulus’ basic pipeline :

  1. Allocate points
  2. Sort points by Peano-Hilbert key
  3. Allocate all-encompassing tetrahedron
  4. Nominate points for insertion
  5. Make sure all points will not create conflicting tetrahedra
  6. Insert points into appropriate tetrahedron
  7. Redistribute remaining points amongst new tetrahedra

According to nvvp, my calculate_point_info() kernel has atrocious global memory usage (both reading and writing). Does anyone have any advice on how to improve my code’s performance?

Some of the code in the file is dead. If there any questions, I’d be happy to answer them.

As a first step, I would suggest using the restrict attribute on all pointers passed to global functions. This assumes of course that none of the memory objects passed via pointers are aliased. See also the Best Practices Guide.

On initial review, I don’t see why you would experience poor global memory write efficiency. The relevant code in calculate_point_info() appears to be:

const int curr_write_offset = redist_offset_begin + tid;
pa[curr_write_offset] = pa_starts[tid];
ta[curr_write_offset] = ta_id;
la[curr_write_offset] = loc;
fs[curr_write_offset] = fract_size;

The addresses are formed using the “base + tid” pattern, and this should result in perfectly coalesced contiguous accesses that provide great efficiency.

Can you clarify what exactly the profiler is telling you regarding these memory accesses? Clearly this kernel has little computation and significant bandwidth requirements, so will be memory bound. The best you can do in such a situation is (1) try to achieve best possible efficiency of the global memory accesses (2) use a GPU that provides plenty of memory bandwidth, such as a GTX Titan or Titan Black.

Ooh, that’s a really good idea. Thank you!

Btw, be really, really, really careful if you’re running this code.

I’ve learned, DO NOT run in cuda-memcheck. I’m on Ubuntu 14.04 and a really, really good way to halt your whole system is to run it through the memchecker. It’ll typically freeze my whole system.

Also, be careful if you do try running it through nvvp because that’s also prone to locking up the system.

Running it outside of debuggers and profilers, it’ll be fine.

Sorry, these are important things I’ve noticed while debugging.

What exactly does “freeze the whole system” mean? Running under the control of the tools mentioned may make the code run slowly, just like running code under valgrind can make run code very slowly. Running slowly could also create the risk of hitting watchdog timeouts if you are using the same GPU for the GUI.

By freeze the whole system I mean, my music player will stop, the cursor will stop moving. I have conky running in the background and that will stop updating.

Like, for some reason the code will complete without segfaulting but running under memcheck, I get weird behavior with my system. I’m not sure if that means my code is super terrible or not.

One thing, cuda-memcheck does probably just make it slower but just last night, I ran cuda-memcheck --tool racecheck and I had to do a hard restart on my system.

Interestingly enough, it was complaining about thrust algorithms until it all froze up.

Do you have a system with a single GPU? Is that GPU used for both CUDA kernels and the GUI? Does your code check the status of every kernel launch? If kernels time-out and are being killed by the watchdog timer, and the app just continues issuing more CUDA work to the GPU I could imagine that unspecified “badness” will eventually happen with the driver. I have seen “lockups” caused by incorrect CUDA app behavior of that nature before, but if I waited 10 minutes the machine would typically recover.

Oh, is the problem because I’m only using one GPU? Oh my God, that makes so much more sense now. And yes, I’m using a single GPU system which is simultaneously running the GUI and CUDA code so that might also be why my code can get slower.

I’m imagining that if my code made out of bounds reads, it’d just crash, but then again, undefined behavior, right?

Alright, the next time I’ll give my computer more time to recover. I should also add some kernel checks as well… Hmm… Talk about amateur hour, right? XD

Looking at the code some more, it seems read efficiency could be poor since the loads go through a level of indirection, creating an irregular gather pattern.

const point p = points[pa_starts[tid]];
const point a = points[t.v[0]];
const point b = points[t.v[1]];
const point c = points[t.v[2]];
const point d = points[t.v[3]];

The use of restrict may help here if there is still reasonable locality. If the locality is poor and the final addresses essentially random then read efficiency will be poor and effective bandwidth could be down to ~10% of the theoretical throughput.

As for potential time-out events, I would check the return status of every launch, and terminate the app if a kernel fails. You can also try increasing the watchdog time-out limit. How to do that is system dependent.

In my case, points a through d can be read with poor locality. The points in the buffer are actually sorted by spatial locality and the only way to make sure each tetrahedron is made with spatially local points requires a huge algorithmic addition. But I had that similar thought earlier, that it’s because the tetrahedra do not exhibit good spatial locality that my code is suffering.

Oh man, adding the spatial refinement part of the code would be monumental. But I’ve done this same algorithm on the CPU and I did notice a huge speed-up when I did add the spatial refinement part of the algorithm.

Thank you for your advice and questions so far.

I don’t know how much data you have in total. CPUs have large, low-latency last-level caches (e.g. 10MB - 20MB) and your data may well get a decent hit ratio with those large caches, and the low latency of the access is helpful with indirection schemes. By contrast, GPUs have throughput-oriented architectures with small caches and relatively high latencies. The majority of the silicon resources in a GPU is targeted at computation, while the majority of silicon resources in a CPU is targeted at fast storage.

How much speedup (if any) are you currently seeing for your code GPU vs CPU?

I’m not seeing very good performance when compared to GPU vs CPU.

I think the way I have it now, my code is so poor in performance that the CPU is still favorable. However, I based my algorithm on a paper that claimed 10x (at the most) the performance of cgal’s Delaunay triangulation.

I’m thinking that spatial refinement for the tetrahedra will be required for spatial locality between the data. As of now, the tetrahedra are made out of points that are not necessarily ideal so that’s one cause of the bad input.

The package you pointed to in your original post, gDel3d, claims reasonable speedup when moving from CPU to GPU. That makes me wonder why that does not translate to your derived code. Are your problem sizes bigger, did you change the code significantly? It might be interesting to check whether you can reproduce the performance claimed by the authors of gDel3d using their unmodified code.

What specific CPUs and GPUs do you use for your own work, and what is the actual performance you see ? What tool chain do you use on the CPU? Are you using the latest CUDA version? 6.5 is the last stable release and 7.0 is in the release candidate stage at this time.

I assume you have already performed due diligence checking on your build to make sure you are not building with debug flags (-G in particular turns off all optimizations in the compiler) and you are building for the correct GPU architecture.

Right, well gDel3d does something interesting things with its data to make it so fast, I think.

The biggest is, when his code attempts to insert a point into the mesh, if it’s on a tetrahedron, he perturbs it so that it isn’t anymore. My code doesn’t do that. I handle points being on a tetrahedron directly.

The biggest challenge of this code is using the same instructions for truly different types of data.

I’m using a 750 Ti, built with CUDA 7.0 and I am definitely not using the -G compiler flag and building for cards with CC of 50.

It’s that gosh darn calculate_point_info() routine. I stole gDel3d’s geometric predicates which are really just Shewchuks geometric predicates modified for the GPU. I think one thing that might is to reduce the number of threads used when I call calculate_point_info() as I’m still using tpb which in my code is threads per block which is 256. I’m thinking of dropping it to 64.

Edit : I think I would get better performance if I didn’t have an X server running at the same time. I’ve noticed when I run my code for a point box of 16x16x16, Xorg maxes out a core of my CPU. I don’t think my code plays well with other stuff using the card.

Given that Maxwell-class GPUs support higher thread-block counts per SM, lowering the tread count per block may enable you to squeeze a bit more performance, but I would be surprised if it is more than 10%. I would suggest systematically trying thread counts in increments of 64.

However, if that one kernel is the biggest bottleneck in the code, you would want to focus on memory throughput. Per my previous analysis, the code should do fine for writes due to a nice regular access pattern, but likely suffers horribly for loads due to a “random access” pattern. The profiler analysis should be able to confirm that assessment if I didn’t make a mistake in my paper analysis.

While the GTX 750 Ti isn’t exactly slow, its memory bandwidth is not very impressive at 86 GB/sec. Depending on your host system configuration, the system memory has bandwidth of around 25 GB/sec [two DDR3 channels] or even 50 GB/sec [four DDR3 channels]. You may want to try your code on a GTX Titan Black with memory throughput of 336 GB/sec if you can get a hold of one (friend, colleague, public access) to get a second data point.

Lol I wish I knew someone with a Titan Black.

I did notice one thing about my mesh. I print it out at the end (I commented it out on github so if people installed it, it wouldn’t just dump everything to stdout).

I think I’m thrashing my cache like no other. Each tetrahedron contains locations of the points in the point buffer. Some of the tetrahedra are made up of points 1023, 5, 546, 1432. There’s no cache line in the world where those’d all be together.

I think I could also look at how to prevent this much thrashing.

Edit : What if I stored my point data (the actual point buffer) in texture memory instead? Once it’s written, it literally never ever needs to be altered ever again.

Edit edit : I’ve also been thinking of ordering my points by a Z-order curve instead of the Peano-Hilbert curve.

If you follow up on my suggestion to use the restrict attribute with pointer arguments to your kernels, there is a high likelihood that loads for read-only objects will turn into LDG instructions that read data through the texture path, without requiring you to set up explicitly bound textures (that have size limitations to boot). You can use cuobjdump --dump-sass to check what kind of loads you are getting. LDG indicates loads through the texture path, LD.E indicates regular loads. Obviously read-write type data will have to go through the regular load path.

Oh if that’s the case then using restrict didn’t give me much of a performance boost at all.


I think the only solution here is an algorithmic one. Thank you for continuing to offer advice though.

Recent versions of nvcc are sometimes able to determine the information necessary for the use of LDG without the presence of restrict, so you may have had LDG in your executable already without being aware of it.

Conversely, adding restrict does not guarantee that LDG will be used as this alone may not be sufficient to establish that use of LDG is safe. The only way to make 100% sure that LDG is used is by means of the __ldg() intrinsic.

Note that going through the texture path isn’t a cure-all. It may help is there is still some locality, but the cache is still fairly small and shared by many threads. Based on what I know about your code so far, it seems to have an almost total lack of locality in the read accesses, so none of the caches in the GPU are likely to provide much benefit.

You may want to take advantage of NVIIDA’s offer to test drive a K80:

Oh, that’d be amazing! Thank you for the link!