My CUDA Programming Lecture and Teaching of Poisson Parallel Surface reconstruction in a Summer Scho

Since NVIDIA has this topic I think its not bad putting here my 4:30 hours long lecture at a Summer School. Please leave your comments inside here. I will be tracking this topic.

This video is for educational purposes and should not be used for profit.

Just a nice image that what I teach actually work…this is the adaptive octree on the Stanford bunny. Its a 5 level octree and levels 1, 3, 5 are displayed. I still have the privilege to develop what I think…

Just a note. Reading from the specific paper having only as friend Google search is a torture in the brain. Only people with strong nerves should attempt it and moreover code it…A little gossip. :-)

I am also tempted to include the host code that builds this nice image. The kernels to do it are left as an exercise. I learned from this Forum a nice way to use double pointers. I believe we should invest a lot in demonstrating not to code lame code. This is why I am actually devoting some time to contribute in this Forum.

dev_octree_node** dnodeArrayd;
	cudaMalloc((void**)& dnodeArrayd, (D+1)*sizeof(dev_octree_node*));
	dev_octree_node** supportNode = (dev_octree_node**) malloc((D+1)*sizeof(dev_octree_node*));
	cudaMemcpy(supportNode, dnodeArrayd, (D+1)*sizeof(dev_octree_node*), cudaMemcpyDeviceToHost);
	dev_octree_node* dnodeArray;
	int* dnodeBaseAddress;
	int* startDepthIndex = (int*) malloc((D+1)*sizeof(int));
	int totalNumbNodes = 0;
	for (int d = D; d >= 0; d--){
		cudaMalloc((void**)& dnodeBaseAddress, numbUniqueNodes*sizeof(int));
		get_numbers_base_address(duniqueNodes, dnodeBaseAddress, numbUniqueNodes);
		thrust::device_ptr<int> thrstdbaseAddress = thrust::device_pointer_cast(dnodeBaseAddress);
		thrust::inclusive_scan(thrstdbaseAddress, thrstdbaseAddress + numbUniqueNodes, thrstdbaseAddress);
		int numbNodesd = d > 0 ? thrstdbaseAddress[numbUniqueNodes - 1] + 8 : 1;
		cudaMalloc((void**)& supportNode[D-d], numbNodesd*sizeof(dev_octree_node));
		populate_node_array_from_base_address(duniqueNodes, supportNode[D-d], dnodeBaseAddress, numbUniqueNodes, numbNodesd, currentNodeDepth);
		numbUniqueNodes = numbNodesd / 8;
		if (d > 0){
			cudaMalloc((void**)&duniqueNodes, numbUniqueNodes*sizeof(dev_octree_node));
			construct_next_level(supportNode[D-d], duniqueNodes, numbNodesd); 
		startDepthIndex[D-d] = totalNumbNodes;
		totalNumbNodes += numbNodesd;
		currentNodeDepth *= 2.0;
	cudaMalloc((void**)& dnodeArray, totalNumbNodes*sizeof(dev_octree_node));
	for (int i = 0; i <= D; i++){
		if (i < D)	cudaMemcpy(dnodeArray+startDepthIndex[i], supportNode[i], (startDepthIndex[i+1]-startDepthIndex[i])*sizeof(dev_octree_node), cudaMemcpyDeviceToDevice);
			cudaMemcpy(dnodeArray+startDepthIndex[i], supportNode[i], sizeof(dev_octree_node), cudaMemcpyDeviceToDevice);

The code to build the Look Up Table for the neighbor information is this :

int c[6][6][6];
	int p[6][6][6];
	int LUTparent[8][27];
	int LUTchild[8][27];
	int numbParent = 0;
	for (int k = 5; k >= 0; k -= 2)
		for (int i = 0; i < 6; i += 2)
			for (int j = 5; j >= 0; j -= 2){
				int numb = 0;
				for (int l = 0; l < 2; l++)
					for (int m = 0; m < 2; m++)
						for (int n = 0; n < 2; n++){
							c[i+m][j-n][k-l] = numb++;
							p[i+m][j-n][k-l] = numbParent;
	int numbLUT = 0;
	for (int k = 3; k > 1; k--)
		for (int i = 2; i < 4; i++)
			for (int j = 3; j > 1; j--){
				int numb = 0;
				for (int n = 1; n >= -1; n--)
					for (int l = -1; l <= 1; l++)
						for (int m = 1; m >= -1; m--){
							LUTparent[numbLUT][numb] = p[i+l][j+m][k+n];
							LUTchild[numbLUT][numb++] = c[i+l][j+m][k+n];

You can now build with my Lecture a very efficient algorithm for Marching Cubes. It can be the theory of H’oppe’s (I am putting the necessary tone :-) approach, you have the right cubes. Good Luck.

Also the feedback I got from students is : CUDA is easy, OpenCL is complicated. What you do in CUDA with 2 lines takes 10 lines in OpenCL. So CUDA is the way to go to Universities and to students. The main problem with CUDA is that it is NVIDIA specific, it is adapted for NVIDIA products. What I see as future for OpenCL is that vendors create languages on top of OpenCL targeted for their machines. If I have an AMD and I want to code for AMD why use a desctiption which is targeted for every machine…If I have an NVIDIA it is rather strange coding in OpenCL, it is like the priests who use a whip on themselves…

Thank you for posting.

I see no practical reason to use OpenCL. While AMD GPUs might be a slightly better value at times, when it comes down to actually using the available GPU capabilities CUDA is a better choice.

Yes, AMD GPUs are better for bitcoin mining, but ASICs are the new standard so who cares?

Nvidia spent the time and money making a good product, and marketing it directly to developers. They sponsor quite a bit of research and support the community directly with code samples. They sponsor online classes, and within those classes run contests for the fastest CUDA code(with K20s as prizes). I do not see AMD doing any of these things.

Also does OpenCL have as a standard equivalent libraries such as thrust, cuBLAS, and cuSPARSE ?

I use all three of those frequently, and they come with the SDK.

What about double precision with OpenCL? The K20 kicks ass in the 64 bit arena, and that is more useful than most people think. Half the code I have been writing lately using 64 bit numbers.

Plus, at the end of the day, I can also be pleased the Nvidia GPUs dominate the Video game benchmarks;

Take a look at the top 6 entries.

So I can work and play with the same set of GPUs, knowing I am getting great performance, and the support to use the product at maximum potential.

Thank you for your comment. Although I try to avoid cuSPARSE because it is memory bandwidth limited I do appreciate all other libraries, especially Thrust and cuBLAS. About 64bit arithmetic I used it in my previous project which I am submitting it for publication and its on Hyperspectral images.

From your writing I realize you are a young free spirit, we do need people like you to show their creativity in Parallel Programming. I guess you can make us bluss sometimes doing things in an exciting way.

Also a warning to a company. Be careful Microsoft Asia has made the specific way to build the adaptive octree a patent. Search for it. You have to pay Microsoft in order to use it commercially.

I will update this folder in June 2014. My Master student will be working on this topic. My purpose was to make him a ground he can seed on. Also due to my devotion to global education I have put it here also to help other students. I will also make two more threads on two exciting topics that two other of my students are working on. All in June 2014 and hopefully if they succeed.

I am happy to say that the experience programming in CUDA in Visual Studio 2012 has become a nice one. Intellisence is finally working as it is supposed to work in C++ and not just C# as in VS2010. It is highly recommended if you have no other choice than Windows to use VS2012 with CUDA 5.5. Still not perfect maybe it tries to much to predict or not but not exactly where it should be (I am talking about intelisence).

And a lecture in reconstruction can’t be complete without showing the reconstruction of the famous Armadillo. In order to reach this point there are plenty secrets in coding which I keep them for my University and a future paper. Yes they pay me enough to work this hard and produce these results…

With this class dismissed, my lecture might have bored you to death but it gives this result. I consider this image also as a scarecrow to trolls. Many exist unfortunately.

Reading the article : Give Better Presentations by Thinking of Yourself as a Mentor, I searched to find a nice yoda 3d model to reconstruct it with a prototype idea I am trying to publish. I found it and I read also how Paolo Cignoni’s MeshLab is doing uniform remeshing…which also gave me an excellent idea because in laser scanning you might have a triangulation of the range scans…Anyway here is my reconstructed model of yoda…

I am using MeshLab a lot I should say but just to set things in scale proportion. What MeshLab does in 5 minutes I can do in 1sec with this software I created…So HPC is the way to go in Computer Graphics…

Since I believe that some of this paper exercises can be given in Olympiads I will pose the problem of constructing the look-up table that the authors say, what I want to show is the beauty of Computer Science and how orthologic it is.

First why do the Authors say that for depth = 9 the size of the square LUT matrix is of dimension 2^10 - 1. This is simple is the sum of the geometric sequence 1 + 2 + 2^2 + … + 2^9. So the first question could be this. Now the challenge that me and my student have posed on who will create the best algorithm is (respect your student, he/she is not a slave work as a team) :

//Use a queue and a vector centers and LUT, initialize properly
//Get the centers in proper order :
int pow2 = 1;
float C = 0.0;
float W = 1.0; 
for (int d = 0; d <= D; d++){
    for (int i = 0; i < pow2; i++){
        Cq = queue.pop();
        queue.push(Cq - W/4);
        queue.push(Cq + W/4);
    W *= 0.5;
    pow2 *= 2;

int pow2i = 1;
for (int i = 0; i <= D; i++){
   offseti = pow2i - 1;
   int pow2j = 1;
   for (int j = 0; j <= D; j++){
      offsetj = pow2j - 1;
      for (int k = offseti; k < offseti + pow2i; k++)
         for (int l = offsetj l < offsetj + pow2j; l++)
            LUT[k][l] = inner_product(,W/pow2i], ,W/pow2j]);
      pow2j *= 2;
   pow2i *= 2;

Voila. Observe the beauty of algorithmic design. Everything in a compact algorithmic design…I consider now my lecture complete. When you see in a paper 4 Asian people from a Company/University that hire only top level then you have two options, run, or heavy yoga. :-) Speaking about Asia to new people who enter into the field : good luck in surpassing the “Ninja Gap”…:-) well its true isn’t it? Asian people work twice as hard as an average person…:-) Sometimes I try to do the same. Very difficult.

An example of having the GTX-Titan card with its 6GB of memory :
It can handle the famous Lucy model 15M of points in less time than it takes to load the datapoints in the CPU buffer from the disk. So the GTX-Titan is truly a beast. There are some issues on SM=3.5 concerning that the threads need to be lighter than SM=3.0 I have reported this problem and I am certain NVIDIA will fix it. Needless to say do not try this reconstruction in the CPU on day, do it at night and see the result in the morning…

Edit: Of course for professionals with a budget the Quadro K6000 is the best solution up to this date of the message. I needed to experiment with the GK110 processor so the GTX-Titan was the best solution for my pocket and as you can see it can really thrive and I am really prowd of maxing this GPU out and reaching it to its limits. Unfortunately Thrust could not follow in this max-out it hanged I had to create my own kernels for everything…
It will be also a challenge coding for SM=3.5.