Using CUDA for solving a metallurgical simulation: it's a metallurgical program for simulating

My name is Saeed Mohammadi. I’m a C++ programmer and a beginner in CUDA. I’m working on a program that simulates the solidification process in metallurgy. We already have a complete CPU program that works correctly. When we found CUDA on the NVIDIA web site, we started discussing how we could use it to increase performance.

I have a 3D grid indexed by x, y, and z, of arbitrary size. The total size (x·y·z) is usually large. I have some arrays for storing the “Temperature of the point (x, y, z)”, “Material ID”, and some parameters such as “NewTemperature”, “K”, “Ro”, “Cp”, “Fs”, etc. Some of these arrays are built once at startup and only read afterwards, while others are built at startup but must be updated many times. I also have some other parameters that must be generated during execution. In our CPU-based program, we have a while loop containing nested for loops like this:




    for (z; z < limit.z; z += direction.z) {
        for (x; x < limit.x; x += direction.x) {
            for (y; y < limit.y; y += direction.y) {
                // starting the process to find the maxTemp
            }
            // changing the direction of y
        }
        // changing the direction of x
    }
    // changing the direction of z



What is the goal?

In this simulation we have a mold that is discretized into a mesh indexed by x, y, and z. We want to find out when the cast’s temperature drops to 540 °C. Only non-boundary positions take part in the process, so x=0 and x=limit are not involved. Any internal position (x, y, z) has 6 neighbors, one step away along each axis: x+1, x-1, y+1, y-1, z+1, z-1.

The loop continues until the maxTemp increases, so there are at least hundreds of iterations. At the end of each loop we reverse the stepping direction (forward or backward) with the ‘direction’ parameter, which has the value “1” or “-1”. In the loop over ‘y’ we find the 6 neighbors of the base position, their “Material ID”, their temperature, and the value of “K” for each of them by calling a function with these parameters as arguments. We also have to check their “Material ID” for some calculations, along with some other local parameters.

These are just local variables with the lifetime of a thread. I have to find the new temperature of the base position and store it at its corresponding index in the “NewTemperature” array, which I think lives in global memory and has to be copied back to the host.
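For readers following along, here is a minimal CPU-side sketch of one sweep over the interior points as described above: read the base position and its 6 neighbors from the old temperatures, write the result into a separate new-temperature array, and track the maximum. The names `Grid`, `sweepInterior`, and `alpha` are hypothetical, and the update formula is only an assumption (a plain explicit diffusion step); the real program uses “K”, “Ro”, “Cp”, “Fs”, and “Material ID”, which are not shown in the post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical grid description; not taken from the original program.
struct Grid {
    int nx, ny, nz;
    // Flat index of (x, y, z), with x varying fastest.
    std::size_t at(int x, int y, int z) const {
        return (static_cast<std::size_t>(z) * ny + y) * nx + x;
    }
};

// One sweep over the interior points. Reads only `temp`, writes only `newTemp`,
// and returns the maximum new temperature over the interior.
// ASSUMPTION: a single constant `alpha` stands in for the real material formula.
float sweepInterior(const Grid& g, const std::vector<float>& temp,
                    std::vector<float>& newTemp, float alpha) {
    assert(temp.size() == static_cast<std::size_t>(g.nx) * g.ny * g.nz);
    assert(newTemp.size() == temp.size());
    float maxTemp = -std::numeric_limits<float>::infinity();
    // Boundary layers (index 0 and n-1 on each axis) are skipped.
    for (int z = 1; z < g.nz - 1; ++z) {
        for (int y = 1; y < g.ny - 1; ++y) {
            for (int x = 1; x < g.nx - 1; ++x) {
                const float base = temp[g.at(x, y, z)];
                const float neighborSum =
                    temp[g.at(x + 1, y, z)] + temp[g.at(x - 1, y, z)] +
                    temp[g.at(x, y + 1, z)] + temp[g.at(x, y - 1, z)] +
                    temp[g.at(x, y, z + 1)] + temp[g.at(x, y, z - 1)];
                const float updated = base + alpha * (neighborSum - 6.0f * base);
                newTemp[g.at(x, y, z)] = updated;
                maxTemp = std::max(maxTemp, updated);
            }
        }
    }
    return maxTemp;
}
```

Because every point reads only the old array and writes only the new one, the points within a sweep do not depend on each other, which is the property that lets each point be handled by its own GPU thread.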


I tested the old program and started to port it to CUDA. So far I have found several ways of calling the kernel:

  1. Performance=Disaster: loop on the host and call a single-thread kernel.

  2. Medium performance: call a __global__ kernel where each thread handles one “z” value. For each threadIdx.x (that is, each “z”), call a __device__ function that loops over “x” and “y”.

  3. Best performance but incomplete: define nBlocks and blockSize, then call a __global__ kernel where each thread handles one arbitrary position and does the calculation for it.

Errors & Problems

When I use the second way above, my program works and gives the correct result, but it is much slower than the old CPU program. So I started coding the third way. I know that loops and if-else branches hurt performance badly. I tried to reduce redundancy and removed some loops and if-else conditions, but a few remain. It seems to be all right, but I get 2 errors:

  1. Kernel invocation error: “too many resources requested for launch”

  2. Writing to my NewTemperature[index] array fails.

My Question

This solidification program has less data and fewer parameters than the other programs in my metallurgical simulation series. If I change the structure of the input data, maybe I can get past this error. In the middle of my if-else branches and loops, when a particular condition occurs, I have to change the value of NewTemperature[index], where index is computed as blockIdx.x*blockDim.x+threadIdx.x. The value does not get set. So:

Question: Is there any limitation on calling device kernels? What about the input arrays?
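As an illustration of the one-thread-per-position scheme, the following host-side C++ sketch emulates how an index of the form blockIdx.x*blockDim.x+threadIdx.x could be mapped back to (x, y, z) and guarded. All names (`blocksNeeded`, `decode`, `isInterior`, `countWorkingThreads`) are hypothetical, and the x-fastest memory layout is an assumption. It only shows the two guards such a kernel typically needs: one for threads past the end of the array, and one for boundary positions.

```cpp
#include <cassert>
#include <cstddef>

// Number of blocks so that blocks * blockSize >= totalPoints (ceiling division).
std::size_t blocksNeeded(std::size_t totalPoints, std::size_t blockSize) {
    assert(blockSize > 0);
    return (totalPoints + blockSize - 1) / blockSize;
}

struct Pos { int x, y, z; };

// Inverse of index = (z * ny + y) * nx + x.
// ASSUMPTION: x varies fastest in memory.
Pos decode(std::size_t index, int nx, int ny) {
    Pos p;
    p.x = static_cast<int>(index % nx);
    p.y = static_cast<int>((index / nx) % ny);
    p.z = static_cast<int>(index / (static_cast<std::size_t>(nx) * ny));
    return p;
}

// True when the position is not on any face of the grid.
bool isInterior(Pos p, int nx, int ny, int nz) {
    return p.x > 0 && p.x < nx - 1 &&
           p.y > 0 && p.y < ny - 1 &&
           p.z > 0 && p.z < nz - 1;
}

// Emulates the launch on the host: the two loops play the role of
// blockIdx.x and threadIdx.x. Returns how many threads do real work.
std::size_t countWorkingThreads(int nx, int ny, int nz, std::size_t blockSize) {
    const std::size_t total = static_cast<std::size_t>(nx) * ny * nz;
    std::size_t working = 0;
    for (std::size_t blockIdx = 0; blockIdx < blocksNeeded(total, blockSize); ++blockIdx) {
        for (std::size_t threadIdx = 0; threadIdx < blockSize; ++threadIdx) {
            const std::size_t index = blockIdx * blockSize + threadIdx;
            if (index >= total) continue;  // last block may be partly empty
            if (!isInterior(decode(index, nx, ny), nx, ny, nz)) continue;
            ++working;  // a real kernel would compute NewTemperature[index] here
        }
    }
    return working;
}
```

For a 5x5x5 grid and a block size of 64, two blocks (128 threads) are launched for 125 points, and only the 27 interior points pass both guards.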

I know that I cannot describe my whole process in one page, and that you don’t have much time to read this, but I’m counting on your experience. Thank you for your time and consideration. I look forward to a kind response from you.

Yours Sincerely,

Saeed Mohammadi

“Too many resources requested for launch” means that your block size is too big for the resources your kernel needs. Remember that each multiprocessor has 8192 registers and 16384 bytes of shared memory, so if your kernel uses, for example, 40 registers per thread, you won’t be able to launch blocks larger than (8192/40)=204 threads.
In general I would suggest that you use fixed block dimensions (do not use “z” as the number of threads).
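The register arithmetic above can be written out as a small C++ helper. This is only a rough sketch: the function names are hypothetical, the 8192-register figure is the one quoted above, and it ignores shared memory and any allocation granularity the hardware applies. Rounding down to a multiple of the warp size (32 threads) is a common convention, not a requirement.

```cpp
#include <cassert>

const int kWarpSize = 32;  // threads per warp on NVIDIA GPUs

// Largest thread count whose registers fit in one multiprocessor.
int maxThreadsByRegisters(int registersPerMultiprocessor, int registersPerThread) {
    assert(registersPerThread > 0);
    return registersPerMultiprocessor / registersPerThread;
}

// Round a thread count down to a whole number of warps.
int roundDownToWarp(int threads) {
    return (threads / kWarpSize) * kWarpSize;
}

// True when a block of `blockSize` threads fits in the register budget.
bool blockFits(int registersPerMultiprocessor, int registersPerThread, int blockSize) {
    return blockSize * registersPerThread <= registersPerMultiprocessor;
}
```

With 8192 registers and 40 registers per thread, `maxThreadsByRegisters` gives 204 and `roundDownToWarp` gives 192, so a block of 192 threads fits while 256 does not. The actual register count per thread is reported by the compiler (for example with nvcc's `--ptxas-options=-v`).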

If you have poor performance you should check the basic optimization strategies (see the CUDA Programming Guide): use coalesced global memory accesses, consider using shared memory, consider using textures for fetching data, and so on.

Your problem description is a bit vague. It sounds like you’re computing a function over XYZ position, and you keep iterating until some threshold temperature is reached.
Is the work being done basically a search for that temperature, or is the real work some kind of expensive formula (and looking for a threshold is just a minor aspect)?
Also, most critically, does your compute use information from a position XYZ as well as perhaps the 6 or 26 nearest neighbors? [That changes your behavior a lot!]

For a “generic” kind of heat diffusion computation over a grid, a good way to start would be to break the domain up into subcubes, and each CUDA block computes one subcube.
You read in old temperatures/XYZ data from a 3D texture fetch (which stays local since your blocks are small ranges in space). You do your compute, and write your updated value into a new global memory location. (Fancy tricks to cache writes might be useful if your data is small). At the end of the iteration, each block writes its max temp over all its positions to a global memory array.
The kernel finishes, and now on the CPU you launch the kernel again, using the just written array as your new input texture… double buffering the array basically. You can scan through the max temps on the CPU to see if you’ve hit the threshold (this can be done on the GPU instead if you like but that’s probably not necessary.)
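The control flow described above (each block records its own max, the host swaps the buffers and scans the per-block maxima) can be sketched on the CPU like this. It is not CUDA code: `stepOnce` merely stands in for one kernel launch, the grid is flattened to 1D for brevity, and the physics is a placeholder (multiplying by a hypothetical `coolFactor`). All names are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Stand-in for one kernel launch: each "block" of blockSize cells reads `in`,
// writes `out`, and records its own maximum in blockMax[b].
void stepOnce(const std::vector<float>& in, std::vector<float>& out,
              std::vector<float>& blockMax, std::size_t blockSize, float coolFactor) {
    for (std::size_t b = 0; b < blockMax.size(); ++b) {
        float m = -std::numeric_limits<float>::infinity();
        const std::size_t end = std::min(in.size(), (b + 1) * blockSize);
        for (std::size_t i = b * blockSize; i < end; ++i) {
            out[i] = in[i] * coolFactor;  // ASSUMPTION: placeholder for the real update
            m = std::max(m, out[i]);
        }
        blockMax[b] = m;
    }
}

// Repeats the step until the overall max drops to `threshold` or below.
// Returns the number of iterations used, or -1 if maxIters was not enough.
int iterateUntilBelow(std::vector<float> temps, float threshold,
                      std::size_t blockSize, float coolFactor, int maxIters) {
    assert(blockSize > 0 && !temps.empty());
    std::vector<float> next(temps.size());
    std::vector<float> blockMax((temps.size() + blockSize - 1) / blockSize);
    for (int iter = 1; iter <= maxIters; ++iter) {
        stepOnce(temps, next, blockMax, blockSize, coolFactor);
        std::swap(temps, next);  // double buffering: the output becomes the next input
        // Host-side scan over the (small) per-block maxima.
        const float maxTemp = *std::max_element(blockMax.begin(), blockMax.end());
        if (maxTemp <= threshold) return iter;
    }
    return -1;
}
```

The point of the per-block max array is that the host only has to look at one value per block after each launch, instead of scanning the whole temperature field.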

Depending on your compute, this will likely all be memory bandwidth limited, so using texture lookups is important. If you’re REALLY memory bottlenecked, then you have to start making your own caching strategies with shared memory, which isn’t hard but takes extra planning and code.

Dear SPWorley & AndreiB


Thank you for your time and consideration. I decreased the blockSize from 512 to 256 and it worked! Now it’s working with good performance, and I’m working on Huge Performance :thumbup: . I will tell you more about my program.

Thanks again