CUDA-Based Level Set PDE Reinitialization: please point out what's wrong

Hello everyone, I am trying to use CUDA to reinitialize a signed distance function from a zero-contour field with a forward Euler PDE method. Here is my CUDA kernel code:

texture<float, 3, cudaReadModeElementType> Tex3DRef;

#define M_PI 3.14159265358979323846f

// Smeared Heaviside function with half-width Eps
__device__ float Heaviside(float Phi, float Eps)
{
	if (Phi > Eps)
		return 1.0f;
	if (Phi < -Eps)
		return 0.0f;
	return 0.5f*(1.0f + Phi/Eps + 1.0f/M_PI*sinf(M_PI*Phi/Eps));
}

// Smeared sign function derived from the Heaviside
__device__ float Sign(float Phi, float Eps)
{
	return 2.0f*(Heaviside(Phi, Eps) - 0.5f);
}

// One forward Euler step of dPhi/dt = S(Phi)*(1 - |grad Phi|)
__global__ void Reinitialization3D(float* Phi, float Eps, unsigned Res)
{
	// Recover the (X, Y, Z) cell from the 2D launch configuration
	int BlockSize = blockDim.x*blockDim.y;
	int BlockOffset = blockIdx.x*BlockSize;
	int GridOffset = Res*Res*blockIdx.y;
	int ThreadOffset = threadIdx.y*blockDim.x + threadIdx.x;
	int Idx = ThreadOffset + BlockOffset + GridOffset;
	int Z = blockIdx.y;
	int XY = ThreadOffset + BlockOffset;
	int X = XY % Res;
	int Y = (XY - X) / Res;
	float delta = 1.0f/(float)Res;
	float dt = delta*0.001f;

	// Fetch the cell and its six neighbors (texel centers sit at +0.5)
	float3 Coord = make_float3((float)X + 0.5f, (float)Y + 0.5f, (float)Z + 0.5f);
	float C = tex3D(Tex3DRef, Coord.x, Coord.y, Coord.z);
	float L = tex3D(Tex3DRef, Coord.x - 1.0f, Coord.y, Coord.z);
	float R = tex3D(Tex3DRef, Coord.x + 1.0f, Coord.y, Coord.z);
	float D = tex3D(Tex3DRef, Coord.x, Coord.y - 1.0f, Coord.z);
	float U = tex3D(Tex3DRef, Coord.x, Coord.y + 1.0f, Coord.z);
	float B = tex3D(Tex3DRef, Coord.x, Coord.y, Coord.z - 1.0f);
	float F = tex3D(Tex3DRef, Coord.x, Coord.y, Coord.z + 1.0f);

	// Central differences for the gradient
	float dPhidX = 0.5f*(R - L)/delta;
	float dPhidY = 0.5f*(U - D)/delta;
	float dPhidZ = 0.5f*(F - B)/delta;
	float Len = sqrtf(dPhidX*dPhidX + dPhidY*dPhidY + dPhidZ*dPhidZ);

	float S = Sign(C, Eps);
	Phi[Idx] = C + S*(1.0f - Len)*dt;
}

In my main program, I copy the updated data from Phi back into the texture array at each step:

CUDA_MEMCPY3D XchgDesc;
memset(&XchgDesc, 0, sizeof(CUDA_MEMCPY3D));
XchgDesc.srcMemoryType = CU_MEMORYTYPE_DEVICE;   // linear buffer Phi written by the kernel
XchgDesc.srcPitch = Pitch;
XchgDesc.srcDevice = Phi;
XchgDesc.dstMemoryType = CU_MEMORYTYPE_ARRAY;    // 3D array behind Tex3DRef
XchgDesc.dstArray = Tex3DArray;
XchgDesc.WidthInBytes = LineSize;
XchgDesc.Height = Height;
XchgDesc.Depth = Depth;

cuFuncSetBlockShape(Func, 16, 16, 1);
while (k--)
{
	cuLaunchGrid(Func, Width*Height/256, Depth);  // one Euler step
	cuMemcpy3D(&XchgDesc);                        // feed the result back into the texture
}
cuMemcpyDtoH(TentField.data(), Phi, TotalSize);
cuMemFree(Phi);
cuArrayDestroy(Tex3DArray);

But the 2D result seems a little wrong: the size of each dimension is 1.0, so the distance at the center should be about -0.5 or some other value of that magnitude. Is there anyone with experience who would like to help me? Thanks!

I have only a sketchy understanding of what the reads from texture memory will give you, but the differencing scheme you are using looks suspiciously like central differences. Of course, I could be misunderstanding what the texture read does. If you are using central differences, you probably shouldn't be. You probably want an upwind scheme (perhaps like Godunov's scheme) or an ENO or WENO approximation to compute the derivatives. A poor initial signed distance field can contain numerical shocks, which will cause a lot of numerical instability in a simple central scheme.
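For illustration, a first order Godunov upwind approximation of the gradient magnitude would look something like the sketch below, reusing the same neighbor fetches (C, L, R, D, U, B, F) and spacing delta as in your kernel. This is untested and only first order; an ENO/WENO scheme would replace the raw one-sided differences with higher order ones.

// Rough sketch of a Godunov upwind |grad Phi| (untested); the choice of
// one-sided difference in each axis depends on the sign S of the field.
__device__ float GodunovGradLen(float C, float L, float R,
                                float D, float U, float B, float F,
                                float S, float delta)
{
	// Backward and forward one-sided differences per axis
	float xm = (C - L)/delta, xp = (R - C)/delta;
	float ym = (C - D)/delta, yp = (U - C)/delta;
	float zm = (C - B)/delta, zp = (F - C)/delta;

	float gx, gy, gz;
	if (S > 0.0f) {
		gx = fmaxf(fmaxf(xm, 0.0f)*fmaxf(xm, 0.0f), fminf(xp, 0.0f)*fminf(xp, 0.0f));
		gy = fmaxf(fmaxf(ym, 0.0f)*fmaxf(ym, 0.0f), fminf(yp, 0.0f)*fminf(yp, 0.0f));
		gz = fmaxf(fmaxf(zm, 0.0f)*fmaxf(zm, 0.0f), fminf(zp, 0.0f)*fminf(zp, 0.0f));
	} else {
		gx = fmaxf(fminf(xm, 0.0f)*fminf(xm, 0.0f), fmaxf(xp, 0.0f)*fmaxf(xp, 0.0f));
		gy = fmaxf(fminf(ym, 0.0f)*fminf(ym, 0.0f), fmaxf(yp, 0.0f)*fmaxf(yp, 0.0f));
		gz = fmaxf(fminf(zm, 0.0f)*fminf(zm, 0.0f), fmaxf(zp, 0.0f)*fmaxf(zp, 0.0f));
	}
	return sqrtf(gx + gy + gz);
}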

Again, I might be misinterpreting your code, but at first inspection your CFL criterion looks somewhat unusual. I would have thought that a suitable time step in 3D on a uniform grid at unit speed would be something like delta/6 rather than delta/1000, which means you are going to run a huge number of integration time steps to get the signed distance field to equilibrate. I could be completely off base with this as well.
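To put rough numbers on it (my back-of-the-envelope estimate, not a derivation for your exact scheme):

// At unit speed on a uniform 3D grid, a conservative CFL bound is roughly
// dt*(1/dx + 1/dy + 1/dz) <= 1/2, i.e. dt ~ delta/6 when dx = dy = dz = delta.
float dt_cfl  = delta/6.0f;     // plausible stable step
float dt_post = delta*0.001f;   // the step in the posted code
// The ratio is about 167x, so the posted step needs ~167x more iterations
// to carry distance information the same depth into the volume.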

I have done a lot of work with the LSM, but I haven't yet tried a CUDA implementation. A CUDA-based fast sweeping method is on the to-do list…

I use the texture reference to handle the boundary, letting the GPU clamp the values at the border. The CFL dt is just for testing; I think a smaller dt should still work, only with more iterations.
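For reference, the texture setup on the driver API side looks roughly like this (a sketch; Module is the loaded module handle, names assumed):

CUtexref TexRef;
cuModuleGetTexRef(&TexRef, Module, "Tex3DRef");
cuTexRefSetArray(TexRef, Tex3DArray, CU_TRSA_OVERRIDE_FORMAT);
// Clamp in all three dimensions so out-of-range fetches return the border cell
cuTexRefSetAddressMode(TexRef, 0, CU_TR_ADDRESS_MODE_CLAMP);
cuTexRefSetAddressMode(TexRef, 1, CU_TR_ADDRESS_MODE_CLAMP);
cuTexRefSetAddressMode(TexRef, 2, CU_TR_ADDRESS_MODE_CLAMP);
cuTexRefSetFilterMode(TexRef, CU_TR_FILTER_MODE_POINT);
cuTexRefSetFormat(TexRef, CU_AD_FORMAT_FLOAT, 1);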

Thanks for your reply and I will try the HJ WENO.

My point about the CFL calculation was mostly that with such a short time step, you are going to have to run a lot of time steps, and you are going to need a suitable metric to know when to stop the iteration. If you stop too soon, the signed distance field won't have evolved to the "equilibrium" solution and the results will be incorrect.
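A minimal sketch of such a metric, assuming you keep a host copy of the previous field (PhiOld/PhiNew are hypothetical names):

// Stop when the largest per-cell change is a small fraction of the grid
// spacing, rather than after a fixed iteration count.
float MaxChange = 0.0f;
for (size_t i = 0; i < N; ++i)
	MaxChange = fmaxf(MaxChange, fabsf(PhiNew[i] - PhiOld[i]));
if (MaxChange < 1.0e-3f*delta)
	break;  // field has effectively reached equilibrium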

I think a fifth order HJ WENO is probably overkill just for distance or redistance calculations. A first or second order ENO scheme with a second order TVD Runge-Kutta time integrator should be sufficient to get a very useful signed distance field. There is an older paper on the redistancing equation by Sussman and Fatemi (probably in the J. Computational Physics) which covers the topic well. I usually initialize signed distance fields with a fast marching method and have only occasionally found it necessary to go to a second order spatial approximation, let alone higher (there are two good papers by Chopp on the FMM and initializing or reinitializing distance fields that are also worth a look).
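A second order TVD Runge-Kutta step can be assembled from the single Euler stage you already have; a sketch (EulerStep is a hypothetical kernel wrapping whatever spatial scheme you choose, with runtime launch syntax used for brevity):

// Averaging kernel for the TVD RK2 (Heun) combination
__global__ void TVDAverage(float* Out, const float* A, const float* B, int N)
{
	int i = blockIdx.x*blockDim.x + threadIdx.x;
	if (i < N)
		Out[i] = 0.5f*(A[i] + B[i]);
}

// phi1 = phi0 + dt*L(phi0); phi2 = phi1 + dt*L(phi1); phi0 <- (phi0 + phi2)/2
EulerStep<<<Grid, Block>>>(Phi1, Phi0, dt);
EulerStep<<<Grid, Block>>>(Phi2, Phi1, dt);
TVDAverage<<<Grid, Block>>>(Phi0, Phi0, Phi2, N);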

Yes, for testing I manually chose a number of iterations. I found the paper you mentioned, "An efficient, interface preserving level set re-distancing algorithm and its application to interfacial incompressible fluid flow". And what do you think of the FMM implemented on CUDA? I think that would be a little difficult to do on CUDA, but maybe I am wrong. Is there any fast GPU implementation of the LSM? I know of lsmlib, but it's CPU-based for now.

I need an SDF from a tent function, as in the PIXAR paper "Volumetric Methods for Simulation and Rendering of Hair". They solve for the level set function from the initialized scalar function.

Fast marching methods are really unsuited to parallel implementation generally, and in CUDA doubly so. There is one version of the FMM for parallel distributed computers that I am aware of, but its approach is even further from the CUDA paradigm than the standard heap sort model. Fast sweeping methods are considerably more promising, and there are a couple of papers about "old school" GPGPU implementations in OpenGL. In the sweeping methodology, you can compute each sweeping direction in parallel and then use a parallel reduction to combine the monotone solutions, as sketched below.
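For the combining step, something like this minimal sketch would do (Dir holds the NumDirs candidate fields from the individual sweep orderings; the layout is a hypothetical one):

// Pointwise reduction over sweep directions: for a distance computation
// the correct (viscosity) solution at each cell is the smallest candidate.
__global__ void CombineSweeps(float* Out, const float* Dir, int NumDirs, int N)
{
	int i = blockIdx.x*blockDim.x + threadIdx.x;
	if (i >= N)
		return;
	float Best = Dir[i];
	for (int d = 1; d < NumDirs; ++d)
		Best = fminf(Best, Dir[d*N + i]);
	Out[i] = Best;
}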

Right now I am still using a serial FMM code on the host CPU, which is truly O(n log n), because I don't believe I yet have a redistancing algorithm for the GPU that is any faster.

In fact, I am also thinking about memory, because the volume could be quite large (256^3), and while we can allocate more than 2 GB in 64-bit mode on the CPU side, we can't do that on the GPU yet.