How to optimize a memory intensive ray-marching kernel?

Hi,

I am writing a special kernel for a ray-marching-like application. My aim is to use ray marching to generate a shadow-map-like texture for a height map terrain. The setup is as follows: I have a 2D float array in global memory, which is to be mapped to a Direct3D9 texture. A single CUDA thread traces a single ray from the sun across the height map. For an NxN texture there are N rays; each ray fills exactly one column of the array, i.e. the i-th ray fills the i-th column. Each ray records "shadow height" information into its column as it marches through the terrain. For a 1024x1024 terrain, the 2D array's size varies between 1024x1024 and 1415x1415 depending on the sun ray direction. (I rotate the terrain so that the sun ray's projection onto the xz plane becomes parallel to the z axis and perpendicular to the x axis.) I create blocks of 256 threads in a 1D fashion, so the first block processes columns [0-255], the second processes [256-511], and so on. For a 1415x1415 array, 6 blocks are enough; a launch sketch is given below.
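For reference, here is a minimal sketch of the launch configuration I just described (the host-side variable names are placeholders, not taken from my actual code):

// Minimal launch sketch: one thread per ray/column, 256 threads per block.
// totalRayCount depends on the sun direction (1024..1415 in my case).
const int totalRayCount   = 1415;
const int threadsPerBlock = 256;
const int blockCount      = (totalRayCount + threadsPerBlock - 1) / threadsPerBlock; // 6 here

NewShadowKernel<<<blockCount, threadsPerBlock>>>(d_input, d_shadowOutput,
                                                 shadowMapPitch, totalRayCount);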

This CUDA-accelerated algorithm works fairly fast, reaching 250-300 fps in my 3D terrain application on my GeForce GTS 450, but I want to optimize it further. Here is the kernel code:

__global__ void NewShadowKernel(RayMarchInitInfo* input, float* shadowOutput, size_t shadowMapPitch, int totalRayCount)
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if(tid >= totalRayCount)
        return;

    int i;

    float Pi;
    float Ri;
    float dy;
    float* row;

    float Pi_1        = input[tid].P0;
    float firstDeltaY = input[tid].firstDeltaY;
    float sample_x    = input[tid].firstSamplingPointX;
    float sample_z    = input[tid].firstSamplingPointZ;
    int   colStart    = input[tid].columnStart;
    int   colEnd      = input[tid].columnEnd;

    float dx     = const_dx;
    float deltaY = const_deltaY;
    float dz     = const_dz;

    //Init algorithm
    Pi = tex2D(heightMap, sample_x, sample_z);
    Ri = Pi_1;

    //Main body of the algorithm
    for(i = 0; i <= colEnd; i++)
    {
        row = (float*)((char*)shadowOutput + i * shadowMapPitch);

        //If the column has not started yet, write 0s.
        if(i < colStart)
            row[tid] = 0.0f;
        else //i >= colStart
        {
            if(i == colStart)
                dy = firstDeltaY;
            else
                dy = deltaY;

            if(Pi_1 + dy > Pi) //Backfacing
            {
                //The previous height map sample can cast shadow on the current sample.
                //Update the current shadow height if the previous sample is higher than it.
                Ri = fmaxf(Ri, Pi_1);
            }

            Ri = Ri + dy;
            row[tid] = fmaxf(0.0f, Ri - Pi);

            //Get the next sample.
            Pi_1 = Pi;
            sample_x = sample_x + dx;
            sample_z = sample_z + dz;
            Pi = tex2D(heightMap, sample_x, sample_z);
        }
    }
}

Here is what I have done so far to optimize it:

1- I stored the original terrain height map data in a texture, so that all memory reads are texture lookups via tex2D (sketched after this list).

2- I created a pitch-linear buffer with cudaMallocPitch to store the data each thread computes. Every thread that traces a ray writes into a single column of this 2D buffer, and all threads start at the first row. The aim is to make the memory writes coalesced: consecutive threads write to consecutive locations within a single row. (A sketch of this setup is below.)
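For clarity, a minimal sketch of both steps follows; the host-side names, dimensions and the float channel format are placeholders rather than my exact initialization code:

// Height map as a 2D texture (read via tex2D in the kernel).
texture<float, 2, cudaReadModeElementType> heightMap;

// --- host side ---
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray* heightArray = 0;
cudaMallocArray(&heightArray, &desc, terrainWidth, terrainHeight);
cudaMemcpyToArray(heightArray, 0, 0, h_heights,
                  terrainWidth * terrainHeight * sizeof(float),
                  cudaMemcpyHostToDevice);
cudaBindTextureToArray(heightMap, heightArray, desc);

// Pitch-linear output buffer: one column per ray, rows written front to back.
float* d_shadowOutput = 0;
size_t shadowMapPitch = 0;
cudaMallocPitch((void**)&d_shadowOutput, &shadowMapPitch,
                totalRayCount * sizeof(float),   // width in bytes (one float per ray)
                rowCount);                       // number of rows (march steps)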

I profiled this kernel with the Compute Visual Profiler and saw that its memory throughput is 8.68 GB/s, while the peak throughput is 57.73 GB/s. This suggests the kernel is using memory inefficiently and can be optimized further. So I am trying to see what I am doing wrong, and I have some questions:

1- I thought that having all threads write to the same row would make the memory writes coalesced, but the low memory throughput makes me think otherwise. What causes this kernel to issue uncoalesced writes? There is an if-branch inside the for loop: one thread can execute "row[tid]=0.0f" while another thread in the same warp takes the other branch and executes "row[tid]=fmaxf(0.0f,Ri - Pi)". Can this branch break the coalesced write to global memory? In other words, if two threads in the same warp write to consecutive memory locations, but from different branches of the same conditional, is the write still coalesced?

2- After writing the data into the pitch-linear buffer, I copy it to the CUDA array that is mapped to a Direct3D9 texture using cudaMemcpy2DToArray. I know that device-to-device transfers are fast, but I am still copying 4 to 7 MB of data every frame. Is there any way to write directly to the texture-mapped CUDA array itself instead of going through an intermediate buffer? I believe this would make a big difference. (The copy I currently do is sketched below.)
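Concretely, the per-frame copy looks roughly like this (variable names are placeholders; d_shadowArray stands for the cudaArray I get from the D3D9 interop mapping):

// Copy the pitch-linear shadow buffer into the CUDA array backing the D3D9 texture.
cudaMemcpy2DToArray(d_shadowArray,
                    0, 0,                              // destination offsets
                    d_shadowOutput, shadowMapPitch,    // source pointer and its pitch
                    totalRayCount * sizeof(float),     // width in bytes
                    rowCount,                          // number of rows
                    cudaMemcpyDeviceToDevice);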

3- This question is about the general behavior of threads in a warp when they encounter a branch. How exactly does CUDA handle threads of the same warp diverging into different branches of the same conditional? The programming guide says the threads become serialized in this case, but that did not entirely make sense to me. Does it work like this: say an if statement has three different paths, A, B and C. Some threads in the warp take path A, some take B, and the rest take C. C takes longer to complete than B, and B longer than A. So the threads on path A finish first and wait for the others; then the B threads finish and wait for the C threads; once the C threads return, all threads in the warp are synchronized again and execution continues. Is that how CUDA behaves? (A small sketch of what I mean is below.)
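As a minimal illustration of the pattern I am asking about (a toy kernel made up just for this question, not part of my application):

__global__ void DivergenceExample(const int* path, float* out, int n)
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if(tid >= n)
        return;

    float v;
    if(path[tid] == 0)          // path A: cheapest
        v = 1.0f;
    else if(path[tid] == 1)     // path B: some work
        v = sinf(tid * 0.01f);
    else                        // path C: imagine the most work here
        v = sinf(tid * 0.01f) * cosf(tid * 0.02f);

    // My understanding: the warp executes each taken path one after another,
    // masking off the threads that did not take it, and all threads of the
    // warp proceed together again after the conditional. Is that correct?
    out[tid] = v;
}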

4- Other than those, what else can be done to speed up this kernel? Any advice is welcome.

Thanks in advance.

Seems quite optimized already. I don't think the memcpy will be that expensive.
Not sure why row[tid] is overwritten inside the loop, though. Is it possible to use a temporary variable "rowTemp" and write it to row[tid] once after the loop?

You might be tex2D-sampler bound rather than memory-bandwidth bound; that happens when the texture cache hit rate is good. Using float16 or even uchar for the height map might improve performance in that case (see the sketch below).
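For example, something along these lines would let the hardware convert an 8-bit height map to a normalized float on read (only a sketch of the idea, with assumed names; you would rescale the heights accordingly):

// 8-bit height map, returned by tex2D as a float in [0,1].
texture<unsigned char, 2, cudaReadModeNormalizedFloat> heightMap8;

// --- host side ---
cudaChannelFormatDesc desc8 = cudaCreateChannelDesc<unsigned char>();
cudaArray* heightArray8 = 0;
cudaMallocArray(&heightArray8, &desc8, terrainWidth, terrainHeight);
cudaMemcpyToArray(heightArray8, 0, 0, h_heights8,
                  terrainWidth * terrainHeight * sizeof(unsigned char),
                  cudaMemcpyHostToDevice);
cudaBindTextureToArray(heightMap8, heightArray8, desc8);

// In the kernel: the sample is in [0,1], so multiply by the terrain's height scale:
// float h = tex2D(heightMap8, sample_x, sample_z) * heightScale;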

If you really want to optimize it to the max, it might be faster to implement this the classic way in Direct3D: render a fullscreen quad to a float texture and do the work in the pixel shader. That would also eliminate the copy.
(Background: there was a paper from the Metro 2033 developers where they compared DirectCompute against pixel shaders for their heat diffusion solver. For their problem, DirectCompute was slower unless shared memory was used to save texture fetches.)

I fixed that; the new code looks like this:

__global__ void NewShadowKernel(RayMarchInitInfo* input, float* shadowOutput, size_t shadowMapPitch, int totalRayCount)
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if(tid >= totalRayCount)
        return;

    int i;

    float Pi;
    float Ri;
    float dy;
    float data;
    float* row;

    float Pi_1        = input[tid].P0;
    float firstDeltaY = input[tid].firstDeltaY;
    float sample_x    = input[tid].firstSamplingPointX;
    float sample_z    = input[tid].firstSamplingPointZ;
    int   colStart    = input[tid].columnStart;
    int   colEnd      = input[tid].columnEnd;

    float dx     = const_dx;
    float deltaY = const_deltaY;
    float dz     = const_dz;

    //Init algorithm
    Pi = tex2D(heightMap, sample_x, sample_z);
    Ri = Pi_1;

    //Main body of the algorithm
    for(i = 0; i <= colEnd; i++)
    {
        row = (float*)((char*)shadowOutput + i * shadowMapPitch);

        //If the column has not started yet, write 0s.
        if(i < colStart)
            data = 0.0f;
        else //i >= colStart
        {
            if(i == colStart)
                dy = firstDeltaY;
            else
                dy = deltaY;

            if(Pi_1 + dy > Pi) //Backfacing
            {
                //The previous height map sample can cast shadow on the current sample.
                //Update the current shadow height if the previous sample is higher than it.
                Ri = fmaxf(Ri, Pi_1);
            }

            Ri = Ri + dy;
            data = fmaxf(0.0f, Ri - Pi);

            //Get the next sample.
            Pi_1 = Pi;

            __syncthreads();
            row[tid] = data;

            sample_x = sample_x + dx;
            sample_z = sample_z + dz;
            Pi = tex2D(heightMap, sample_x, sample_z);
        }
    }
}

Still, I am seeing the same 8-9 GB/s global memory write throughput in the Visual Profiler. The texture read throughput is actually good; I am not at my machine at the moment, but it was around 20-25 GB/s if I remember correctly, and the texture cache hit rate was around 80%. In this new code I synchronize all threads before writing to global memory to guarantee a coalesced write, but I did not see any performance increase. I am still in dire need of optimizing this kernel, because it is the core part of my master's thesis project. Could I optimize it by using shared memory, by any chance? By the way, the kernel runs alongside Direct3D9, which runs a terrain engine I wrote. Could the reason be that Direct3D and CUDA are blocking each other on video memory writes? Any advice is welcome.

Thanks in advance.

Suggestion:

dy = firstDeltaY;

//Main body of the algorithm
for(i = 0; i <= colEnd; i++)
{
    row = (float*)((char*)shadowOutput + i * shadowMapPitch);

    //If the column has not started yet, write 0s.
    float data = 0.0f;

    if(i >= colStart)
    {
        if(Pi_1 + dy > Pi) //Backfacing
        {
            //The previous height map sample can cast shadow on the current sample.
            //Update the current shadow height if the previous sample is higher than it.
            Ri = fmaxf(Ri, Pi_1);
        }

        Ri = Ri + dy;
        dy = deltaY;

        data = fmaxf(data, Ri - Pi); // data always 0 here

        //Get the next sample.
        Pi_1 = Pi;
        sample_x = sample_x + dx;
        sample_z = sample_z + dz;
        Pi = tex2D(heightMap, sample_x, sample_z);
    }

    row[tid] = data;
}