Advice on porting Monte Carlo code from the Cell Broadband Engine processor to CUDA?

I am planning to port some Monte Carlo simulation code from the Cell processor architecture to CUDA. So far I’ve just been reading about CUDA and am trying to devise a migration strategy.

Overview of the Cell processor:

  • The Cell processor has 8 so-called "SPU" cores. Each SPU has 128-bit-wide SIMD registers (i.e. an SPU can perform 4 single-precision floating-point operations in a single instruction). This is roughly analogous to the 32-thread-wide "warp" in CUDA, except that CUDA automatically takes care of the "SIMDization" of scalar operations.

  • The SPUs do not have caches. Instead, each SPU has a 256KB "local store," and all data and code being used at any given time must fit in this 256KB of on-chip memory. The SPU local store is loosely similar to the "shared memory" and registers in CUDA.

  • Finally, each SPU is equipped with a "memory flow controller" that allows the programmer to perform Direct Memory Access (DMA) transfers to and from system RAM (up to 32GB of RAM). The programmer must manually move data between the SPU local store and system RAM, and is responsible for manually implementing double buffering to hide memory latency (overlapping memory access with computation). There is really no equivalent of this in CUDA, since memory latency hiding is handled automatically by the GPU hardware. Copying between "host memory" and "device memory" on the CUDA architecture is the closest equivalent to performing DMA on the Cell architecture (see the sketch just after this list).
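To make sure I have the right mental model, here is a minimal sketch of how I picture those explicit copies in CUDA, using the simplified SimulationState struct from my pseudocode below (h_state and d_state are just placeholder names):

#include <cuda_runtime.h>

struct SimulationState { float time; float X, Y, Z; };

int main()
{
	SimulationState h_state = {0.0f, 0.0f, 0.0f, 0.0f};  // host copy (system RAM)
	SimulationState *d_state = NULL;                      // device copy (GPU RAM)

	cudaMalloc((void**)&d_state, sizeof(SimulationState));

	// "DMA in": copy the initial state to device memory before launching kernels
	cudaMemcpy(d_state, &h_state, sizeof(SimulationState), cudaMemcpyHostToDevice);

	// ... launch kernel(s) that read and update d_state ...

	// "DMA out": copy the final state back to system RAM after the simulation
	cudaMemcpy(&h_state, d_state, sizeof(SimulationState), cudaMemcpyDeviceToHost);

	cudaFree(d_state);
	return 0;
}

(This is only how I currently understand the host/device transfer model; please correct me if I've got it wrong.)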

Now, to describe my simulation code. It basically looks something like this:

struct SimulationState
{
	float time;
	float X, Y, Z;
};

SimulationState sim_state;

for each scenario:
{
	// initialize sim_state
	sim_state.time = 0.0;

	for each time step:
	{
		// advance sim_state through time
		sim_state.time += 0.01;
	}
}

In reality my "simulation state" struct is much more complex and contains about 1 to 5KB of data. On the Cell processor, sim_state always lives in the local store, so all data stays completely on-chip for the entire simulation; DMA transfers only occur at the beginning of the simulation (to initialize sim_state) and at the end of the simulation (to send the results back).
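To make the comparison concrete, this is roughly the CUDA kernel I have in mind for this fast path, with one thread per scenario and sim_state kept as a per-thread local variable (only a sketch; simulate, results, num_scenarios and num_steps are placeholder names, and the real update step is of course more complex than incrementing time):

__global__ void simulate(SimulationState *results, int num_scenarios, int num_steps)
{
	int scenario = blockIdx.x * blockDim.x + threadIdx.x;
	if (scenario >= num_scenarios)
		return;

	SimulationState sim_state;            // per-thread state, ideally kept on-chip
	sim_state.time = 0.0f;
	sim_state.X = sim_state.Y = sim_state.Z = 0.0f;

	for (int step = 0; step < num_steps; ++step)
	{
		// advance sim_state through time
		sim_state.time += 0.01f;
	}

	results[scenario] = sim_state;        // single write to device memory at the end
}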

As the sketch above suggests, I think I have a pretty good idea of how I could port this code to CUDA. It shouldn't be too difficult with the help of the SDK examples. However, a major complication arises due to another feature of my Cell code. Normally I just want the simulation to run as fast as possible, so I avoid DMA memory access as much as possible and prefer to do all computation on-chip. BUT in some cases it is absolutely critical to save the entire simulation state after each time step. In this "verbose mode," the Cell code looks as follows:

for each scenario:
{
	// initialize sim_state
	sim_state.time = 0.0;

	for each time step:
	{
		// advance sim_state through time
		sim_state.time += 0.01;

		copy sim_state to system RAM using DMA commands
	}
}

With the Cell processor this is easy to do. If I'm in verbose mode, I just issue a DMA command to copy the sim_state struct to RAM, and I can double buffer the DMA commands to hide the memory latency as well as possible. The total amount of data that eventually ends up being copied to RAM is about 2-5GB.
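For reference, the CUDA version of verbose mode that I'm currently picturing would have each thread write its state into a large device-memory array after every step, and the host would copy that array back afterwards (again just a sketch; simulate_verbose and snapshots are placeholder names, and the step-major layout per scenario is only one possible choice):

__global__ void simulate_verbose(SimulationState *snapshots, int num_scenarios, int num_steps)
{
	int scenario = blockIdx.x * blockDim.x + threadIdx.x;
	if (scenario >= num_scenarios)
		return;

	SimulationState sim_state;
	sim_state.time = 0.0f;
	sim_state.X = sim_state.Y = sim_state.Z = 0.0f;

	for (int step = 0; step < num_steps; ++step)
	{
		// advance sim_state through time
		sim_state.time += 0.01f;

		// save the full state to device memory; the host copies it to system RAM later
		snapshots[scenario * num_steps + step] = sim_state;
	}
}

Since 2-5GB of snapshots may not fit in device memory all at once, I assume I would have to process the scenarios in batches and copy each batch's snapshots back to host memory (perhaps with cudaMemcpyAsync and streams to overlap the copies with computation) before launching the next batch, but I don't know if that is the right approach.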

My question is: what is the best way to do this in CUDA, keeping in mind that performance is a very critical priority? It is also important that I be able to store the entire simulation state in a single struct so that it can be easily analyzed in "verbose mode." At the same time, I want to maximize performance on the CUDA hardware, so I want sim_state to be stored either in the CUDA thread's registers or in shared memory, so that there is no device memory access as the simulation moves through time.

I understand that if sim_state is too large, CUDA will automatically start "register spilling," which forces data to be temporarily moved out to device memory. So my idea is to just keep the same big simulation state struct that I already have (even if it causes some register spilling) and copy it to a global memory array as needed in "verbose mode." But I'm worried that either I won't be able to launch enough threads because I max out the registers and shared memory per thread block, and/or my performance will be killed by the automatic register spilling. Will I be able to have at least 64 threads per thread block if each thread has a 1-5KB sim_state local variable? Will the hardware automatically hide the register-spilling latency, or is there some way for me to hide it (since I obviously do not need all 1-5KB of data in every instruction)?
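In case it matters for your answer: my plan for measuring this is to compile with the ptxas verbose flag and look at the reported per-thread register and local memory (lmem) usage, which I believe would show how much of sim_state gets spilled to device memory (simulate.cu is just a placeholder file name):

	nvcc -O3 --ptxas-options=-v -c simulate.cu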

Any advice is much appreciated.