Emulator has warp size of 1 + other properties issues

I brought this up here way back in the 0.8 days, and the manual section has still not been updated with a gotcha for this problem. I believe it is important to have it in the documentation. SDK projects that use SIMD within a warp wrap a __syncthreads() between #ifdef __DEVICE_EMULATION__ and #endif, and if it is required within a divergent section then one sees (from dxtc):


        } __debugsync(); if (idx < 16) {


However, one is still snookered if code relying upon SIMD within a warp sits inside a loop, inside a divergent conditional. I have exactly that in production code: it works fine on hardware and is unemulatable on the Nvidia emulator. So the Nvidia emulator should set warpSize to 1 in the device properties; all traditional CPUs have a warp size of 1. It is possible to do the conversion for a different warp size, but doing it transparently is non-trivial.

I have implemented a syncWarp() for use in my own emulator that does allow proper emulation of my code, so it is possible to specify and write a sync that works in divergent code in a sensible way, without having to be told which threads to wait on. In fact this is the way that convergence needs to work within a warp, and the two problems are very similar.

I believe the main __syncthreads() is mis-specified: its real function is to hand-convert SPMD (as Nvidia like to call it) to logical SIMD across a block. A __syncthreads() that does not require all threads to visit it is much easier to use. The programming model is SIMD, and the implementation forces us to manually insert syncs to enforce that model. I believe it can be done automatically and will be sometime in the future, so programming will get easier. It is exactly the same problem as getting the emulator to correctly emulate a warp size that is not native to the hardware it is running on.
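To illustrate the idea, here is a minimal sketch of a divergence-tolerant barrier of the syncWarp() kind, written as host C with pthreads. The names warp_t, syncWarp() and warp_leave() are mine, not anything from the CUDA emulator: the point is only that a barrier which lets threads deregister on leaving a divergent region cannot deadlock the threads still inside it, which is what a __syncthreads() that "does not require all threads to visit" would need.

```c
#include <pthread.h>

/* Hypothetical sketch of a syncWarp()-style barrier for a CPU emulator.
   Unlike __syncthreads(), emulated threads that have left the divergent
   region deregister themselves, so the remaining threads are not stuck
   waiting for arrivals that will never come. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int      active;   /* threads still participating in the warp     */
    int      arrived;  /* threads waiting at the current sync point   */
    unsigned gen;      /* generation counter, distinguishes syncs     */
} warp_t;

void warp_init(warp_t *w, int nthreads) {
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->cv, NULL);
    w->active  = nthreads;
    w->arrived = 0;
    w->gen     = 0;
}

/* Wait for every thread still active in the warp. */
void syncWarp(warp_t *w) {
    pthread_mutex_lock(&w->lock);
    unsigned my_gen = w->gen;
    if (++w->arrived == w->active) {
        w->arrived = 0;             /* last arrival releases everyone */
        w->gen++;
        pthread_cond_broadcast(&w->cv);
    } else {
        while (my_gen == w->gen)
            pthread_cond_wait(&w->cv, &w->lock);
    }
    pthread_mutex_unlock(&w->lock);
}

/* Called when a thread exits the divergent region (or terminates):
   the barrier stops counting it, and if it was the last straggler
   the threads already waiting are released. */
void warp_leave(warp_t *w) {
    pthread_mutex_lock(&w->lock);
    if (--w->active > 0 && w->arrived == w->active) {
        w->arrived = 0;
        w->gen++;
        pthread_cond_broadcast(&w->cv);
    }
    pthread_mutex_unlock(&w->lock);
}
```

The design choice is the active counter: syncWarp() compares arrivals against the threads currently in the region rather than against the whole block, so divergence just shrinks the barrier instead of breaking it.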

Since the emulator is a fundamentally different device from any hardware, it should provide its own device name in the properties, as it does when there is no hardware installed. While it should perhaps do as much as it can to emulate (presumably) device 0, it cannot do everything. If you want it to look similar, then pull back the reported amount of available global memory to reflect normal operating conditions: where my card is being used for the display as well, 70 MB of device memory disappears. Perhaps one should be able to select among all the supported devices within the emulator.

On my system (FC7 x86_64) the clock() function seems to advance by 10,000 per system clock interrupt, and the interrupt runs at 100 Hz, so the clockRate property should be set to 1000 KHz.
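The granularity is easy to check from host code. A small sketch (the helper name clock_step is mine); POSIX fixes CLOCKS_PER_SEC at 1,000,000, so a step of 10,000 corresponds to the 10 ms period of a 100 Hz tick. Note that newer C libraries report CPU time at much finer granularity, so the 10,000 step may not reproduce everywhere:

```c
#include <time.h>

/* Spin until clock() reports a new value and return the step size.
   The busy-wait itself accrues CPU time, which is what clock()
   measures, so the loop is guaranteed to terminate.  On a kernel
   with a 100 Hz tick that steps clock() in units of 10,000, the
   step divided by CLOCKS_PER_SEC gives the 10 ms tick period. */
long clock_step(void) {
    clock_t start = clock(), now;
    while ((now = clock()) == start)
        ;                       /* busy-wait for the next tick */
    return (long)(now - start);
}
```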

I have mentioned before that the properties should include the number of multiprocessors and the maximum number of blocks per MP, and since there was no acknowledgment from Nvidia that anything was going to be done, I repeat it. It takes quite a bit of code to measure these parameters for a given device, and they are necessary to configure your app correctly for a given device to get maximum performance.
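If such fields existed, sizing a launch would reduce to a couple of lines of arithmetic instead of a measurement pass. A hypothetical sketch: neither field name below exists in the current cudaDeviceProp; both are assumptions for illustration only.

```c
/* Hypothetical device properties the post asks Nvidia to expose.
   Both field names are invented for this sketch; they are not part
   of the CUDA runtime's cudaDeviceProp. */
typedef struct {
    int multiProcessorCount;  /* assumed: number of MPs on the device */
    int maxBlocksPerMP;       /* assumed: resident block limit per MP */
} emu_props_t;

/* Smallest grid that can keep every multiprocessor fully occupied:
   one full "wave" of resident blocks. */
int full_wave_blocks(const emu_props_t *p) {
    return p->multiProcessorCount * p->maxBlocksPerMP;
}
```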

Even though I have a dual core, I don't seem to get more than one thread running at a time (a maximum of 100% utilisation of one core??) when running the Nvidia emulator, so the emulator should probably report a #MPs of 1.


ed: fix broken link