CUDA vs DX execution times: DX GPGPU code --> CUDA = slower

I have a functioning GPGPU project which I implemented using DirectX. It’s a streaming app that takes batches of 64 float4 buffers, each 256x256, and processes them. The memory access pattern and computation are somewhat complex, although the details are unimportant to this problem.

The implementation under DirectX fully processes a batch of GPU-resident buffers (64MB of data) in 8.84 ms (all DX overhead and GPU processing, time averaged over 100 batches).

My CUDA 1.1 implementation running on an 8800GTX does this:
for each buffer:
  3 calls to cudaBindTextureToArray()
  1 call to the kernel with grid(4,1,1), block(64,1,1) (256 threads per buffer); the kernel just returns (no memory access, no computation)

Buffers are allocated with cudaMallocArray();

So a total of 3*64 calls to cudaBindTextureToArray() and 64 calls to run 256 threads that do nothing.

This takes 5.974 ms per batch, which is all overhead and includes no data fetch, compute or data output. Timing is done with cutCreate/Start/StopTimer() with a cudaThreadSynchronize() before cutStopTimer() and averaged over 100 batches.

Removing the call to the kernel drops the batch time to 0.378 ms.

Timing for both DX and CUDA was taken on the same machine. The DX project is built for debug, the CUDA project is built for release.

So it appears that the CUDA overhead alone takes 68% of the time the DX 9 code needs to do the entire processing.

Is there something I’m missing?

More data:
The original GPU kernel call had these 4 parameters:
float4 * (pointer to device data)
int
int
int

By removing these parameters, the time per batch is now 3.322 ms, or a drop of 44%.

I have to believe that I’m doing something wrong - the overhead can’t be that much worse than DirectX.

Make sure you call cudaThreadSynchronize before starting the timer as well. That way you don’t time cuda operations still being executed. Better yet, use cuda events to do the timing (see the programming guide or one of the new SDK samples, asyncAPI and simpleStreams both use event api for timing).
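
For reference, here is a minimal event-based timing sketch (the kernel name, grid and block are placeholders for your own; this would replace the cutStartTimer/cutStopTimer section):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // queued in stream 0, so only GPU work is timed
for (int b = 0; b < 100; b++) {
    MyKernel<<< grid, block >>>( /* ... */ );   // placeholder launch
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                // block until the stop event has completed

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);
printf("Average per batch: %f ms\n", elapsedMs / 100.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);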

Also, when you launch a grid with just 4 blocks in it, you waste the majority of the GPU’s performance. Depending on the GPU, there are up to 16 multiprocessors, each capable of executing a number of blocks. Also, each multiprocessor can accommodate up to 768 threads concurrently, while your configuration uses 4x64 = 256 threads.

Paulius

Paulius,

Thank you for your reply. I put a cudaThreadSynchronize() call just before starting the timer and it had no effect.

Right now I’m not concerned about the % utilization. What I am concerned about is that by the time the CUDA kernel begins execution, the DirectX implementation is already half done.

Scott

I now have the processing under CUDA (21.63 ms) within spitting distance of the processing in DirectX (~18 ms). However, I had to move to much larger data sets in order to amortize the CUDA overhead over more GPU processing. Processing smaller data sets will still be much slower under CUDA than DirectX.

In order to achieve a similar compute time I must use 2D textures in CUDA (the texture cache is the key). The problem is that since I can’t write to a 2D texture with CUDA, I have to write to a linear buffer and copy the data into a 2D texture. For some reason, the device-to-device copy is running at 3.9 GB/sec and takes longer than the processing.
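
For reference, the copy in question is essentially the following (d_linear and dstArray are placeholder names for the kernel’s output buffer and the cudaArray behind the 2D texture):

// d_linear: float4* output buffer the kernel writes to (cudaMalloc'd)
// dstArray: cudaArray* allocated with cudaMallocArray() and bound to a 2D texture
cudaMemcpyToArray( dstArray, 0, 0, d_linear,
                   256 * 256 * sizeof(float4),
                   cudaMemcpyDeviceToDevice );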

More details are here: http://forums.nvidia.com/index.php?showtopic=56025&st=0&p=307964&#entry307964

I’m back to the CUDA code running 1/3 the speed of the DirectX code.

What sort of block and grid dimensions are you using now? I guess I never read your original post that carefully (sorry) but running only 4 blocks isn’t going to get you much performance since the device is capable of running ~100 blocks concurrently.

Is there any way you can put all 64 buffers on the card at once and handle them all with one big kernel call? If I had to guess, I’d say that DirectX is internally batching up all your render calls and effectively doing that.

I have 6 different implementations now. The 2 fastest are these:

  1. Brain-dead, like I would expect DirectX to operate: 2D textures, read all the data into registers (very few, doing math on the fly), let the texture cache do its thing (the big win), write the results out. All writes are coherent. Grid size is 64x64, block size is 16x16, and occupancy is 0.667 (a rough sketch of this variant appears after this list). Other than the overhead of the CUDA calls (killer for small data sets), I would go with this implementation if I could either:
    a ) write directly into a 2D texture
    b ) copy from a 1D array into a 2D texture @ close to peak theoretical

I can do neither.

  2. 1D textures (per your bandwidth suggestion), maximum usage of shared memory. I’ve played around a lot with the grid/block size, trading off block/grid values vs. PDC usage. The fastest configuration is also the optimal use of the PDC (minimal off-chip memory accesses) but requires just over 10k bytes per 64 threads, so occupancy is 0.083. Right now this solution only does the data movement, not the correct math. It does do enough math to ensure the data reads happen (non-PDC targets get used), so the compiler/assembler doesn’t remove the PDC-targeted reads.

All reads and writes are coherent. I don’t like the 16 float4 pad that’s necessary - it’s sizeable for small data sets.

There are a large number of warp serializations that I can’t see how to avoid. Plus I don’t really understand how the number the profiler reports correlates to actual hardware behavior - I never got an explicit answer to this post: http://forums.nvidia.com/index.php?showtopic=52697&view=findpost&p=303684
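
For reference, here is a rough sketch of what the kernel for the 2D-texture variant (implementation 1 above) looks like; the names, the single texture, and the trivial math are illustrative only - the real kernel reads from several textures and does about 40 lines of math:

texture<float4, 2, cudaReadModeElementType> SrcTex;   // bound to a cudaArray on the host side

__global__ void Process2D( float4* Out, int Width )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // 64x64 grid of 16x16 blocks
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // unnormalized coordinates + point filtering: the texture behaves like a
    // cached, read-only 2D array of float4
    float4 v = tex2D( SrcTex, (float)x, (float)y );

    // ... the real per-element math goes here ...

    Out[y * Width + x] = v;                           // coherent (coalesced) float4 store
}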

The app is a streaming one. The CUDA overhead (from the first post, this thread) is an insignificant % for the large data sets, but really bad for the small ones (doubling the latency).

What really kills me is that I wrote the app in DirectX in 2 days, it was easy, and it runs like the wind on both small and large data sets. And it’ll run on practically any currently-selling GPU by any vendor. The amount of time I’ve spent trying to match its performance using CUDA is embarrassing.

I’m assuming you didn’t learn DX the same week you learned CUDA :-)

There are a number of ways you could have gone astray, it can be hard for others to armchair quarterback without seeing the code.

Cheers,

John

I’m assuming you actually meant to say coalesced?

How about posting code? Your description wasn’t detailed enough for me to know for sure exactly what your code is doing, and given the questions that remained, I didn’t want to try to guess. At one point you mention working with float4s, but you don’t say whether these are global memory reads you’re doing or something else. float4s are going to be slower than float or float2 for global memory operations. Since I wasn’t sure what your code is doing based on your textual description, I didn’t mention this before, but it is an obvious potential pitfall to watch out for. For problems like yours, you’ll probably get more feedback if you post meaningful code snippets that others can toy with.

Cheers,

John

Thank you for the replies John.

No, I took different weeks to learn DirectX and CUDA ;)

Coalesced / coherent - I believe they are synonymous, at least in this context. From CUDA_Profiler_1.1.txt in the CUDA toolkit doc folder:

" gld_incoherent

gld_coherent      

gst_incoherent    

gst_coherent      

--------------

These options tell the profiler to record information about whether global

memory loads/stores are coalesced (coherent) or non-coalesced (incoherent). "

I agree whole-heartedly that there are many pitfalls to watch out for: coalesced / coherent reads/writes, paying attention to memory access patterns, the different bandwidths achievable for different data types (Mr Anderson’s testing has been very helpful in this regard), warp serialization due to PDC access patterns, grid/block size vs. register & PDC usage and their effects on occupancy and a host of others.

I have many different implementations of the same algorithm - the permutations vary on float4 vs float memory access, linear, 1D and 2D memory reads, and using the shared memory (PDC) or just registers. Because of the memory access pattern, float4 reads are a clear win - 1/4 of the memory references vs. float (experimentally - tried & timed).

The main performance-limiting issues I have observed in CUDA are either stated in the programming manual or have been verified by others here:

  1. Inability to write to a cudaArray (2D texture)

  2. < 5 GB/sec copy from linear memory to cudaArray

  3. long thread start-up latency

I have not been able to optimize out the warp serialization in the (float4 + uses PDC) case with more than 4 threads/warp because there are only 16 float-wide banks in the shared memory (PDC). However, the overall fastest version is 2D textures without using the PDC, unless you factor in the 5 GB/sec device-to-device memory copy that must follow, which takes significantly longer than the processing.
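
For what it’s worth, the usual workaround sketch for the float4 bank-conflict problem is to store the components as a structure of arrays in the PDC, so consecutive threads in a half-warp hit 16 distinct banks; whether that fits my access pattern is another matter. Everything below (TILE, the names, the trivial math) is illustrative only:

#define TILE 64    // threads per block in this sketch (not my real configuration)

__global__ void SoAExample( const float4* In, float4* Out )
{
    // One float4 per thread in shared memory gives 4-way bank conflicts
    // (each float4 spans 4 of the 16 32-bit banks). Splitting the components
    // into four float arrays lets each thread in a half-warp hit its own bank.
    __shared__ float sx[TILE], sy[TILE], sz[TILE], sw[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 v = In[i];                 // coalesced float4 load from global memory

    sx[threadIdx.x] = v.x;            // conflict-free: bank = threadIdx.x % 16
    sy[threadIdx.x] = v.y;
    sz[threadIdx.x] = v.z;
    sw[threadIdx.x] = v.w;
    __syncthreads();

    // ... math that reuses the shared copies goes here ...

    Out[i] = make_float4( sx[threadIdx.x], sy[threadIdx.x],
                          sz[threadIdx.x], sw[threadIdx.x] );
}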

I’ll post meaningful code by Monday - I’m flat out in meeting all day today.

The Code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <windows.h>

// includes, project
#include <C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc\cutil.h>

#define ITERATIONS 100

// ZZZZ - temp - need to figure out a way to render into a texture
float4* d_data = NULL;

// declare texture reference for 2D float texture
texture<float4, 2, cudaReadModeElementType> Tex1;
texture<float4, 2, cudaReadModeElementType> Tex2;
texture<float4, 2, cudaReadModeElementType> Tex3;

__global__
void ProcessTexs( float4* ThisTex, int VolResX, int VolResY, int Iters )
{
int MyX = blockIdx.x * blockDim.x + threadIdx.x;
int MyY = blockIdx.y * blockDim.y + threadIdx.y;

// no data movement
// no computation

}

int
main( int argc, char** argv) {
int i, Iters;

CUT_DEVICE_INIT();

// the textures are all float4
cudaChannelFormatDesc ChannelDesc = cudaCreateChannelDesc<float4>();

// set texture parameters for all the textures
Tex1.addressMode[0] = cudaAddressModeClamp;
Tex1.addressMode[1] = cudaAddressModeClamp;
Tex1.filterMode = cudaFilterModePoint;
Tex1.normalized = false;    // do not use normalized texture coordinates
Tex2.addressMode[0] = cudaAddressModeClamp;
Tex2.addressMode[1] = cudaAddressModeClamp;
Tex2.filterMode = cudaFilterModePoint;
Tex2.normalized = false;    // do not use normalized texture coordinates
Tex3.addressMode[0] = cudaAddressModeClamp;
Tex3.addressMode[1] = cudaAddressModeClamp;
Tex3.filterMode = cudaFilterModePoint;
Tex3.normalized = false;    // do not use normalized texture coordinates


// allocate memory on the board for the textures to process
// a bunch of pointers to CUDA arrays
cudaArray **SrcTextures = (cudaArray **)malloc(66 * sizeof(cudaArray *));

// allocate each for the CUDA arrays
for (i=0; i<66; i++) {
	    CUDA_SAFE_CALL( cudaMallocArray( &SrcTextures[i], &ChannelDesc, 256, 256 )); 
}

// allocate device memory for result
CUDA_SAFE_CALL( cudaMalloc( (void**) &d_data, 256*256*sizeof(float4)));

// make sure Windows isn't time-multiplexing the GPU during this run
Sleep(500);

dim3 DimGrid(32, 32, 1);
dim3 DimBlock(8, 8, 1);

// make sure there isn't anything else going on
CUDA_SAFE_CALL( cudaThreadSynchronize() );

// time the iterations
unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer));
CUT_SAFE_CALL( cutStartTimer( timer));


for (Iters = 0; Iters < ITERATIONS; Iters++) {
	for (i=0; i<64; i++) {
		CUDA_SAFE_CALL( cudaBindTextureToArray( Tex1, SrcTextures[i],   ChannelDesc));
		CUDA_SAFE_CALL( cudaBindTextureToArray( Tex2, SrcTextures[i+1], ChannelDesc));
		CUDA_SAFE_CALL( cudaBindTextureToArray( Tex3, SrcTextures[i+2], ChannelDesc));	


	    ProcessTexs<<< DimGrid, DimBlock, 0 >>>(d_data, 256, 256, Iters);

		// copy the results into the texture for the next iteration
		CUDA_SAFE_CALL( cudaMemcpyToArray( SrcTextures[i+2], 
										   0,0,
										   d_data,
										   256*256*sizeof(float4),
										   cudaMemcpyDeviceToDevice) );
	} // textures
} // iterations

CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutStopTimer( timer));
float TimeInMs = cutGetTimerValue( timer);
printf( "Processing time per batch: %f (ms)\n", TimeInMs / 100.0f);
CUT_SAFE_CALL( cutDeleteTimer( timer));

cudaFree(d_data);
for (i=0; i<66; i++) {
	    cudaFreeArray(SrcTextures[i]); 
}


CUT_EXIT(argc, argv);

}

This most closely mimics the DirectX implementation, but the CUDA kernel is not doing any of the data movement or computation that the DirectX implementation performs.

The DirectX implementation does all the data movement (complex reads, simple writes, about 40 lines of compute) for this data set in 8.84 ms on the same hardware.

The hardware:
3.4 GHz Pentium D, 4GB, 8800 GTX

The Performance:
Binds + Process = 7.02 ms
79% of the time DX takes to solve the entire problem is just CUDA overhead
But with CUDA I can’t render into a texture, so I have to copy the results back into the texture:
Binds + Process + Copy: 24.3 ms
2.75x longer than the DX implementation, and I still haven’t moved any data or performed any computation in the kernel.

Grid and block size have practically no effect on how long a call to Process takes.
Copying 64 MB of data in (24.3 - 7.02) ms works out to 3.7 GB/sec, and this is a device-to-device copy.

I played with your code briefly. It’s clear that you’re losing the majority of your speed due to the cudaMemcpyToArray() call. On one of our test boxes (GeForce 8800GTX, Opteron 148, RHEL4 Linux) I get 18.7 ms for the full run (vs your 24.3?), and 3.23 ms for the Binds+Process only.

You mention that your memory access pattern is complex, so I presume that this is why you’re using a texture and not simply doing global memory reads/writes. From the structure of your kernel loop, it seems to me that you might potentially get much higher performance by doing some of your loops within the CUDA kernel itself, and by doing your memory operations within shared memory rather than texture refs, assuming your memory access pattern could be made semi-friendly to the hardware.

Since you’re not using the texture hardware to do interpolation and you’re not using normalized textures, it seems like the texture is basically just being accessed as a read-only array, in which case you might actually get better performance from doing straight global memory ops, and/or using shared memory to speed repetitive data references, assuming you can get the access pattern right. If so, the code might run a lot faster than you’ve ever seen it run before, even vs. the DX version. If not, then the question is what can be done to make your texture copy operation much faster.
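
As a sketch of the "one big kernel call" idea (this assumes your 64 buffers can live in one contiguous device allocation, which may or may not match your layout; names and the trivial math are placeholders):

// Assumes all 64 input buffers sit in one contiguous allocation of
// 64 * 256 * 256 float4s; blockIdx.y picks the buffer, blockIdx.x/threadIdx.x
// pick the element within it.
__global__ void ProcessAll( const float4* In, float4* Out, int ElemsPerBuf )
{
    int buf  = blockIdx.y;
    int elem = blockIdx.x * blockDim.x + threadIdx.x;
    if (elem >= ElemsPerBuf) return;

    float4 v = In[buf * ElemsPerBuf + elem];    // coalesced global read
    // ... per-element math goes here ...
    Out[buf * ElemsPerBuf + elem] = v;
}

// Host side, one launch instead of 64:
//   dim3 block(256);
//   dim3 grid((256*256 + 255) / 256, 64);    // 256 x 64 blocks
//   ProcessAll<<< grid, block >>>(d_in, d_out, 256*256);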

Cheers,

John Stone

John,

Thank you for taking the time to try my code and reply. 2.17x faster for bind + process on your hardware running Linux is significant.

The memory access pattern is such that the biggest win is using the texture cache. This is the reason why the DX version runs so well.

You wrote: "you might potentially get much higher performance by doing some of your loops within the CUDA kernel itself, and by doing your memory operations within shared memory rather than texture refs, assuming your memory access pattern could be made semi-friendly to the hardware."

I have done both. There are 2 problems with this approach:

  1. Because of the number of memory touches per output point, keeping external bandwidth usage minimal means the number of warps per MP is very low (2). This means occupancy is minuscule, although there are still 32 warps running at a time.
  2. There is a large number of warp serializations in this approach, since the data must be float4 and there are only 16 banks in the shared memory.

There must be a bug in CUDA for the copy - it should be nearly 20x faster. However, even if the copy were infinitely fast, it wouldn’t make up for the fact that the CUDA overhead is 79% of the entire DX execution, although with your hardware and Linux vs. XP, it’s ‘only’ 36.4%.

cudaMemcpyToArray is being looked into.

Have you tried binding a CUDA texture to linear memory? Since you’re not using filtering, that ought to work. You’d just have to convert your current 2D texture coordinates to 1D for use with tex1Dfetch(). It’ll still get you the caching benefit. And, since a linear texture is simply bound to a GPU address, you can write output to some global memory, then the CPU can bind that output’s address to a texture and launch the next kernel/pass to use it as textured input. There’s no copying associated with a texture bind, so you should see a perf improvement. Please post whether this works or not.
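
Something along these lines (a sketch only; names and sizes are placeholders, and the bind call runs on the host between kernel launches):

texture<float4, 1, cudaReadModeElementType> LinTex;   // texture reference for linear memory

__global__ void ProcessLinear( float4* Out, int Width )
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // flatten the 2D coordinate into a 1D offset for the linear texture
    float4 v = tex1Dfetch( LinTex, y * Width + x );

    // ... math goes here ...
    Out[y * Width + x] = v;
}

// Host side: binding is cheap, so the previous pass's output buffer becomes
// the next pass's textured input with no device-to-device copy:
//   cudaBindTexture( 0, LinTex, d_prevOutput, 256 * 256 * sizeof(float4) );
//   ProcessLinear<<< grid, block >>>( d_nextOutput, 256 );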

Paulius

2D Textures: 21.57 ms
1D Textures: 31.7 ms

Is the data arranged in memory the same way for 1D and 2D textures?
Is the cache set up any differently for 1D vs. 2D?

No. What is your access pattern?