Copy performance on kernel

This topic is related to this one:

When executing this code:

__global__ void initializeMem(int* texOut, int width) {
	// Global pixel coordinates of this thread
	int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;

	// Row-major index into the image
	int index = __mul24(y, width) + x;

	texOut[index] = 0;
}

Image size: 3840x3840 pixels, one int per pixel

Kernel configuration: threads (32, 12, 1) x blocks (120, 320, 1)

I’m getting these results:

Time: 1.695416ms

Rate: 8697.33 Mpixels/sec

Bandwidth: 34.789329 GB/s

Shouldn’t the bandwidth be about 70GB/s?

Note: these results were obtained with CUDA 1.1 on a dedicated 8800 GTX.

See this thread for an extensive benchmark of global memory reads and writes of different sizes: …ndpost&p=290441

My write-only tests in the post above reach 50 GiB/s. I’m not sure why yours is so slow; the only real difference is that I’m using 1D grids and you are using 2D. Also, to compare numbers, make sure you are calculating GiB/s, that is, bytes per second divided by 1024^3 bytes/GiB, and NOT just dividing by 1e9 bytes/GB.
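To make the unit difference concrete, here is a minimal host-side sketch of both calculations. The function names `bandwidthGB` and `bandwidthGiB` are made up for illustration and do not come from the benchmark thread.

```cuda
#include <cstddef>

// Decimal gigabytes per second: 1 GB = 1e9 bytes.
double bandwidthGB(std::size_t bytes, double milliseconds) {
    double seconds = milliseconds * 1e-3;
    return bytes / seconds / 1e9;
}

// Binary gibibytes per second: 1 GiB = 1024^3 bytes.
double bandwidthGiB(std::size_t bytes, double milliseconds) {
    double seconds = milliseconds * 1e-3;
    return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

With the numbers from the first post (3840 * 3840 * 4 bytes written in 1.695416 ms), this gives roughly 34.79 GB/s but only about 32.40 GiB/s, so the 50 GiB/s figure should be compared against the second value.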

I used the timing functions from CUT instead of cudaEventRecord etc.; that was my error.
2D grids also seem to degrade performance: I saw this after adapting your benchmarks to use 2D grids.
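For anyone comparing timings, a minimal sketch of event-based timing for the kernel from the first post might look like the following. The helper name `timeInitializeMem` and the warm-up launch are assumptions added for illustration, not part of the original benchmark, and error checking is mostly omitted.

```cuda
#include <cuda_runtime.h>

// Same kernel as in the first post.
__global__ void initializeMem(int* texOut, int width) {
    int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
    texOut[__mul24(y, width) + x] = 0;
}

// Hypothetical helper: returns the kernel time in milliseconds, or -1 on
// allocation failure. Assumes width is a multiple of 32 and height of 12.
float timeInitializeMem(int width, int height) {
    int* dImage = 0;
    size_t bytes = (size_t)width * height * sizeof(int);
    if (cudaMalloc((void**)&dImage, bytes) != cudaSuccess) return -1.0f;

    dim3 threads(32, 12, 1);
    dim3 blocks(width / 32, height / 12, 1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time setup cost is not measured.
    initializeMem<<<blocks, threads>>>(dImage, width);

    cudaEventRecord(start, 0);
    initializeMem<<<blocks, threads>>>(dImage, width);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dImage);
    return ms;
}
```

Calling `timeInitializeMem(3840, 3840)` reproduces the configuration above: 120 x 320 blocks of 32 x 12 threads.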

Thanks for replying!

I’ve gotten to just under 70 GB/s with 2D grids. Using “thinner” blocks (for example, 128x2 vs 16x16) and reading 64-bit elements both help performance.
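As a rough sketch of what that could look like for the write-only kernel in this thread, the version below uses 128x2 blocks and stores one `int2` (64 bits) per thread. The names `initializeMem64` and `launchInitializeMem64` are hypothetical, and applying the 64-bit idea to writes rather than reads is an assumption; the actual benchmark code may differ.

```cuda
#include <cuda_runtime.h>

// Each thread clears two adjacent pixels with a single 64-bit store.
__global__ void initializeMem64(int2* out, int widthInPairs) {
    int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
    out[__mul24(y, widthInPairs) + x] = make_int2(0, 0);
}

// Hypothetical launcher. Assumes width is a multiple of 256, height is even,
// and dImage comes from cudaMalloc (so it is suitably aligned for int2).
void launchInitializeMem64(int* dImage, int width, int height) {
    int widthInPairs = width / 2;                     // two ints per int2
    dim3 threads(128, 2, 1);                          // "thin" block
    dim3 blocks(widthInPairs / 128, height / 2, 1);
    initializeMem64<<<blocks, threads>>>((int2*)dImage, widthInPairs);
}
```

For the 3840x3840 image this launches 15 x 1920 blocks, and each row of threads in a block writes one contiguous 1 KB stretch of a single image row.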