// Fills a width-by-height image with zeros; one thread per pixel.
__global__ void initializeMem(int* texOut, int width) {
    int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;  // pixel column
    int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;  // pixel row
    int index = __mul24(y, width) + x;                      // flattened index
    texOut[index] = 0;
}
Memory size: a 3840x3840 image of 32-bit ints (3840 x 3840 x 4 bytes ≈ 56.25 MiB)
Kernel configuration: threads (32, 12, 1) x blocks (120, 320, 1)
I’m getting these results:
Time: 1.695416ms
Rate: 8697.33 Mpixels/sec
Bandwidth: 34.789329 GB/s
Shouldn’t the bandwidth be about 70 GB/s?
Note: those results were obtained with CUDA 1.1 on a dedicated 8800 GTX.
My write-only tests in the post above reach 50 GiB/s. I’m not sure why yours is so slow; the only real difference is that I’m using 1D grids and you are using 2D. Also, to compare numbers, make sure you are calculating GiB/s (bytes / s / 1024^3), not just dividing by 1e9 for GB.
I used the timing functions from CUT instead of cudaEventRecord etc. That was my error. Also, the 2D grid does contribute to the degraded performance: I took your benchmarks and adapted them to use 2D grids.
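For comparison, a 1D-grid version of the same fill kernel is a small change (a sketch; the name `initializeMem1D` and the 256-thread block size are illustrative, not from the posts above):

```cuda
// 1D-grid variant: each thread writes one element via a flat index.
__global__ void initializeMem1D(int* texOut, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard in case n is not a multiple of the block size
        texOut[i] = 0;
}

// Launch for a 3840x3840 image, e.g.:
//   int n = 3840 * 3840;
//   initializeMem1D<<<(n + 255) / 256, 256>>>(d_texOut, n);
```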