I was curious about this problem myself, so I wrote a test.
The lookup table is 1,048,576 floats.
The source data is 33,554,432 integers, selected randomly from the range 0 to 1,048,575.
The program reports the following (9800 GTX, Intel Core2 Duo E8400 @ 3.0GHz, with 4GB ram, Vista 32 bit):
Total time including copy to/from device: 185.089584 (ms)
Kernel time: 0.057689 (ms)
Time to verify results: 88.405998 (ms)
I’m using a “naive” kernel implementation that makes no use of textures:
__global__ void naiveLUT(float *lut, int *input, float *output) {
int idx = (blockIdx.y*BLOCKSX + blockIdx.x)*THREADSPERBLOCK + threadIdx.x;
output[idx] = lut[input[idx]];
}
One question I have: the kernel time seems unreasonably low, given a device-to-device bandwidth of “only” 45,744 MB/s. Can an expert explain the low number?
Regardless, using the bandwidths reported by the bandwidth test, I calculate that the transfers to and from the device should take roughly 160 ms. Subtracting that from the 185 ms total leaves very roughly 25 ms (probably plus or minus 80%) for the lookup on the device, so the kernel itself is still substantially faster than the host-side lookup; by exactly how much is unclear.
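For reference, the arithmetic behind those estimates can be sketched in plain C. The ~1.6 GB/s pageable transfer rate is my assumption, back-derived from the ~160 ms figure; the kernel bound counts only the unavoidable 12 bytes of global-memory traffic per element, ignoring the penalty for uncoalesced LUT reads:
[codebox]
#include <stdio.h>

/* MB of data crossing the PCIe bus in one run: 4 MB LUT + 128 MB of
   indices host->device, plus 128 MB of results device->host. */
static double transfer_mb(void) {
	double lut    = 1024.0 * 1024 * 4;
	double input  = 32.0 * 1024 * 1024 * 4;
	double output = 32.0 * 1024 * 1024 * 4;
	return (lut + input + output) / (1024.0 * 1024);
}

/* Minimum global-memory traffic for the kernel: each of the 32M elements
   needs at least a 4 B index read, a 4 B LUT read, and a 4 B result write. */
static double kernel_mb(void) {
	return 32.0 * 1024 * 1024 * 12 / (1024.0 * 1024);
}

int main(void) {
	double host_bw = 1600.0;  /* assumed pageable transfer rate, MB/s */
	double dev_bw = 45744.0;  /* device-to-device rate from the bandwidth test, MB/s */
	printf("transferred: %.0f MB -> about %.1f ms at %.0f MB/s\n",
	       transfer_mb(), transfer_mb() / host_bw * 1000, host_bw);
	printf("kernel traffic: %.0f MB -> at least %.1f ms at %.0f MB/s\n",
	       kernel_mb(), kernel_mb() / dev_bw * 1000, dev_bw);
	return 0;
}
[/codebox]
That works out to 260 MB of transfers and a kernel floor of several milliseconds, which is why 0.057 ms looks too low to me.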
This LUT is 4MB in size. Larger LUTs should perform equally well on the GPU since there is no caching and each pixel goes to global device memory. On the host, larger LUTs will perform worse and worse, since less and less will fit in the caches. Cache is fast. Main memory is slower than most people think.
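To see the host-side cache effect, here is a quick micro-benchmark sketch (hypothetical, not run on the same machine) that grows the LUT while keeping the number of lookups fixed:
[codebox]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time one pass of applying a float LUT of lutSize entries to n random
   indices; returns milliseconds (or -1 on allocation failure). */
static double lutBenchMs(int lutSize, int n) {
	float *lut = (float *)malloc(lutSize * sizeof(float));
	int *idx = (int *)malloc(n * sizeof(int));
	float *out = (float *)malloc(n * sizeof(float));
	if (!lut || !idx || !out) { free(lut); free(idx); free(out); return -1.0; }
	for (int i = 0; i < lutSize; i++) lut[i] = (float)i;
	for (int i = 0; i < n; i++) idx[i] = rand() % lutSize;
	clock_t t0 = clock();
	for (int i = 0; i < n; i++) out[i] = lut[idx[i]];
	double ms = (double)(clock() - t0) * 1000.0 / CLOCKS_PER_SEC;
	volatile float sink = out[n - 1]; (void)sink; /* keep the loop from being optimized away */
	free(lut); free(idx); free(out);
	return ms;
}

int main(void) {
	int n = 1 << 22; /* 4M lookups, smaller than the CUDA test for speed */
	/* Times should climb noticeably once the LUT outgrows the CPU caches. */
	for (int lutSize = 1 << 10; lutSize <= 1 << 20; lutSize <<= 2)
		printf("LUT %8d entries: %6.1f ms\n", lutSize, lutBenchMs(lutSize, n));
	return 0;
}
[/codebox]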
The full program is below:
[codebox]
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cutil_inline.h>
static const int LUTSIZE = 1024*1024;
static const int IMGSIZE = 32*1024*1024;
static const int THREADSPERBLOCK = 64;
static const int NBLOCKS = IMGSIZE/THREADSPERBLOCK;
static const int BLOCKSX = 32768; // greatest allowed power of 2
static const int BLOCKSY = NBLOCKS/BLOCKSX;
__global__ void naiveLUT(float *lut, int *input, float *output) {
int idx = (blockIdx.y*BLOCKSX + blockIdx.x)*THREADSPERBLOCK + threadIdx.x;
output[idx] = lut[input[idx]];
}
int myCheck(const char *msg) {
cudaError_t err = cudaGetLastError();
if (cudaSuccess != err) { printf("error %s: %s.\n", msg, cudaGetErrorString(err)); return 1; }
return 0;
}
void lutTest() {
cudaSetDevice(cutGetMaxGflopsDeviceId());
float *lutData = (float *)malloc(LUTSIZE * sizeof(float));
if (!lutData) { printf("lutData alloc failed\n"); return; }
for (int i=0; i < LUTSIZE; i++) {
lutData[i] = i+sin((float)i);
}
int *imgData = (int *)malloc(IMGSIZE * sizeof(int));
if (!imgData) { printf("imgData alloc failed\n"); return; }
for (int i=0; i < IMGSIZE; i++) {
imgData[i] = rand() % LUTSIZE; // caveat: if RAND_MAX is only 32767 (as with MSVC), this exercises just the first 32K LUT entries
}
float *resultData = (float *)malloc(IMGSIZE * sizeof(float));
if (!resultData) { printf("resultData alloc failed\n"); return; }
float *dev_lut;
cudaMalloc((void**) &dev_lut, LUTSIZE * sizeof(float));
if (myCheck("cudaMalloc 1")) { return; }
int *dev_input;
cudaMalloc((void**) &dev_input, IMGSIZE * sizeof(int));
if (myCheck("cudaMalloc 2")) { return; }
float *dev_result;
cudaMalloc((void**) &dev_result, IMGSIZE * sizeof(float));
if (myCheck("cudaMalloc 3")) { return; }
unsigned int timer1 = 0;
cutilCheckError(cutCreateTimer(&timer1));
unsigned int timer2 = 0;
cutilCheckError(cutCreateTimer(&timer2));
cutilCheckError(cutStartTimer(timer1));
cudaMemcpy(dev_lut, lutData, LUTSIZE * sizeof(float), cudaMemcpyHostToDevice);
if (myCheck("cudaMemcpy 1")) { return; }
cudaMemcpy(dev_input, imgData, IMGSIZE * sizeof(int), cudaMemcpyHostToDevice);
if (myCheck("cudaMemcpy 2")) { return; }
cutilCheckError(cutStartTimer(timer2));
dim3 threads(THREADSPERBLOCK, 1, 1);
dim3 grid(BLOCKSX, BLOCKSY, 1);
naiveLUT<<< grid, threads >>>(dev_lut, dev_input, dev_result);
if (myCheck("executing naiveLUT")) { return; }
cutilCheckError(cutStopTimer(timer2));
cudaMemcpy(resultData, dev_result, IMGSIZE * sizeof(float), cudaMemcpyDeviceToHost);
if (myCheck("cudaMemcpy 3")) { return; }
cutilCheckError(cutStopTimer(timer1));
printf("Total time including copy to/from device: %f (ms)\n", cutGetTimerValue(timer1));
cutilCheckError(cutDeleteTimer(timer1));
printf("Kernel time: %f (ms)\n", cutGetTimerValue(timer2));
cutilCheckError(cutDeleteTimer(timer2));
unsigned int timer3 = 0;
cutilCheckError(cutCreateTimer(&timer3));
cutilCheckError(cutStartTimer(timer3));
for (int i=0; i < IMGSIZE; i++) {
if (resultData[i] != lutData[imgData[i]]) {
printf("discrepancy at %d: %f, %f\n", i, resultData[i], lutData[imgData[i]]);
break;
}
}
cutilCheckError(cutStopTimer(timer3));
printf("Time to verify results: %f (ms)\n", cutGetTimerValue(timer3));
cutilCheckError(cutDeleteTimer(timer3));
cudaFree(dev_lut); cudaFree(dev_input); cudaFree(dev_result);
free(lutData); free(imgData); free(resultData);
}
[/codebox]
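For completeness, a texture-fetch variant could look roughly like this (untested sketch; `texLUT` is my name for it). With purely random indices the texture cache probably won't help, but for image data with spatial locality it might:
[codebox]
texture<float, 1, cudaReadModeElementType> lutTex;

__global__ void texLUT(int *input, float *output) {
	int idx = (blockIdx.y*BLOCKSX + blockIdx.x)*THREADSPERBLOCK + threadIdx.x;
	output[idx] = tex1Dfetch(lutTex, input[idx]);
}

// host side, before the launch, instead of passing dev_lut as an argument:
//   cudaBindTexture(0, lutTex, dev_lut, LUTSIZE * sizeof(float));
[/codebox]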