Device memory latency and lookup tables - please advise

I have been investigating CUDA before purchasing a Tesla or GTX 280 board, but unfortunately I don’t think it will work for my application, i.e. I don’t believe I will see any performance increase.
But before giving up I hope to get further input.

The problem I see is that my application has a large lookup table (megabytes) that must be read/write accessed randomly by threads, i.e. ANY given thread might access ANY address in the lookup table. The values cannot be calculated on the fly by the GPU. Since there is a high latency associated with random access to the board’s device memory (200 cycles), it seems like that will eliminate any possible performance gain. I cannot guarantee that threads in a block will access addresses within 16 KB of each other, so I don’t think that shared memory will be of help.

Please let me know if there is something I’m overlooking because I’d really like to be able to use Tesla and or GTX 280.

It will be hard to get a big speed-up, for sure. IMHO, you may still benefit from CUDA because: 1. a large number of concurrent threads running on the same multiprocessor can ‘hide’ the memory access latency from each other; 2. the memory access latency/bandwidth on a G80/GT200 card should in any case be better than that of the host.

Each thread would have to create a bundle of memory accesses, which I’m assuming is not possible. Further, my quick calculation says that this latency is absolutely abysmal. Assuming the published clock of 1.3 GHz and the 200-cycle latency, a single thread could only make 6.5 M memory accesses per second, which is orders of magnitude slower than a CPU accessing RAM. Because each thread’s access to the lookup table is essentially random, this seems to more than negate any possible performance gain.

Have you multiplied this number by the number of active threads that are working on all the multiprocessors?
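
To make that concrete, here is a rough back-of-envelope calculation (a sketch only; the GTX 280 figures assumed here are 30 multiprocessors, up to 1024 resident threads each, a ~1.3 GHz shader clock, ~141 GB/s peak bandwidth, and a 32-byte minimum memory transaction):

[codebox]
#include <stdio.h>

int main(void) {
    /* requests that can be in flight at once at full occupancy */
    double in_flight = 30.0 * 1024.0;
    /* reads/s if latency were the only limit: in-flight requests retired every 200 cycles */
    double latency_bound = in_flight * 1.3e9 / 200.0;
    /* reads/s if bandwidth is the limit: each random 4-byte read costs a 32-byte transaction */
    double bandwidth_bound = 141.0e9 / 32.0;
    printf("latency-bound:   %.2e reads/s\n", latency_bound);
    printf("bandwidth-bound: %.2e reads/s\n", bandwidth_bound);
    return 0;
}
[/codebox]

Under those assumptions the real ceiling is the roughly 4 billion random 4-byte reads per second imposed by bandwidth, which is far above the 6.5 M/s figure you get by serializing a single thread against the latency.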

Also, since you say any thread can access any location and modify it, you are opening up a Pandora’s box of race conditions: before thread I reads a location, thread J could modify it, and vice versa, resulting in significantly different results.

How would you handle such a situation? Even if you use atomic operations, there is still no guarantee whether thread I or thread J will win the lock. So the problem remains.

You need to think about this carefully, because global memory synchronization is far more expensive than plain global memory access.

These race conditions are NOT CUDA specific. They would exist even if you are on a multi-core CPU box.
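
For what it’s worth, if the table held integers, per-element atomics would at least make each read-modify-write indivisible; they still impose no ordering between threads. A minimal sketch (assuming a compute capability 1.1+ device and an integer table; the kernel name and launch shape are just illustrative):

[codebox]
// Each update to lut[input[idx]] is serialized per address by the hardware,
// but which thread "wins" last remains unspecified.
__global__ void atomicIncrementLUT(int *lut, const int *input, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&lut[input[idx]], 1);
}
[/codebox]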

If your problem is the buying decision, then try your idea first on some other GPU. NVIDIA boards are everywhere, and you can surely find one cheap second-hand or at a friend’s house. For testing ideas, any CUDA-capable board is OK. Of course, this will not compensate for your work hours.

No modern processor will give you optimal performance for completely random memory accesses. In CUDA, using textures will give you good performance for lookup tables.
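
A minimal sketch of what that could look like with the texture-reference API of that era (the names lutTex/texLUT are made up here, and it assumes the table fits within the linear-memory limit of tex1Dfetch):

[codebox]
texture<float, 1, cudaReadModeElementType> lutTex;   // bound to the LUT in device memory

__global__ void texLUT(const int *input, float *output, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        output[idx] = tex1Dfetch(lutTex, input[idx]);   // read through the texture cache
}

// Host side, before launching:
//   cudaBindTexture(0, lutTex, dev_lut, lutSizeInBytes);
[/codebox]

The texture path mainly helps when nearby lookups hit the small per-multiprocessor texture cache; for truly scattered indices it behaves much like a plain global read.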

Yes, listen to him, he knows what he is talking about.

If all you do is sit around and worry and calculate how slow the latency is, you’ll never get anywhere. Just get a cheap CUDA card and write up a simplified version of your code. Within a few days, you’ll have a working version and you will most likely be pleasantly surprised with your results. GPUs have an insane amount of bandwidth available: a GTX 280 gets 100+ GiB/s under optimal conditions. With fully random accesses (from a texture fetch, mind you), you will probably get ~30 GiB/s.

I’m not making these numbers up either. My application’s bottleneck has (naively) completely random memory accesses, and I was still getting ~15-20x faster than the host. Of course, when I optimized the order of data in memory so that it was no longer as random, I got a nice 3-5x speedup over the initial version, but the benchmarks with completely random data do show some hope for you.

I use Mac Minis, and they are great, but they don’t allow for expansion boards. Purchasing a new computer with enough memory plus a GTX 280 is an outlay I won’t make unless I’m sure that I’m not wasting money and time. The Macs will suffice for non-CUDA development for now.

The texture memory space is too small, if I’m not mistaken. It seems like these graphics boards are great if one is generating data, but definitely not a solution for necessarily random I/O to a large lookup table.

I wish I did not have this requirement, or that device memory was shared. Someone mentioned synchronization issues… that is not my issue; I don’t care if multiple threads read/write the same LUT location, or in what order they access it.

I was curious about this problem myself, so I wrote a test.

The lookup table is 1,048,576 floats.

The source data is 33,554,432 integers, selected randomly from the range 0 to 1,048,575.

The program reports the following (9800 GTX, Intel Core 2 Duo E8400 @ 3.0 GHz, 4 GB RAM, Vista 32-bit):

Total time including copy to/from device: 185.089584 (ms)

Kernel time: 0.057689 (ms)

Time to verify results: 88.405998 (ms)

I’m using a “naive” kernel implementation that makes no use of textures:

[codebox]
__global__ void naiveLUT(float *lut, int *input, float *output) {
    int idx = (blockIdx.y*BLOCKSX + blockIdx.x)*THREADSPERBLOCK + threadIdx.x;
    output[idx] = lut[input[idx]];
}
[/codebox]

One question I have is that the kernel time seems unreasonably low, given a device to device bandwidth of “only” 45,744 MB/s. Is there an expert who can explain the low number?

Regardless, using the bandwidths reported in the bandwidth test, I calculate that the time to transfer to/from the device should be roughly 160 ms, meaning that the kernel itself takes very roughly 25 ms (probably plus or minus 80%) and that the LUT on the device is substantially faster than on the host. By how much is unclear.

This LUT is 4MB in size. Larger LUTs should perform equally well on the GPU since there is no caching and each pixel goes to global device memory. On the host, larger LUTs will perform worse and worse, since less and less will fit in the caches. Cache is fast. Main memory is slower than most people think.

The full program is below:

[codebox]
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cutil_inline.h>

static const int LUTSIZE = 1024*1024;
static const int IMGSIZE = 32*1024*1024;
static const int THREADSPERBLOCK = 64;
static const int NBLOCKS = IMGSIZE/THREADSPERBLOCK;
static const int BLOCKSX = 32768; // greatest allowed power of 2
static const int BLOCKSY = NBLOCKS/BLOCKSX;

__global__ void naiveLUT(float *lut, int *input, float *output) {
    int idx = (blockIdx.y*BLOCKSX + blockIdx.x)*THREADSPERBLOCK + threadIdx.x;
    output[idx] = lut[input[idx]];
}

int myCheck(const char *msg) {
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) { printf("error %s: %s.\n", msg, cudaGetErrorString(err)); return 1; }
    return 0;
}

void lutTest() {
    cudaSetDevice(cutGetMaxGflopsDeviceId());

    // Fill the lookup table with arbitrary values.
    float *lutData = (float *)malloc(LUTSIZE * sizeof(float));
    if (!lutData) { printf("lutData alloc failed\n"); return; }
    for (int i=0; i < LUTSIZE; i++) {
        lutData[i] = i + sin((float)i);
    }

    // Random lookup indices.
    int *imgData = (int *)malloc(IMGSIZE * sizeof(int));
    if (!imgData) { printf("imgData alloc failed\n"); return; }
    for (int i=0; i < IMGSIZE; i++) {
        imgData[i] = rand() % LUTSIZE;
    }

    float *resultData = (float *)malloc(IMGSIZE * sizeof(float));
    if (!resultData) { printf("resultData alloc failed\n"); return; }

    float *dev_lut;
    cudaMalloc((void**) &dev_lut, LUTSIZE * sizeof(float));
    if (myCheck("cudaMalloc 1")) { return; }

    int *dev_input;
    cudaMalloc((void**) &dev_input, IMGSIZE * sizeof(int));
    if (myCheck("cudaMalloc 2")) { return; }

    float *dev_result;
    cudaMalloc((void**) &dev_result, IMGSIZE * sizeof(float));
    if (myCheck("cudaMalloc 3")) { return; }

    unsigned int timer1 = 0;
    cutilCheckError(cutCreateTimer(&timer1));
    unsigned int timer2 = 0;
    cutilCheckError(cutCreateTimer(&timer2));

    // timer1 covers the copies plus the kernel; timer2 covers the kernel alone.
    cutilCheckError(cutStartTimer(timer1));
    cudaMemcpy(dev_lut, lutData, LUTSIZE * sizeof(float), cudaMemcpyHostToDevice);
    if (myCheck("cudaMemcpy 1")) { return; }
    cudaMemcpy(dev_input, imgData, IMGSIZE * sizeof(int), cudaMemcpyHostToDevice);
    if (myCheck("cudaMemcpy 2")) { return; }

    cutilCheckError(cutStartTimer(timer2));
    dim3 threads(THREADSPERBLOCK, 1, 1);
    dim3 grid(BLOCKSX, BLOCKSY, 1);
    naiveLUT<<< grid, threads >>>(dev_lut, dev_input, dev_result);
    if (myCheck("executing naiveLUT")) { return; }
    cutilCheckError(cutStopTimer(timer2));

    cudaMemcpy(resultData, dev_result, IMGSIZE * sizeof(float), cudaMemcpyDeviceToHost);
    if (myCheck("cudaMemcpy 3")) { return; }
    cutilCheckError(cutStopTimer(timer1));

    printf("Total time including copy to/from device: %f (ms)\n", cutGetTimerValue(timer1));
    cutilCheckError(cutDeleteTimer(timer1));
    printf("Kernel time: %f (ms)\n", cutGetTimerValue(timer2));
    cutilCheckError(cutDeleteTimer(timer2));

    // Verify on the host; this loop also serves as the host-side LUT benchmark.
    unsigned int timer3 = 0;
    cutilCheckError(cutCreateTimer(&timer3));
    cutilCheckError(cutStartTimer(timer3));
    for (int i=0; i < IMGSIZE; i++) {
        if (resultData[i] != lutData[imgData[i]]) {
            printf("discrepancy at %d: %f, %f\n", i, resultData[i], lutData[imgData[i]]);
            break;
        }
    }
    cutilCheckError(cutStopTimer(timer3));
    printf("Time to verify results: %f (ms)\n", cutGetTimerValue(timer3));
    cutilCheckError(cutDeleteTimer(timer3));
}

int main() {
    lutTest();
    return 0;
}
[/codebox]

You need to put a cudaThreadSynchronize in before you stop the timer…
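
For reference, a sketch of where the synchronize goes, reusing the timer names from the posted program (kernel launches are asynchronous, so without it the timer only measures launch overhead):

[codebox]
naiveLUT<<< grid, threads >>>(dev_lut, dev_input, dev_result);
if (myCheck("executing naiveLUT")) { return; }
cudaThreadSynchronize();                 // wait for the kernel to actually finish
cutilCheckError(cutStopTimer(timer2));   // timer2 now covers the kernel, not just the launch
[/codebox]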

Thank you, this also explains some of my other problems…

Here are the new results:

Total time including copy to/from device: 197.509460 (ms)

Kernel time: 29.058861 (ms)

Time to verify results: 94.217072 (ms)

So this naive 2-line implementation on the device executes about 3x as fast as on the host. Pretty good if you ask me.

Hmm, I tried a similar test on my Mac Mini 1.8 GHz, with the same number (33,554,432) of random memory accesses, and it took 319 ms with 1 thread. I’m not sure if 2 threads would make a difference, and I was not able to try it yet because of a compile error with the Boost library I’m using. So it looks to me like CUDA beats a Mac Mini by perhaps 1.5x. So although that is a performance increase, I don’t think there would be any performance gain for my app CUDA-enabled running on a Tesla versus my app without CUDA on a Mac Pro or an Intel i7 box.

Given that the GTX 280 only has 1 GB of memory, for my particular application I would be forced to make compromises in the size of my lookup tables that I’d rather not make. With the 4 GB Tesla I might not have to make those compromises (or at least not as drastically; I desire even more memory than 4 GB to not feel constrained). However, the Tesla was $1550 last time I checked. Then one must also add the cost of a computer that will work with it. But by far the worst thing is the complexity using CUDA adds to my application. I think I would need to feel assured of at least a 5x-10x performance increase before it becomes worth the extra effort and money. I don’t see it happening, but I wanted to run the question by people here with CUDA experience, in case I was missing something. Maybe I still am???

I was really excited by CUDA because I originally (but probably naively) thought that the Device memory was r/w randomly accessible at much faster speeds.

I read today that GTX 285s are coming with 2 GB of memory. Not sure at what speed that will run, but I doubt it will be much slower.

Don’t forget that the (theoretical) bandwidth of the GTX 285 is twice that of the 9800 GTX!

One thing that isn’t clear: when you do a random read, how big a block are you reading at that random location? Just 4 bytes (for one float)? If you need to read at least 64-byte blocks, you could probably come close to full utilization of the memory bus, even with random reads.
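
If each random index really does pull a multi-byte record rather than a single float, wider per-thread loads waste less of each memory transaction. A minimal sketch for 16-byte records using float4 (just an illustration; it assumes the table is laid out and aligned as float4):

[codebox]
// Sketch only: each random index fetches a 16-byte float4 record in a single
// wide load, so less of each memory transaction is wasted on random reads.
__global__ void lutFetch16(const float4 *lut, const int *input, float4 *output, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        output[idx] = lut[input[idx]];   // one float4 (16-byte) read per random index
}
[/codebox]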

It’s not the number of threads, or even the clock frequency, that let Jamie’s Core 2 Duo E8400 verify the data in less than 1/3 of the time your 1.8 GHz Mac Mini spent. It’s mostly the 6 MB L2 cache in the E8400, and the time would increase as the size of the LUT grows, like he pointed out, due to a growing cache miss rate.

Also, the speed-up reported by Jamie is not 94.2 / 197.5, it’s 94.2 / 29. On my GTX 280 card, with a 4 MB (float) LUT and 32M data points, the kernel time is 12.6 ms; that’s almost an 8x speed-up over the E8400 (btw, increasing the LUT size to 8 MB doesn’t change the kernel time). Moreover, when I used a texture for the LUT there were no further speed gains for large LUTs, but for LUTs smaller than 4K a 2x speed-up over Jamie’s naiveLUT kernel was observed. And this is exactly the effect of the 16K per-multiprocessor texture cache.

First, thanks for running Jamie’s test on your GTX 280.

We’re not really comparing apples to apples here, because Jamie’s random distribution is clustered. When using the same random function as Jamie, my Mac Mini performs the 32M random accesses in 170 ms. I’d be interested to see your results when using:

[codebox]
int ii = 0;
do {
    for (int i=0; i < LUTSIZE; i++) {
        imgData[ii*LUTSIZE + ( (rand()%1024) * (rand()%1024) )] = i;
        imgData[ii*LUTSIZE + ( (rand()%1024) * (rand()%1024) )] = LUTSIZE-1-i;
    }
    ii++;
} while (ii < 32);
[/codebox]

Please increase the size of the imgData allocation by 1, or there can be an out-of-bounds condition.

Also, regarding Jamie’s example: I didn’t look carefully at his source before, but I see now that he is measuring the time to transfer from host memory to device memory and back. It appears that the kernel time is the performance I would be looking at if the tables are contained “in” the 1 GB or 4 GB device memory. That would be great if true, but it seems inconsistent with NVIDIA’s published statement that random device memory access latency is 200 clock cycles. Maybe there are performance increases when the accesses aren’t truly random and diverse??? Again, I think it will be telling to run the test again using my randomizing code above.

I’m not quite sure what you mean by clustered.

Also, I haven’t run your code but it looks as if not all elements of imgData will be initialized. If imgData is set to zero in the uninitialized locations, then on the host the lookup will hit in cache and be faster. If your data really has this characteristic then a GPU kernel could be made with a special case for zero, and I would expect correspondingly higher performance if a significant fraction of the elements had that value.
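
A minimal sketch of that special case (the kernel name is hypothetical; lutZero would be lut[0] passed in from the host):

[codebox]
__global__ void lutWithZeroFastPath(const float *lut, const int *input,
                                    float *output, float lutZero, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int k = input[idx];
        // skip the global-memory lookup entirely for the common zero index
        output[idx] = (k == 0) ? lutZero : lut[k];
    }
}
[/codebox]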

By “clustered” I mean that your random values will occur much more often (from a probability perspective) in certain ranges than in others. These values, per your formula, when plotted, will not distribute anywhere close to evenly over the entire range of 1 to 1M (for the vast majority of random value sets calculated); e.g. random values in the range of 1-10,000 will occur much less frequently than random values in the range of 500,000-510,000. As for the elements not initialized, please do an initial pass through all elements and initialize each to a random value per your original formula, then apply the second pass with mine. That was an omission on my part.

In order for these tests to best simulate the conditions of my application, the aim must be to closely simulate random values (representing the lookup addresses) that are evenly distributed across the entire range of possible addresses. Maybe a better way would have been to assign each address (0-1048575, 32 times each) and then shuffle the deck, so to speak. E.g. in a video frame taken from the real world, one cannot predict what one pixel value will be versus the next. (Mac Mini performance for the code below is 315 ms ± 5 ms with a single thread, and 217 ms with 2 threads.)

[codebox]
for (int i=0; i < IMGSIZE; i++) {
    imgData[i] = rand() % LUTSIZE;
}

int ii = 0;
do {
    for (int i=0; i < LUTSIZE; i++) {
        imgData[ii*LUTSIZE + ( (rand()%1024) * (rand()%1024) )] = i;
        imgData[ii*LUTSIZE + ( (rand()%1024) * (rand()%1024) )] = LUTSIZE-1-i;
    }
    ii++;
} while (ii < 32);
[/codebox]
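
The “shuffle the deck” idea mentioned above could be sketched like this (every LUT address appears exactly IMGSIZE/LUTSIZE times, then a Fisher-Yates shuffle spreads them uniformly; note that plain rand() may have a RAND_MAX far smaller than IMGSIZE, so a better generator would be needed for a truly uniform shuffle):

[codebox]
for (int i = 0; i < IMGSIZE; i++) {
    imgData[i] = i % LUTSIZE;                // each address used exactly 32 times
}
for (int i = IMGSIZE - 1; i > 0; i--) {      // Fisher-Yates shuffle
    int j = rand() % (i + 1);                // caveat: limited by RAND_MAX
    int tmp = imgData[i];
    imgData[i] = imgData[j];
    imgData[j] = tmp;
}
[/codebox]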

It would be great if you could re-run your test with the randomizing changes I suggested. And if you still see that performance, then reconcile it for me against NVIDIA’s published latency. 12.6 ms would be nice, but I don’t understand how you could get it with random memory reads. Whereas Jamie’s test went from host to device and back to host, I’m really interested in reading a device-memory LUT and writing a device-memory LUT with random accesses. And I’m very interested in the GTX 280 results. Thanks.