I found this article by the author of an open-source bitset library that compares the performance of some bit-manipulation routines run on the plain CPU, then with SSE2, and finally with CUDA. It includes some tips on CUDA optimizations, so I thought it might interest some of you:
The CUDA timings include the transfers back and forth, although it is not clear where the data originally came from or where it is supposed to go.
Anyway, the innermost loop got mangled in the HTML translation. This is what I think was intended:
[codebox]cutilCheckError(cutCreateTimer(&timer));
cutilCheckError(cutStartTimer(timer));
for (unsigned int i = 0; i < 10000; i++)
{
    randomInit(h_A, size_B * PROC, i);
    cutilSafeCall(cudaMemcpyToSymbol(c_A, h_A, mem_size_B, 0, cudaMemcpyHostToDevice));
    dim3 trs3(32, 16);
    TM<<<PROC, trs3>>>(d_B, d_C);
}
cutilSafeCall(cudaThreadSynchronize());
cutilCheckError(cutStopTimer(timer));[/codebox]
Processing time: 0.116994 (ms)
on a GT220, PCIe 1.0a
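For reference, the cut* timer functions come from the cutil helper library that shipped with the old SDK samples and was later dropped. A rough sketch of the same measurement using CUDA events instead (the kernel, symbol, and buffer names match the snippet above, but the surrounding setup is assumed):

[codebox]// Sketch only: assumes c_A, h_A, d_B, d_C, mem_size_B, size_B, PROC,
// randomInit() and the TM kernel are declared as in the snippet above.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (unsigned int i = 0; i < 10000; i++)
{
    randomInit(h_A, size_B * PROC, i);
    cudaMemcpyToSymbol(c_A, h_A, mem_size_B, 0, cudaMemcpyHostToDevice);
    dim3 trs3(32, 16);
    TM<<<PROC, trs3>>>(d_B, d_C);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // wait for all queued work to finish

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]

Note that either way the measured time covers all 10000 iterations, including the host-to-device symbol copies, so divide by the iteration count for a per-iteration figure.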
Wow, with the GT220 the price of a brand-new entry-level CUDA card is now $70 USD. I'm still going to get a GTX 260 since it has 216 cores as opposed to the GT220's 48, but that's very cool. The next big question will be finding out whether I can stuff multiple GTX 260s into a Windows 7 box without Win7 choking on driver issues. Also, my motherboard is P55-based rather than X58, so PCIe slots will be an issue and I'll be lucky to fit even two in the same box. The next system I build will be X58-based, and with that one I'm going to see if I can get up to four GTX 260s in the same box. Of course, this is all a pipe dream if the cards can't coexist in the same box due to their video display components. But if it works, can you imagine having 864 cores in a box that costs about $1500 USD? That's unreal.
Robert.