CUDA program results are always zero in HW, correct in EMU?

Hi all!

I am having a weird problem … I have written CUDA code which executes correctly in emulation and all results show up … however, when executed on hardware (a G210) … the results in the result memory are always 0.

I am passing two vectors to the kernel, one filled with random values, the other initialized to zero. The code copies the first vector to shared memory, does some swapping and other operations, and then writes the results back to the second vector (the one with the initial 0’s).

I am using double precision, the -arch sm_13 flag is used, and all memory allocations also use sizeof(double) …

I have checked that the kernel is invoked, and it is … so no problems there … the cudaMemcpy calls report no problems either …

What could be the problem … :( Why would it work in emulation but not on HW?

I am quite confused … any ideas?

Your card is compute capability 1.2, not 1.3, and does not support double precision floating point. That might have something to do with it…

The G210 doesn’t support double precision…try it in single precision and see what happens…

EDIT: avidday beat me to it…

You are both right … thanks … I did it in single precision and it all worked fine, correct results :)

But I am afraid to say that the GPU code runs slower than the CPU code!! :((

Is there a serious expectation that the slowest discrete GPU NVIDIA makes will be faster than a given CPU? You have 2 multiprocessors and 8 GB/s of global memory bandwidth to play with. That isn’t much.

I have previously written code for tiled matrix convolution of arbitrary-size matrices … and tiled matrix multiplication for large matrices, and I had nice results on these cards. But this particular code isn’t getting even a slight speedup … only worse performance … !! And my professor expects me to impress him with a speedup on a “two-node cluster” of these low-end GPUs … he won’t take no for an answer and he won’t listen … :(

I understand that your professor won’t listen. But try to tell him that, in order to speed up a calculation, it is essential that there is a calculation at all. A task that consists only of reordering data will never be sped up by additionally moving the data twice through a PCIe link that has half the bandwidth of main memory, and a fiftieth of the bandwidth of the L2 cache where the data in this benchmark (?) likely resides.

Alternatively, tell him to buy you two Fermi cards so that you actually have a 2-node system. :)

Your code cannot possibly be very fast compared to the CPU. If your professor is too stubborn to accept that not every problem can be magically sped up 100x with CUDA, show him some calculations.

Your kernel basically does one global memory read, very little local arithmetic and shared memory shuffling, and one global memory write. It is completely bandwidth bound. For a kernel to be compute bound you’d need something on the order of 50 arithmetic operations per memory access, if not more. When a kernel is bandwidth bound, the only number that counts is GB/s. Forget about FLOPS; they only become important as your kernel approaches being compute bound.

Your GPU has a peak memory bandwidth of around 8 GB/s. My desktop PC’s CPU with DDR2 RAM does around 5 GB/s - you can assume 5-10 GB/s as the “standard” bandwidth you get on a non-server CPU (more with an i7 and such).
Plus you need to copy the data over to the GPU and back through PCIe. PCIe 2.0 bandwidth with pageable memory is, on my machine, 1.5 GB/s, and around 3.5 GB/s with pinned memory - let’s take the more optimistic variant.

Now, assume you do computations on an array of 256k floats, 1 MB of data in total.
On the CPU, the time to read 1 MB and then store 1 MB would be:
1 MB / (5 GB/s) + 1 MB / (5 GB/s) = 0.4 ms
The time of the shuffling in L1 cache and whatever collateral arithmetic ops you perform is negligible.

On the GPU the total time is a PCIe transfer in, a RAM read, a RAM write, and a PCIe transfer back:
1 MB / (3.5 GB/s) + 1 MB / (8 GB/s) + 1 MB / (8 GB/s) + 1 MB / (3.5 GB/s) = 0.82 ms

Now you say you have a cluster? So let’s add the time to send and receive 1 MB over the network - say it’s 100 Mb Fast Ethernet, effectively ~10 MB/s:
gpuTime + 1 MB / (10 MB/s) + 1 MB / (10 MB/s) = 0.82 ms + 200 ms = 200 ms+
Amdahl’s law at its finest

See the problem here? It just doesn’t make sense to send two vectors to a GPU just to do a glorified copy, much less over a network; it will never be as fast as the CPU, even if the problem is embarrassingly parallel and you’re exploiting every last bit of parallelism there is.
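If you want to double-check these numbers yourself, here is the same back-of-the-envelope arithmetic as a minimal program (the bandwidth figures are the rough assumptions above, not measurements):

#include <stdio.h>

int main(void)
{
    /* rough bandwidth figures assumed above, in MB per ms (taking 1 GB = 1000 MB) */
    double cpu_bw  = 5.0;    /* ~5 GB/s main memory       */
    double gpu_bw  = 8.0;    /* ~8 GB/s device memory     */
    double pcie_bw = 3.5;    /* ~3.5 GB/s pinned PCIe     */
    double net_bw  = 0.01;   /* ~10 MB/s Fast Ethernet    */
    double mb = 1.0;         /* 1 MB of data              */

    double t_cpu = mb/cpu_bw + mb/cpu_bw;                            /* read + write                */
    double t_gpu = mb/pcie_bw + mb/gpu_bw + mb/gpu_bw + mb/pcie_bw;  /* in + read + write + out     */
    double t_net = t_gpu + mb/net_bw + mb/net_bw;                    /* plus network send + receive */

    printf("CPU: %.2f ms   GPU: %.2f ms   GPU over network: %.0f ms\n", t_cpu, t_gpu, t_net);
    return 0;
}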

Matrix multiplication is different, because if the copy through PCIe is of size N, you do N^2 RAM accesses, so for big enough N it becomes profitable even with slow PCIe and only a slight edge in RAM bandwidth (assuming there’s enough RAM on board). Plus there are some arithmetic operations that can, in some cases and with a proper implementation, actually achieve nice FLOPS.
So for matrices, the rough time calculation would go like:
N MB / (3.5 GB/s) + N^2 MB / (8 GB/s) + N MB / (3.5 GB/s)
The time is quadratic in N: for large N, the coefficient of the N^2 term (RAM bandwidth) matters far more than the coefficients of the linear terms (PCIe bandwidth).
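The same model written out as code, just to make the scaling explicit (again using the rough 3.5 and 8 GB/s figures; n_mb is the amount of data in MB):

double matmul_time_ms(double n_mb)
{
    double pcie_bw = 3.5;   /* GB/s, roughly MB per ms */
    double gpu_bw  = 8.0;   /* device memory bandwidth */
    /* PCIe cost grows linearly with the data size, on-board traffic quadratically,
       so for large n_mb the RAM-bandwidth term dominates */
    return n_mb / pcie_bw + (n_mb * n_mb) / gpu_bw + n_mb / pcie_bw;
}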

Hi all …
I am truly thankful for your help and patience! I am not asking anyone to solve the code for me … not at all … just to confirm your conclusions. I am posting the original code and my version to make the idea clear … any helpful thoughts are welcome …

btw … we move all our files onto the so-called cluster … and execute from there locally … so no network overhead …

To start with … in essence the whole problem is about applying what is called the Gamma matrix (or Dirac matrix) to a complex vector … this problem is related to something in Lattice QCD physics, about which I know nothing … I am not a physicist myself … we just took this 6-week course in CUDA and were given these codes to accelerate …

What was given is that there is this matrix Gamma1, which is to be applied to every element of the vector … the matrix has this shape:

0 0 0 +i
0 0 +i 0
0 -i 0 0
-i 0 0 0

The code for it is given below (obviously it is memory bound, as you said).

After analyzing the code … what it effectively does is swap elements (plus some sign changes) within each group of 24 elements!

This is the CPU code:

void gamma1_x_spinor_DGR(double *input, int sp_idx)
{
    Double2 swap[4];

    for (int c = 0; c < NCLR; c++)
    {
        swap[3].re =   input[sp_idx * (4*3*2) + 0 * (3*2) + c * (2) + 1];
        swap[3].im = - input[sp_idx * (4*3*2) + 0 * (3*2) + c * (2) + 0];

        swap[2].re =   input[sp_idx * (4*3*2) + 1 * (3*2) + c * (2) + 1];
        swap[2].im = - input[sp_idx * (4*3*2) + 1 * (3*2) + c * (2) + 0];

        swap[1].re = - input[sp_idx * (4*3*2) + 2 * (3*2) + c * (2) + 1];
        swap[1].im =   input[sp_idx * (4*3*2) + 2 * (3*2) + c * (2) + 0];

        swap[0].re = - input[sp_idx * (4*3*2) + 3 * (3*2) + c * (2) + 1];
        swap[0].im =   input[sp_idx * (4*3*2) + 3 * (3*2) + c * (2) + 0];

        for (int s = 0; s < NSPN; s++)
        {
            input[sp_idx * (4*3*2) + s * (3*2) + c * (2) + 0] = swap[s].re;
            input[sp_idx * (4*3*2) + s * (3*2) + c * (2) + 1] = swap[s].im;
        }
    }
}
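(Just to spell out why this reduces to swaps: multiplying a complex number by +i or -i simply exchanges the real and imaginary parts and flips one sign. A minimal sketch of the same per-spinor operation, written in my own notation rather than taken from the code above:)

typedef struct { double re, im; } cplx;

static cplx mul_plus_i (cplx z) { cplx r = { -z.im,  z.re }; return r; }  /*  i*(a+bi) = -b + ai */
static cplx mul_minus_i(cplx z) { cplx r = {  z.im, -z.re }; return r; }  /* -i*(a+bi) =  b - ai */

/* apply Gamma1 to the 4 spin components of one colour of the spinor */
void gamma1_apply(cplx s[4])
{
    cplx out[4];
    out[0] = mul_plus_i (s[3]);   /* row 0: +i * s3 */
    out[1] = mul_plus_i (s[2]);   /* row 1: +i * s2 */
    out[2] = mul_minus_i(s[1]);   /* row 2: -i * s1 */
    out[3] = mul_minus_i(s[0]);   /* row 3: -i * s0 */
    for (int k = 0; k < 4; k++)
        s[k] = out[k];
}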

And this is the last version I used (based on Tera’s help with a modification of mine):

#define subVector 192

__global__ void gamma1_Kernel(Matrix M, Matrix N)
{
    // subVector is a multiple of 24
    __shared__ float Mds[subVector];
    int tx = threadIdx.x;
    int basex = blockDim.x * blockIdx.x;

    // I used this convoluted form to determine the sign to avoid divergence (if statements, % and ?:)
    float sign = 2*((tx & 0x0001) ^ (tx > ((tx/24)*24 + 11))) - 1;

    // I need to work on this a little to try to avoid / and %
    int permuted_idx = ((tx/6) ^ 3) * 6 + ((tx%6) ^ 1);

    Mds[tx] = sign * M.elements[basex + permuted_idx];

    __syncthreads();

    N.elements[basex + tx] = Mds[tx];
}
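(For completeness: the Matrix struct and the launch are not shown above; something along the lines of the SDK matrixMul sample is assumed, e.g.:)

typedef struct {
    int    width;
    int    height;
    float *elements;   /* device pointer; the kernel only touches .elements */
} Matrix;

/* hypothetical launch for the 49152-element slice: one thread per element,
   blockDim.x == subVector, so 49152 / 192 = 256 blocks (Md and Nd hold device pointers) */
gamma1_Kernel<<<49152 / subVector, subVector>>>(Md, Nd);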

This is one of many versions I have tried … this is the latest and fastest one …

time on gpu = 0.151946ms
time on cpu = 0.091573ms

Thanks for giving a bit of the bigger picture, which certainly helps in giving advice. Some random things that come to my mind:

  1. Why does your vector have 24 elements? In the example I only see a complex 4-vector = 8 real elements.

  2. You certainly want to do more with the data than just apply the Gamma matrix. The key to decent performance is not to move the data onto the device and back for each operation. In fact, for this simple operation you likely don’t even want to start a separate kernel, but merge it with the previous or next operation.

  3. Don’t do just one matrix multiplication per thread. Instead of optimizing the index and sign calculation, and incurring the thread setup overhead that you cannot do anything about (even though it’s small), compute them once and then loop over several matrix multiplications in each thread (see the sketch after this list).

  4. You can avoid the / and % operators in the setup code by using a two-dimensional block with blockDim.x=6. (You could just drop the %6 anyway; I only left it in for clarity.)

  5. There is no need to use shared memory. Mds[tx] can just as well be an automatic variable, which also shows that there is no need for the __syncthreads() either.

  6. If you really want to do lattice QCD, and not just learn CUDA, get your Prof. to buy you some decent hardware. A few GTX240s, or a GTX260 … GTX295, or a Fermi.
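A minimal sketch of points 3-5 taken together (untested; ELEMS_PER_THREAD is just a name I made up here): no shared memory, no __syncthreads(), and a loop so each thread reuses its sign/index setup for several sub-vectors. It assumes blockDim.x == subVector and a total size that is a multiple of subVector * ELEMS_PER_THREAD:

#define ELEMS_PER_THREAD 8   /* example value, tune it */

__global__ void gamma1_Kernel_looped(Matrix M, Matrix N)
{
    int tx = threadIdx.x;    /* 0 .. subVector-1 */

    /* same sign/index tricks as before, but computed only once per thread */
    float sign = 2*((tx & 0x0001) ^ (tx > ((tx/24)*24 + 11))) - 1;
    int permuted_idx = ((tx/6) ^ 3) * 6 + ((tx%6) ^ 1);

    /* each block now covers ELEMS_PER_THREAD sub-vectors of subVector elements,
       so launch with gridDim.x = totalElements / (subVector * ELEMS_PER_THREAD) */
    int base = blockIdx.x * blockDim.x * ELEMS_PER_THREAD;

    for (int i = 0; i < ELEMS_PER_THREAD; i++) {
        /* offset is a multiple of 24, so the per-24-element sign/index pattern still lines up */
        int offset = base + i * blockDim.x;
        /* result goes straight from a register to global memory, no shared memory needed */
        N.elements[offset + tx] = sign * M.elements[offset + permuted_idx];
    }
}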

I am the one who should be saying thanks a 1000 times …

  1. The input vector is a slice of 49152 elements of a larger vector; sp_idx (the vector index) is incremented from 0 to 2047 and the code is called for each value … as you can see, each call to the CPU code handles 24 elements at once.

  2. Actually the code applies 16 gamma matrices, numbered 1 to 16, serially to the same vector … all do the same thing but with different swap patterns and signs … after each matrix is applied, the elements of the vector are reduced to some value and stored … the reduction from each subsequent matrix is added to the previous one.

  3. I actually did use a two-dimensional block … though the swaps are right, the signs are not … I am still working on it …

  4. You suggest I use registers instead??

  5. Finally … I want to check something regarding GPU time measurements … I have used both the SDK cutil library and CUDA events … I start the timer, run the kernel 1,000,000 times, and stop it after cudaThreadSynchronize(), right? I mean, the timer should be stopped after the synchronization call?

  6. I am quite depressed; after communicating my results to the professor he just said “I assume that you are not going to present these results as your grand finale on Monday!! Right?” :( I just don’t know what to do … at least I had quite a good experience the last few weeks and learned a lot from you people … but as for my grand finale … I think it will be a total embarrassment!

Ah, professors, this sounds familiar…

I think it’s time to take another step back to see the bigger picture. At the moment we are talking about the optimization of a kernel that takes a few milliseconds to run (btw., does that include the time for copying to and from the device?). Certainly that’s not the final goal, as this time is probably less than what the enter key needs to lift back into position after you pressed it.

I assume that you want to run this code over and over again (which would also be compatible with my limited understanding of lattice QCD). The key to performance is to avoid unnecessary data transfers. You definitely don’t want to transfer between host and device around each kernel invocation. But you also want to avoid saving and reloading data to and from device memory between kernel invocations, so you need fewer kernel invocations that each do more work - just along the lines of how you analyzed the end effect of the product of the 16 gamma matrices and applied that at once, instead of multiplying one at a time.


Yes, I suggest using registers instead of shared memory. In the current code you are not using the inter-thread communication that shared memory is meant for, but you still incur its (small) speed penalty. And you should remove the __syncthreads() call - it serves no purpose.
Having said that, shared memory is still a very important means of improving performance. I would recommend looking at how you could fold repeated kernel invocations into one, and where you could use shared memory to avoid going through main memory for inter-thread communication in that case.

I’d also uphold the recommendation to use a for loop inside the kernel to do more work in one thread once it is properly set up. Use it to reduce the thread and/or block dimensions.

The way you time the kernel run seems correct - wait for its execution with a cudaThreadSynchronize(), then stop the timer.
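For reference, a minimal sketch with CUDA events (numBlocks, Md and Nd are just placeholders for your launch configuration and arguments); cudaEventSynchronize() here plays the role of the cudaThreadSynchronize() you would use before stopping a cutil timer:

cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 1000000; i++)
    gamma1_Kernel<<<numBlocks, subVector>>>(Md, Nd);   /* the launch under test */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);            /* wait until all the launches have finished */

cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("average kernel time = %f ms\n", elapsed_ms / 1000000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);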

Hi,

I have been working and working with no magical results!

  1. After using registers, the GPU time dropped slightly, to 0.14 ms.
  2. After reading and re-reading the posts in this thread … I just want to make clear that the timings I present are kernel computation times only, without the memory copies from host to device and back!! That’s the main reason my professor is not convinced by the whole memory-bound argument.
  3. I tried one thing: in the code I fixed permuted_idx to a constant and removed the sign (the results are wrong, of course, but it should only affect speed), and the time dropped to 0.06 ms, so the main overhead is in computing the sign and permuted_idx … but if there is no sign and the memory access is fixed to a single location, why isn’t there a magical speed-up even then …

Finally, this is my professor’s latest answer:

You were able to accelerate codes in the Labs using the same GPU which you are using for your project! If you are claiming that resource limitations of the GPUs on the nodes of the cluster are the reason for your not being able to accelerate your project code, then you need to pinpoint which of the different GPU resources are hurting your efforts.


Thanks again! :(

Yes, this is why I want you to do more work in a loop in each thread, to amortize the setup time. I’m suggesting this for the third time now, so please try it this time, or I’ll side with your professor in the future.

No need to side with my professor; actually my friend already did use the for loop … and we got no speedup from it!!! … I just forgot to mention that! :)

Now we have managed to execute the code on a GTX280 and on a Fermi card, with the help of a friend at a remote campus. These are the speedups, and our analysis:

Below are the results of running the kernels on the AC cluster. Device 0 is a GTX280, and Device 1 is a GTX480 (Fermi). The CPU is an Intel Nehalem 975 @ 3.33 GHz with 8 MB cache, and the machine has 24 GB of memory.

gamma1:


Device 0: “GeForce GTX 280” with Compute 1.3 capability

time on GPU = 0.027268

time on CPU = 0.080985

Test PASSED

SPEED UP = 2.96X

Device 1: “GeForce GTX 480” with Compute 2.0 capability

time on GPU = 0.015788

time on CPU = 0.081051

Test PASSED

SPEED UP: 5.13X

Our analysis:

After analyzing the gamma matrix and contraction codes supplied (the CPU versions), we have observed that these kernels are memory intensive with little floating-point arithmetic involved, therefore no magical speed-ups of high order are expected. To summarize:

  1. In all three cases, due to the large caches, the inputs are cache resident for the CPU, giving it good performance

  2. No extensive floating-point computation is involved, mainly memory accesses, thus GPU execution times are affected by the long latency of global memory accesses

  3. The G210 had slower execution times in comparison with the GTX280 and GTX480 due to its smaller number of SMs; the increase in SMs only helped with load distribution, so more threads are doing work rather than idling while waiting to be scheduled onto the 2 SMs, and is not related to the computation itself

Do you have any further comments … :)

Regards,

That makes sense.

That doesn’t make much sense. You can hide latency with enough work (this is Gustafson’s Law). Your results clearly show that the performance scales with memory bandwidth: the card with the largest memory bandwidth is the fastest, and the one with the smallest memory bandwidth is the slowest.

If you are always memory bandwidth bound, then the way to improve performance is to increase the available memory bandwidth. The GT210 you started with has 8 GB/s of memory bandwidth and about 44 Gflop/s single precision peak from 16 cores; the GTX480 you finished with has about 175 GB/s of memory bandwidth and about 1050 Gflop/s single precision peak from 480 cores. You have roughly 30 times more cores, 26 times more peak flops and 20 times more memory bandwidth. Which of those is the ratio of their performance closest to?
