In this Discussion
- iourikarpov February 22
- kalman February 9
- mwagner February 10
- Ruslan Yushchenko February 9
Performance Degradation: CUDA 4.1 over 4.0
-
Hi all,
our application must not simply be fast: it has to complete some operations
within fixed deadlines (we analyze a continuous radio signal), so we benchmark
all our algorithms several times per day.
We are seeing a clear degradation after moving from CUDA 4.0 to CUDA 4.1.
I have attached 4 images showing the historical performance data of 4 algorithms
(they are not the only ones affected, but they are the simplest ones for showing you the kernel code).
In all graphs the y-axis shows the measured time in milliseconds.
All kernels are launched in this way:
#define BLOCK_SIZE (1<<9)
dim3 myThreads(BLOCK_SIZE);
dim3 myGrid( (aSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
Kernel<<< myGrid, myThreads>>>(.....);
We run on C2050 cards with ECC off.
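The grid size in the launch code above is a rounded-up division. As a minimal sketch, the same calculation can be written as a small host-side helper. The name `gridSizeFor` is hypothetical and does not appear in the original post; this form also avoids the unsigned overflow that `(aSize + BLOCK_SIZE - 1)` would hit for sizes close to the 32-bit limit.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): number of blocks needed
// to cover aSize elements with aBlockSize threads per block.
// Equivalent to (aSize + aBlockSize - 1) / aBlockSize for ordinary sizes,
// but it cannot overflow when aSize is near the maximum unsigned value.
inline unsigned int gridSizeFor(unsigned int aSize, unsigned int aBlockSize) {
    assert(aBlockSize > 0);
    return aSize / aBlockSize + (aSize % aBlockSize != 0 ? 1u : 0u);
}
```

For the vectors benchmarked here (2^20 elements, BLOCK_SIZE of 512) this gives 2048 blocks.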
============================================================================
Sum of two complex vectors (2^20 complex): add_cc.png
__global__ void
VectorVectorSumKernelCC_O(const float2* aIn1,
                          const float2* aIn2,
                          float2* aOut,
                          const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        aOut[myPos].x = aIn1[myPos].x + aIn2[myPos].x;
        aOut[myPos].y = aIn1[myPos].y + aIn2[myPos].y;
    }
}
============================================================================
Product of two complex vectors (2^20 complex): mul_cc.png
__global__ void
MulKernel_cv_cv_o(const float2* aIn1,
                  const float2* aIn2,
                  float2* aOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        const float myReal1 = aIn1[myPos].x;
        const float myReal2 = aIn2[myPos].x;
        const float myImag1 = aIn1[myPos].y;
        const float myImag2 = aIn2[myPos].y;
        aOut[myPos].x = myReal1 * myReal2 - myImag1 * myImag2;
        aOut[myPos].y = myReal1 * myImag2 + myImag1 * myReal2;
    }
}
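When comparing builds from two compiler versions, it can help to confirm that both still produce the same results. The following is a minimal host-side reference for the out-of-place product, using the same formula as the kernel above. The function name `referenceProduct` and the use of `std::complex` are assumptions for illustration, not part of the original code.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not from the original post):
// element-wise product of two complex vectors, written with the same
// real/imaginary formula as the kernel.
inline std::vector<std::complex<float> > referenceProduct(
        const std::vector<std::complex<float> >& aIn1,
        const std::vector<std::complex<float> >& aIn2) {
    assert(aIn1.size() == aIn2.size());
    std::vector<std::complex<float> > myOut(aIn1.size());
    for (std::size_t i = 0; i < aIn1.size(); ++i) {
        const float myReal1 = aIn1[i].real();
        const float myReal2 = aIn2[i].real();
        const float myImag1 = aIn1[i].imag();
        const float myImag2 = aIn2[i].imag();
        myOut[i] = std::complex<float>(myReal1 * myReal2 - myImag1 * myImag2,
                                       myReal1 * myImag2 + myImag1 * myReal2);
    }
    return myOut;
}
```

The device compiler may fuse the multiplies and adds, so GPU results are best compared against this reference with a small tolerance and not bit for bit.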
============================================================================
Product of two complex vectors (2^20 complex), in place: mul_cc_i.png
__global__ void
MulKernel_cv_cv_i(const float2* aIn,
                  float2* aInOut,
                  const unsigned int aSize) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        const float myTmp = aInOut[myPos].x;
        const float myInR = aIn[myPos].x;
        const float myInI = aIn[myPos].y;
        aInOut[myPos].x = myInR * aInOut[myPos].x - myInI * aInOut[myPos].y;
        aInOut[myPos].y = myInR * aInOut[myPos].y + myInI * myTmp;
    }
}
============================================================================
Tone generation (vector of 2^20 elements): tone.png
__global__ void
ComplexExpKernel(float2* aInOut,
                 const unsigned int aSize,
                 const float aMagnitude,
                 const float aNormalizedFrequency,
                 const float aInverseFrequency,
                 const float aPhase) {
    const unsigned int myPos = blockIdx.x * blockDim.x + threadIdx.x;
    if (myPos < aSize) {
        const float myArgument = aNormalizedFrequency * fmodf((float)myPos, aInverseFrequency) + aPhase;
        aInOut[myPos].x = aMagnitude * __cosf(myArgument);
        aInOut[myPos].y = aMagnitude * __sinf(myArgument);
    }
}
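As a sketch of what the tone kernel above computes per sample, here is a host-side version using the standard math library. The names `ToneSample` and `referenceTone` are hypothetical. The kernel uses the fast intrinsics `__cosf` and `__sinf`, which trade accuracy for speed, so device output should be expected to match this reference only within a tolerance.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type (not from the original post).
struct ToneSample {
    float real;
    float imag;
};

// Hypothetical host-side reference for one sample of the tone:
// the position is wrapped with fmod, as in the kernel, before it is
// scaled by the normalized frequency and offset by the phase.
inline ToneSample referenceTone(unsigned int aPos,
                                float aMagnitude,
                                float aNormalizedFrequency,
                                float aInverseFrequency,
                                float aPhase) {
    const float myArgument =
        aNormalizedFrequency * std::fmod(static_cast<float>(aPos), aInverseFrequency) + aPhase;
    ToneSample mySample;
    mySample.real = aMagnitude * std::cos(myArgument);
    mySample.imag = aMagnitude * std::sin(myArgument);
    return mySample;
}
```

Because of the `fmod`, samples that are a whole multiple of `aInverseFrequency` apart get the same argument and therefore the same value.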
============================================================================
Attached images:
add_cc.png (804 x 302, 10K)
mul_cc.png (802 x 303, 11K)
mul_cc_i.png (799 x 301, 10K)
tone.png (801 x 303, 10K)
-
Hi, I am seeing similar performance degradation. I also do complex arithmetic with float2 as the container.
-
We see a similar effect (lower performance with CUDA 4.1), which might be related to this: according to the new Visual Profiler, our kernels that use complex numbers achieve a global load efficiency of 50% (within about 1%) when compiled with CUDA 4.1. When I compile the same code with the nvcc from CUDA 4.0, we get roughly 100%.
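A load efficiency of about 50% is what one would expect if each thread read the 4-byte `.x` and `.y` fields of a `float2` separately instead of fetching the 8-byte element in one access. That explanation is an assumption and is not confirmed in this thread. One thing that could be tried is to copy the whole element into a local variable first, as in the sketch below. The name `mulElementWhole` is hypothetical, and the host-only stand-in for `float2` exists only so the sketch also builds as plain C++.

```cpp
#ifndef __CUDACC__
// Host-only build: stand-ins so the sketch compiles without the CUDA toolkit.
#include <cassert>
struct float2 { float x, y; };
#define HOST_DEVICE
#else
#define HOST_DEVICE __host__ __device__
#endif

// Hypothetical per-element body for the out-of-place complex product
// (not from the original post). Each input element is copied once as a
// whole float2, which gives the compiler the chance to use a single
// 8-byte access; whether it does so is not guaranteed.
HOST_DEVICE inline void mulElementWhole(const float2* aIn1,
                                        const float2* aIn2,
                                        float2* aOut,
                                        unsigned int aPos) {
    const float2 my1 = aIn1[aPos];
    const float2 my2 = aIn2[aPos];
    float2 myResult;
    myResult.x = my1.x * my2.x - my1.y * my2.y;
    myResult.y = my1.x * my2.y + my1.y * my2.x;
    aOut[aPos] = myResult;  // whole-element store
}
```

Inside a kernel this would be called as `if (myPos < aSize) mulElementWhole(aIn1, aIn2, aOut, myPos);`. Profiling both variants under each toolkit version would show whether the load efficiency actually changes.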
-
I'm seeing a 25% performance drop with 4.1 on a massive MC simulation. I was told to submit a bug report. Register spills to memory increased substantially when compiling with 4.1 compared to 4.0. There are plenty of similar cases mentioned around the forums, many of them simulation related. I speculate that many of these issues will go away with Kepler, since the new compiler is designed to take advantage of the new architecture. So it is probably best to wait and stick with 4.0 in the meantime.