Titan Z lower performance than Tesla C2075

I’ve written an image processing algorithm and tested it on 3 GPUs: a Titan X, a Titan Z, and a Tesla C2075.
The average parallel time was: Titan X - 33.00 ms, Tesla C2075 - 72.00 ms, and Titan Z - 2,500.00 ms. Although I am not using both Titan Z chips, I assume its performance should be similar to a Titan Black’s. I also expected the Titan Z’s performance to fall somewhere between the Titan X and the Tesla C2075, but it didn’t.

The Titan X and the Tesla C2075 are set up in old computers with dual-core processors, running Lubuntu 14.04 with NVIDIA drivers 361.62 and 352.79, respectively. Both use CUDA 7.5. I don’t recall the gcc version, but it is probably some 4.x release.

The Titan Z is set up in a modern Dell Precision 7910 with a Xeon processor, running Xubuntu 16.04, NVIDIA driver 367.27, and CUDA 7.5. I also don’t recall the gcc version, but I suppose it is a 4.x or 5.0 release.

My question is: why did I get such poor performance from the Titan Z? What did I do wrong? I compiled and ran the same code and processed the same image on all machines. Is there any issue here? Am I overvaluing the Titan Z and expecting too much from it?

I am measuring time inside the code, and only around the kernel call. To measure it, I surround the kernel call with a macro like this:

#define TIME(y,x) {                                 \
        cudaEvent_t start, stop;                    \
        float time = 0;                             \
        cudaEventCreate(&start);                    \
        cudaEventCreate(&stop);                     \
        cudaEventRecord(start);                     \
        x;                                          \
        cudaEventRecord(stop);                      \
        cudaEventSynchronize(stop);                 \
        cudaEventElapsedTime(&time, start, stop);   \
        printf("Time %s: %f\n", y, time);           \
        cudaEventDestroy(start);                    \
        cudaEventDestroy(stop);                     \
    }

It may be that you are benchmarking incorrectly. You should remove the CUDA start-up time from your measurements.

Thanks. What do you mean by removing CUDA start-up time?

After starting up the computer, I only compiled the code and ran it. I am measuring time the same way on all machines, with each GPU.

Typical benchmarking practice is to run your workload multiple times in the same session (i.e. in the same application run) and discard the timing from the first run. This gets rid of any start-up overhead.
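A minimal sketch of that practice, assuming a hypothetical placeholder kernel `myKernel` in place of the real one (requires a CUDA toolchain to build):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }  // hypothetical stand-in for the real image kernel

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int runs = 10;
    for (int i = 0; i < runs; ++i) {
        cudaEventRecord(start);
        myKernel<<<1, 1>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0;
        cudaEventElapsedTime(&ms, start, stop);
        // The first timed run typically absorbs one-time start-up costs
        // (context creation, module load), so discard it from averages.
        printf("run %d: %f ms%s\n", i, ms, i == 0 ? " (warm-up, discard)" : "");
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```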

CUDA start up overhead can vary significantly from machine to machine, based on system memory size and possibly other factors.

Maybe you’ve compiled the code differently. Or maybe your code is producing errors that you haven’t checked for. I’m sure there are other possibilities as well.

Which performance should I expect from a Titan Z using only one chip? Somewhere between the Tesla C2075 and the Titan X, or am I wrong?

Well, I’m going to try running my kernel call 10 times in a row to avoid the problem you mentioned. However, I don’t know why the Titan Z is slow even on the first run; I see no such delay on the other machines. Is there any way to load the CUDA libraries into memory beforehand and avoid this first-run delay?
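One common idiom for this (a hedged sketch, not something from this thread) is to force the lazy CUDA context initialization early with a harmless runtime call such as cudaFree(0), so the one-time start-up cost lands before any timed region:

```cuda
#include <cuda_runtime.h>

int main() {
    // A do-nothing runtime call: triggers lazy CUDA context creation now,
    // so later timed kernel launches don't pay the start-up cost.
    cudaFree(0);

    // ... timed kernel launches go here ...
    return 0;
}
```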

By the way, I compiled the code with the -O3 option, and on the Titan Z machine I got some errors about string functions. I googled it, and some people said to use -D_FORCE_INLINES. I did, and the code compiled without errors. It seems to be a compilation issue with CUDA 7.5 on Ubuntu 16.04.

CUDA 7.5 isn’t officially supported on Ubuntu 16.04.

Yes, I know. Do you think that could be the reason?
I’ve tried CUDA 8.0 and got the same overall performance.

I mention it because I think it’s connected to your statement about -D_FORCE_INLINES.

You seem to want people to explain the behavior of code you haven’t shown. I don’t think I’ll be able to help much with that. If I had to guess why code takes around 30 ms on one GPU and around 2,500 ms on another, the only thing I can come up with is CUDA start-up time being different on the two platforms. And I’ve already described a benchmarking practice that should remove that effect from your timing.

Problem solved. It was a silly mistake on my part, but I am leaving my experience here to help anyone in a similar situation.

As I am developing with NVIDIA Nsight, on the Titan Z machine I had been running the code compiled with the Debug profile. After changing it to the Release profile and recompiling, the Titan Z processing time dropped from 2,500.00 ms to 60.00 ms.

As I supposed, its time using only one chip falls somewhere between the Tesla C2075 and the Titan X.
It was nothing related to the driver, CUDA SDK, Linux version, kernel version, gcc version, host machine, etc. It was just a silly mistake I’d made.
I hope this information can be helpful to someone.

So, thanks txbob for your support.