2nd GPU Half Speed BUG?

Hi,

I’m working on a project where I am using the CUFFT_c2c_radix2 source and testing its performance against various batch sizes, with a view to progressing to multi-GPU.

However, I have discovered that the same code runs at ~2.8 us per 1D FFT on device 0 but ~5 us per 1D FFT on device 1, for an 8192-batch 1D FFT.
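
For reference, here is roughly the shape of the timing code I’m running (a minimal sketch, not the actual CUFFT_c2c_radix2 source; the FFT length, the in-place transform and the single timed pass are just illustrative):

#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n      = 1024;  /* FFT length - illustrative only */
    const int batch  = 8192;  /* batch size from the timings above */
    const int device = 1;     /* flip between 0 and 1 to compare the two GPUs */

    cudaSetDevice(device);

    size_t bytes = sizeof(cufftComplex) * (size_t)n * batch;
    cufftComplex *h_data = (cufftComplex *)malloc(bytes);
    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, bytes);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Time copy in + batched transform + copy out, as in the figures above */
    cudaEventRecord(start, 0);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("device %d: %.3f us per FFT\n", device, ms * 1000.0f / batch);

    cufftDestroy(plan);
    cudaFree(d_data);
    free(h_data);
    return 0;
}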

Why is this, and how do I correct it?

Cheers

I’ve seen a similar problem recently. I solved it by explicitly specifying CU_CTX_SCHED_YIELD in cuCtxCreate(), but I don’t know how this translates to CUFFT.
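
For what it’s worth, this is the kind of thing I mean, a minimal Driver API sketch (the device index is just an example):

#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice  dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 1);  /* second GPU - the index is just an example */

    /* Create the context with the yield scheduling flag instead of the default */
    if (cuCtxCreate(&ctx, CU_CTX_SCHED_YIELD, dev) != CUDA_SUCCESS) {
        printf("cuCtxCreate failed\n");
        return 1;
    }

    /* ... do the FFT work in this context ... */

    cuCtxDestroy(ctx);
    return 0;
}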

Are device 0 and device 1 the same type? You haven’t said anything about them.

Sorry, yes. To qualify, here are my specs:

2 x GTX260
Windows XP
Driver: 178.08

I’m currently using the runtime API, although I have found the same performance degradation with the driver API as well.

Cheers

Yes, both devices are GTX280s; Vista x64, driver 178.13.

Edit: if the Runtime API doesn’t allow you to specify context creation flags then you probably need to patch cudart.dll… or switch to the Driver API. Let me know if the problem is resolved by passing CU_CTX_SCHED_YIELD and whether you need help with patching cudart.dll.

I upgraded to the latest driver, 178.24, which had no effect.

AndreiB - I’ve tried using CU_CTX_SCHED_YIELD in the Driver API code; however, this did not work.

How do I patch cudart.dll?

If possible I’d like to stay on the runtime API for now…

Cheers

If CU_CTX_SCHED_YIELD didn’t work for you then the problem is somewhere else. No need to change cudart.dll then.

The latest drivers are 177.80 or 178.24, depending on OS. Yes, they are not official CUDA drivers, but it can still be a good idea to try them.

Cheers

Are you including PCIe transfer times in these performance results?

Yes, these times include the host-to-device and device-to-host memory copies.

Cheers

Then quite possibly one of your cards is running at x16 and the other at x8. Look at bandwidthTest --memory=pinned --device=N for each device.
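
If you’d rather check this from your own code than from the SDK sample, something along these lines should show the same thing: a pinned host buffer and a timed host-to-device copy on each device (the 32 MB transfer size simply mirrors what bandwidthTest reports):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;  /* 32 MB, the transfer size bandwidthTest prints */

    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);

        void *h_buf, *d_buf;
        cudaMallocHost(&h_buf, bytes);  /* pinned host memory */
        cudaMalloc(&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        /* Time a single host-to-device copy of the pinned buffer */
        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("device %d: host-to-device %.1f MB/s\n",
               dev, (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
    }
    return 0;
}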

This would appear to be the case:

Running on…
device 0: GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                3157.5

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                3283.2

Running on…
device 1: GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1733.7

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1675.1

I have an Inno3D 680i SLI motherboard, which is meant to feature two full-bandwidth 16-lane PCIe slots, so I must have missed a setting somewhere. The cards are seated in the correct slots.

Any ideas?