performance optimisation, general questions

Hello,

I just finished the first working version of a library I wrote,
but I'm still a bit disappointed with the performance. Basically,
it runs faster on the CPU, even though the problem splits very well
into parallel work packages (14112 in one example).

In the library I do roughly the following:

  • host reads in a file and extracts some administrative data from it
  • host allocates some memory on the device (input and output buffers)
  • host transfers data to the device (in one example: 86 MB)
  • host starts the kernels on the device
  • the kernels read the admin data, the input buffers and fill the output buffer
  • host copies data from device back to host
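
To make the steps concrete, here is a stripped-down sketch of what the library does; the buffer names and the kernel body are placeholders for illustration, not my real code:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// placeholder kernel: each thread handles one work package
__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // stands in for the real work
}

int main(void)
{
    const int n = 14112;               // number of work packages
    size_t bytes = n * sizeof(float);

    float *h_in  = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);          // device input buffer
    cudaMalloc(&d_out, bytes);         // device output buffer

    // host -> device transfer of the input data
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    process<<<blocks, threads>>>(d_in, d_out, n);

    // device -> host transfer of the results
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);  cudaFree(d_out);
    free(h_in);      free(h_out);
    return 0;
}
```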

I have some questions; it would be great if anybody had some hints:

  • Can I allocate constant memory at runtime? I hope constant memory speeds
    up the whole process, but I only know the required memory size at runtime.
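
    What I have in mind is something like the following. As far as I understand,
    a __constant__ array must have a compile-time size (with a 64 KB hardware
    limit), so the best I can see is declaring a maximum and copying the
    runtime-sized admin data into part of it:

    ```cuda
    #include <cuda_runtime.h>

    // constant memory cannot be allocated dynamically; declare an
    // upper bound (MAX_ADMIN is a made-up name) and use part of it
    #define MAX_ADMIN 4096
    __constant__ int c_admin[MAX_ADMIN];

    __global__ void use_admin(float *out, int n_admin)
    {
        // when all threads of a warp read the same constant element,
        // the value is broadcast, which is what makes constant memory fast
        out[threadIdx.x] = (float)c_admin[threadIdx.x % n_admin];
    }

    void upload_admin(const int *host_admin, int n_admin)
    {
        // copy only the runtime-determined amount (must be <= MAX_ADMIN)
        cudaMemcpyToSymbol(c_admin, host_admin, n_admin * sizeof(int));
    }
    ```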

  • The kernels won’t conflict when filling the output buffer; this is guaranteed
    by the algorithm. Can I switch off any synchronisation to speed up the writes?
    Is there any synchronisation at all when writing to output buffers?
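
    For clarity, this is the write pattern I mean; every thread writes to its
    own slot, so no two threads ever touch the same element:

    ```cuda
    __global__ void fill_output(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = (float)i;   // placeholder for the result of package i
    }
    ```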

  • Since I wrote a library, there could be several users at the same time. Can I
    use global symbols in my library at all? When using constant memory, I’d need
    to use global symbols. Will the library still work with several users?

  • When calling the kernel, I try to set the grid size high and the thread count
    low: kernel<<<N, 1>>>(parameters…), but I noticed that the code runs faster
    when calling with 256 threads: kernel<<<N, 256>>>(parameters…). I don’t
    understand this; I thought that parallel execution is much more independent
    with more blocks, not more threads.
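
    These are the two launch configurations I compared (dummy kernel, timing
    code omitted); my understanding is that with 1 thread per block, each warp
    carries only 1 active thread out of 32:

    ```cuda
    __global__ void kernel(float *out) { /* work omitted */ }

    // one thread per block: 31 of 32 lanes in every warp sit idle
    kernel<<<14112, 1>>>(d_out);

    // 256 threads per block (8 full warps): the scheduler can hide
    // memory latency by switching between warps within a block
    kernel<<<14112, 256>>>(d_out);
    ```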

  • My CPU is an Intel Core 2 Quad at 2.6 GHz, my GPU is a GTX260. I can
    parallelise the work into 14112 smaller packages on the GPU. Shouldn’t the
    GPU be much faster than the CPU?

Thanks for any hints,
Torsten.