I just finished the first working version of a library I wrote, but I'm a
bit disappointed with the performance. Basically, it runs faster on the CPU,
even though the problem can be split into parallel work packages very well
(in an example: 14112 packages).
In the library I roughly do the following:
- host reads in a file and extracts some administrative data from it
- host allocates some memory on the device (input and output buffers)
- host transfers data to the device (in an example: 86 MB)
- host starts the kernels on the device
- the kernels read the admin data, the input buffers and fill the output buffer
- host copies data from device back to host
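The steps above roughly correspond to a host-side sequence like this. It's only a minimal sketch, not the library's real code: the kernel, the buffer names and the launch configuration are placeholders, and I use a grid-stride loop so the block count stays within the hardware's grid limits:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Placeholder kernel: reads the input buffer, fills the output buffer.
__global__ void processKernel(const char *in, char *out, size_t n) {
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        out[i] = in[i];                       // real work would go here
}

int main(void) {
    size_t n = 86 * 1024 * 1024;              // ~86 MB, as in the example
    char *h_in  = (char *)malloc(n);
    char *h_out = (char *)malloc(n);

    char *d_in, *d_out;
    cudaMalloc(&d_in, n);                     // allocate device buffers
    cudaMalloc(&d_out, n);

    cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice);    // host -> device

    processKernel<<<1024, 256>>>(d_in, d_out, n);         // start kernels

    cudaMemcpy(h_out, d_out, n, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}
```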
I have some questions; it would be great if anybody had some hints:
Can I allocate constant memory at runtime? I hope constant memory speeds
up the whole process, but I only find out the required memory size at runtime.
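As far as I understand, a `__constant__` array must have a compile-time size, so the closest I can get is reserving an upper bound at compile time and copying the runtime-sized admin data into it with `cudaMemcpyToSymbol`. A sketch (the bound and the names are my own assumptions; total constant memory per device is 64 KB):

```cuda
#include <cuda_runtime.h>

// Constant memory cannot be allocated at runtime: reserve a fixed
// upper bound at compile time instead (64 KB total is available).
#define MAX_ADMIN_BYTES 8192
__constant__ char adminData[MAX_ADMIN_BYTES];

// Copy the runtime-determined amount of admin data into the reserved
// constant array; bytes must not exceed MAX_ADMIN_BYTES.
void uploadAdminData(const void *hostAdmin, size_t bytes) {
    cudaMemcpyToSymbol(adminData, hostAdmin, bytes);
}
```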
The kernels won’t conflict when filling the output buffer; this is guaranteed
by the algorithm. Can I switch off any synchronisation to speed up the writes?
Is there even any synchronisation when writing to output buffers?
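To make the question concrete, this is the kind of write pattern I mean (a sketch with placeholder names): each thread writes only its own output element, so no two threads ever touch the same address:

```cuda
// Each thread owns exactly one output slot, so the writes are disjoint
// by construction: no atomics or explicit synchronisation involved.
__global__ void fillOutput(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder for the real computation
}
```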
As I wrote a library, there could be several users at the same time. Can I use
global symbols in my library at all? When using constant memory, I’d need
to use global symbols. Will the library still work with several users then?
When calling the kernel, I try to set the grid size high and the thread count low:
kernel<<<N, 1>>>(parameters…). But I noticed that the code runs faster when
calling with 256 threads: kernel<<<N, 256>>>(parameters…). I don’t understand
this; I thought that parallel execution would be much more independent with more
blocks, not more threads.
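For completeness, here is one way to keep the total work constant (the 14112 packages) while moving it from one-thread blocks to 256-thread blocks; this is just a sketch of the launch arithmetic, with a bounds check assumed inside the kernel:

```cuda
// Same total work (14112 packages), two ways to split it.
const int total = 14112;

// variant 1: one thread per block -- 14112 blocks of 1 thread
// kernel<<<total, 1>>>(parameters);

// variant 2: 256 threads per block -- round the block count up
int threads = 256;
int blocks  = (total + threads - 1) / threads;   // = 56 blocks
// kernel<<<blocks, threads>>>(parameters);
// (inside the kernel, guard with: if (i < total) ...)
```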
My CPU is an Intel Core 2 Quad at 2.6 GHz, my GPU is a GTX 260. I can parallelise
the work into 14112 smaller packages on the GPU. Shouldn’t the GPU be much faster
than the CPU?
Thanks for any hints,