I’m just getting started with CUDA and I’m currently struggling to find the best way to move my data to the GPU.
The dataset is an array of unsigned chars with dimensions on the order of 1M × 100k. I have to process this dataset in tiles, so for the purpose of my question just consider a large dataset.
For now, take the operation to be performed to be subtracting the mean of each column of the matrix from every element in that column.
For this purpose I need floats.
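Just to make the operation concrete, here is a rough, untuned sketch of what I mean, assuming the tile is already on the device as floats in column-major order (all names are made up):

```cpp
// Untuned sketch: one block per column computes the column mean via a
// shared-memory reduction, then subtracts it. Launch with 256 threads per
// block, e.g. subtractColumnMean<<<cols, 256>>>(d_data, rows, cols);
__global__ void subtractColumnMean(float *data, int rows, int cols)
{
    __shared__ float partial[256];
    int col = blockIdx.x;
    if (col >= cols) return;
    float *column = data + (size_t)col * rows;

    float sum = 0.0f;
    for (int r = threadIdx.x; r < rows; r += blockDim.x)
        sum += column[r];
    partial[threadIdx.x] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {      // tree reduction
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    float mean = partial[0] / rows;

    for (int r = threadIdx.x; r < rows; r += blockDim.x)
        column[r] -= mean;
}
```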
As the dataset is large, my intuition is that keeping an eye on bandwidth is a good idea :D.
So any cool strategies?
Some of my own ideas for approaches I haven’t implemented yet (and I’d love to avoid implementing and benchmarking all of them just to check):
If I load chars from the HDD and convert them to float values on the CPU, I might be able to hide some HDD latency, but I will have to transfer a much larger dataset from host to device.
Here I would have file → char buffer, char buffer → page-locked float array on the host, followed by an async memcpy.
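Roughly what I have in mind for this variant (untested sketch; names like h_charBuffer and tileElems are made up):

```cpp
// Sketch of idea 1: the CPU expands the chars read from disk into a
// page-locked float buffer, then the HtoD copy can run asynchronously
// (cudaMemcpyAsync is only asynchronous from page-locked memory).
void uploadTileAsFloats(const unsigned char *h_charBuffer, size_t tileElems,
                        float *d_tile, cudaStream_t stream)
{
    static float *h_pinned = 0;                   // page-locked staging buffer
    if (!h_pinned)
        cudaHostAlloc((void**)&h_pinned, tileElems * sizeof(float),
                      cudaHostAllocDefault);

    for (size_t i = 0; i < tileElems; ++i)        // char -> float on the CPU
        h_pinned[i] = (float)h_charBuffer[i];

    // caller must make sure the previous copy out of h_pinned has finished
    cudaMemcpyAsync(d_tile, h_pinned, tileElems * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
}
```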
If I copy chars to the device I’d have a very fast copy, but I need to convert the array on the device, tying up a lot of additional device memory. Then again, the conversion itself might be faster (any experiences out there?).
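The device-side conversion for this variant would just be a trivial kernel like this (sketch):

```cpp
// Sketch of idea 2: copy the raw chars (4x less PCIe traffic than floats)
// and expand them on the device; needs a second device buffer for the floats.
__global__ void charToFloat(const unsigned char *in, float *out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (float)in[i];
}
```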
I could use mapped host memory and then do the conversion on the device without additional memory requirements (will this work?). Drawback: I have to feed 4 GPUs with different portions of the big dataset, and mapped memory might make reading from files even worse (frequent context switching to split the data into different mapped arrays).
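The allocation side of this variant would look something like this (sketch; tileBytes and the pointer names are made up):

```cpp
// Sketch of idea 3 (zero-copy): the chars stay in page-locked, mapped host
// memory and a kernel like charToFloat() above reads them through a device
// alias, so every byte crosses PCIe exactly when the kernel touches it.
unsigned char *allocMappedTile(size_t tileBytes, unsigned char **d_charsAlias)
{
    unsigned char *h_chars = 0;
    // must be called once, before anything else creates the CUDA context
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void**)&h_chars, tileBytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)d_charsAlias, h_chars, 0);
    return h_chars;    // fill this buffer from disk on the host side
}
```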
I should use streams to overlap transfers and calculations.
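Something like a two-stream ping-pong loop, I guess (sketch; processChunk, fillFromDisk and the buffers are placeholders for my real code):

```cpp
__global__ void processChunk(const float *in, float *out);   // placeholder kernel
void fillFromDisk(float *dst, size_t elems, int chunk);      // placeholder: read + convert

// Sketch of idea 4: ping-pong between two streams so that the upload of
// chunk i+1 overlaps with the kernel still working on chunk i.
void runAllChunks(float *h_pinned[2], float *d_in[2], float *d_out[2],
                  size_t chunkElems, int numChunks)
{
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int c = 0; c < numChunks; ++c) {
        int s = c & 1;
        cudaStreamSynchronize(streams[s]);        // buffer pair s is free again
        fillFromDisk(h_pinned[s], chunkElems, c);
        cudaMemcpyAsync(d_in[s], h_pinned[s], chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        processChunk<<<(unsigned)((chunkElems + 255) / 256), 256, 0, streams[s]>>>
                    (d_in[s], d_out[s]);
    }
    cudaThreadSynchronize();                      // wait for the last two chunks
}
```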
Using a texture might help with a free char-to-float conversion, but this also requires a memcpy and double the memory in order to have write access afterwards.
So there are a lot of ways to get the data where it needs to be. Getting it done should be easy. But where do YOU expect the best performance?
Platform is 4x GTX 295 + 8 Xeon cores, and I’m using CUDA 3.1. (btw: I’m also interested in comments on optimizing this when using 4x C2070s instead.)
If this is all you have to do (i.e. read from disk, compute means, subtract means, write results back), have you profiled the CPU version? This sounds like the sort of thing which is going to be completely bottlenecked on disk I/O. Disk seek times are milliseconds, and bandwidth is megabytes per second. A Nehalem motherboard can transfer gigabytes per second into the cores, and those cores can compute at a few gigaflops. All a GPU would do is make the bottle much wider, while keeping the same neck.
Ah, OK sorry… I didn’t make this clear enough:
I cannot load all my data at once. So while my kernels are having fun with the data (once it is on the device), I can bring the next chunk up to host memory.
I hope the HDD speed won’t be the overall limiting factor, as I have to perform matrix multiplications on matrices of these sizes later on.
So please assume t_Kernel > t_LoadDataPartfromHDD. Otherwise I’m lost anyway =).
Thanks for your fast replies.
EDIT: One more hint. Even if this were NOT the case, that would just tell me I can extend my analyses, use double precision, … . In other words: my application has room to use up whatever time the HDD speed limitation leaves over.
So in that case I’m still interested in not wasting any time with transfers.
We’d worked that out. The point is that even a CPU can almost certainly keep up with the simple streaming computation you’ve described. The core handling the I/O will probably be busier fiddling around with filesystem data structures than the core performing the averages.
Then give us more details about what else you’d like to do with this data.
The basic strategy would be to split the data into chunks such that you can get three complete ‘units of computational data’ onto the GPU at once - you’d have one stream moving data to the GPU, one computing, and a third draining the results. More than that is impossible to say, given your outstandingly vague description.
Hi YDD,
the reason I didn’t give details on the algorithm is that I felt they don’t matter to my question.
Just for the sake of completeness:
My application is: given a lot of datasets (long vectors of unsigned chars) and another, smaller dataset (also a long vector of unsigned chars):
do some calculation on “another dataset” to derive a heap of new datasets (floats); the expansion can go to 256 or even up to 2^32 vectors if desired,
calculate Pearson’s correlation between all combinations of sets from dataset 1 and dataset 2.
If performance doesn’t suffer, the calculations on the generated heap of datasets can be extended to produce even bigger generated datasets (this can all be done on the GPU).
If the HDD is still the limit, I could even go for double values and more precise math (a nice-to-have gain I can’t afford if it ruins my speed).
I’m only interested in all these correlation coefficients (this output is usually some orders of magnitude smaller than the data fed to the algorithm, but it scales with the size of “another dataset”).
Approach: for each tile do:
Copy dataset (and generate dataset2 once)
calc mean of all sets
subtract mean from all sets and on the fly calculate the variance terms
do a giant matrix multiplication
normalize the result by multiplying each element with the variance terms (fast rsqrt is a blessing here =) ); see the sketch after this list
Copy result back
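To make the normalization step concrete, what I have in mind is roughly this (sketch, my own naming; varA/varB hold the variance terms accumulated while subtracting the means):

```cpp
// Sketch: scale the raw dot products from the big matrix multiplication by
// the inverse standard deviations to get Pearson correlation coefficients.
__global__ void normalizeCorrelations(float *corr, const float *varA,
                                      const float *varB, int nA, int nB)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // index into dataset 1
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // index into dataset 2
    if (i < nA && j < nB)
        corr[i * nB + j] *= rsqrtf(varA[i]) * rsqrtf(varB[j]);
}
```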
Anyway, I plan to go for other algorithms with way more calculations as well. The initial problem for all my tasks will be the same.
So back to my question/problem:
“Find a solution to copy an array of chars to an array of floats as fast as possible.”
or
“Find a solution to copy an array of chars to an array of floats wasting as few resources as possible.”
It’ll be a tradeoff anyway. The question is whether there are ideas I don’t know of that perform more efficiently than a straightforward memcpy, for example using textures or similar funny tricks.
This shouldn’t require a discussion of the algorithm, only the assumption that the kernel will keep the hardware busy for some time.
I think this application should be well suited to CUDA, as it can be written more or less as a giant matrix multiplication where one of the matrices can be calculated on the fly. Using a single CPU core I can keep a PC busy for hours without any problems.
Hope this helps you understand my question somewhat. You brought up one point I haven’t thought about yet:
Merely using streams could already hide most of the host-to-device traffic if the parameters are chosen carefully. In that case I can do all conversions on the CPU and thus reserve the GPU’s computational power for the algorithm. But maybe there is “room for more”?
This is only a partial answer that I’m making up on the hoof (so please don’t shoot me down too harshly if I make an error), but…
I think I would copy the data to the GPU as UCHARs to minimise the host-to-GPU transfer overheads.
I would consider processing the data as a 32-bit int array - i.e. the kernel function would have a parameter int *pData in the function call, even though you are passing in a pointer to the UCHAR data. Each thread loads one 32-bit value, which should maximise coalescing performance. Each thread would then split the value into four BYTEs using a bit mask, then convert those to four floats for processing. This means that each thread processes four values. Each thread could then compute the mean of those four values and write a single mean-of-four value to the output, which maximises output coalescing. You can then repeatedly reduce the (now float) data to compute the final dataset mean.
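In code, the idea is roughly this (untested sketch; n32 is the number of 32-bit words, i.e. the byte count divided by four):

```cpp
// Sketch: treat the UCHAR buffer as 32-bit words so each thread issues one
// full 4-byte load (good for coalescing), unpack with shifts and masks,
// and write one mean-of-four per thread for a later float reduction pass.
__global__ void unpackAndMeanOfFour(const unsigned int *pData,
                                    float *partialMeans, size_t n32)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n32) return;

    unsigned int w = pData[i];
    float b0 = (float)( w        & 0xFF);
    float b1 = (float)((w >>  8) & 0xFF);
    float b2 = (float)((w >> 16) & 0xFF);
    float b3 = (float)((w >> 24) & 0xFF);

    partialMeans[i] = 0.25f * (b0 + b1 + b2 + b3);
}
```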
Even if that doesn’t quite work for you, I have found that accessing UCHARs as packed 32-bit ints for coalescing purposes is generally a good idea. If that fails, texture fetches are generally my back-up plan.
I would use a pipeline kernel which processes one tile and does the host/device transfers with mapped host memory.
1.) Kernel launch 0 would upload tile0 [to device mem]
2.) Kernel launch 1 would upload tile1 and process tile0
3.) Kernel launch 2 would upload tile2 and process tile1 and download tile0 [back to host mem]
4.) Kernel launch 3 would upload tile3 and process tile2 and download tile1
[…]
If the kernel has enough math workload, the host mem accesses will be almost “for free”.
A CPU thread in the background would load the data from disk into host mem.
For the host->device transfer, I would load the data as dwords from mapped host mem, extract the 4 bytes with bitwise operations (>>, &) and store the converted floats in device mem.
Vice versa for transferring the results back to host mem.
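The upload half of that would look roughly like this (sketch; pSrc is the device alias of the mapped host buffer from cudaHostGetDevicePointer, names are illustrative):

```cpp
// Sketch: each thread pulls one dword across PCIe from mapped host memory,
// unpacks the 4 bytes with >> and &, and stores the floats in device memory.
__global__ void uploadTile(const unsigned int *pSrc, float *pDst, size_t n32)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n32) return;

    unsigned int w = pSrc[i];                 // read from mapped host memory
    pDst[4 * i + 0] = (float)( w        & 0xFF);
    pDst[4 * i + 1] = (float)((w >>  8) & 0xFF);
    pDst[4 * i + 2] = (float)((w >> 16) & 0xFF);
    pDst[4 * i + 3] = (float)((w >> 24) & 0xFF);
}
```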
May I suggest you map the data to a texture (e.g. bind a cudaArray to a texture). Then you get automatic conversion from unsigned or signed char to float on a texture read, given that you use cudaReadModeNormalizedFloat. The floats will be in either [0,1] or [-1,+1] depending on signedness.
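A rough sketch with the texture-reference API of CUDA 3.x (host-side setup shown as comments; width/height are made up):

```cpp
// Unsigned char texels come back as floats in [0,1] because of
// cudaReadModeNormalizedFloat; multiply by 255.0f to recover byte values.
texture<unsigned char, 2, cudaReadModeNormalizedFloat> texData;

__global__ void readViaTexture(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 255.0f * tex2D(texData, x + 0.5f, y + 0.5f);
}

// Host side (error checking omitted):
//   cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
//   cudaArray *cuArray;
//   cudaMallocArray(&cuArray, &desc, width, height);
//   cudaMemcpy2DToArray(cuArray, 0, 0, h_chars, width, width, height,
//                       cudaMemcpyHostToDevice);
//   cudaBindTextureToArray(texData, cuArray, desc);
```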
But if your bottleneck is disk I/O, I think you will not gain any speed from moving this computation to the GPU.
This is also a good idea Christian, plus there is another benefit of using textures you didn’t mention: if you use bilinear interpolation and fetch the middle of four points in a 2D array, the interpolation will automagically give you the average of those four points. I use this technique for image down-sampling, and it’s well worth considering.
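For reference, the down-sampling trick looks roughly like this (sketch; the texture’s filterMode must be set to cudaFilterModeLinear on the host):

```cpp
// Texel (i,j) is centred at (i + 0.5, j + 0.5), so a fetch at the point
// equidistant from four texels returns their bilinear blend, i.e. their
// average - a 2x2 box filter in a single fetch.
texture<unsigned char, 2, cudaReadModeNormalizedFloat> texImg;  // filterMode = cudaFilterModeLinear

__global__ void downsample2x2(float *out, int outWidth, int outHeight)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < outWidth && y < outHeight)
        out[y * outWidth + x] =
            255.0f * tex2D(texImg, 2.0f * x + 1.0f, 2.0f * y + 1.0f);
}
```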
However, I believe that your algorithm is so simple that it will be memory-bandwidth limited. If you can get the memory coalescing working well with linear memory, then the byte-packing technique I outlined earlier should be faster than the texture fetches.
One question regarding the texture fetch with bilinear filtering:
If I store datapoints as vectors of length 4, can I get 4 averages of 4 values each in a single fetch this way? And will this come for “free”, or does it keep otherwise usable HW resources busy?
Thanks for your hints.
Markus
You would need to store the data as a 2-D array to really get the benefit of the 2x2 average using the texture fetch, which complicates your access pattern somewhat compared to the linear-memory byte-packed strategy.
In order to perform multiple texture fetches as you suggest, you would need to bind multiple texture references, and they will probably compete for memory bandwidth. So I doubt that using multiple texture fetches will increase throughput, but it is an interesting question.