very large data set (big matrix)

Hi there

In our research we work with a 1,000,000 x 1,000,000 matrix, and we have a machine with a couple of CUDA-capable cards. We are just starting with CUDA, but we would like to hear experiences and/or ideas for data sets that big.

We have already hit basic problems: our card has 512MB of memory, we try to allocate an array of floats of around 300MB and it fails, and sometimes we have to go down to 100MB. So, is there any way to know the available memory on the GPU?
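(For what it's worth, free device memory can be queried at runtime. A minimal sketch using the runtime API's cudaMemGetInfo; older toolkits expose the same information as cuMemGetInfo in the driver API:)

```
/* Minimal sketch: query free/total device memory before allocating. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    printf("free: %lu MB, total: %lu MB\n",
           (unsigned long)(free_b >> 20), (unsigned long)(total_b >> 20));
    return 0;
}
```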

also

reading the documentation you find that

gridDim: 65535 x 65535
blockDim: 512 x 512 x 64

so I could theoretically get 512 x 512 x 64 x 65535 x 65535 = 7.2055395 × 10^16 matrix positions, which times 4 bytes (sizeof(float)) = 2.8822158 × 10^17 bytes ≈ 2.7 × 10^8 GB

and that's a lot of memory. How could you "max out" the GPU (thread-wise) without hitting the memory limit?

I know I can only use 512 threads per block in total, but still.

So how could I tackle this? I know I can take one row at a time (actually two, one per GPU), but that would be rather slow.
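(One common pattern that decouples thread count from data size is to have a fixed-size grid where each thread loops over many elements; a minimal sketch with illustrative names, not anyone's actual code:)

```
/* Minimal sketch: a fixed grid walks an arbitrarily large device array
   by striding, so you never need one thread per matrix element. */
__global__ void scale_all(float *data, size_t n, float s)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        data[i] *= s;
}

/* launch with a modest, fixed shape:
   scale_all<<<128, 256>>>(d_data, n, 2.0f); */
```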

Any other tips, techniques, or papers are appreciated.

Hmm, get a 2GB consumer card or a Tesla (4 GB) for your large data sets. Consider using existing maths libraries like CUBLAS and other sparse matrix solvers.

Use a separate card for display, then more (unfragmented) memory is available for CUDA.

Depending on the complexity of your code, you may not be able to run 512 threads at all - register pressure will be a real problem.
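(To make the CUBLAS suggestion concrete: a minimal sketch of a single-precision multiply through the legacy CUBLAS API. The 4x4 size is a toy placeholder; a 10^6 x 10^6 matrix would still have to be tiled into device-sized pieces:)

```
/* Minimal sketch: C = A * B in single precision via legacy CUBLAS. */
#include <stdio.h>
#include <cublas.h>

int main(void)
{
    const int n = 4;
    float A[16], B[16], C[16];
    for (int i = 0; i < 16; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    cublasInit();

    float *dA, *dB, *dC;
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);

    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

    /* C = 1.0 * A * B + 0.0 * C (column-major) */
    cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);
    printf("C[0] = %.1f (expect 8.0)\n", C[0]);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    return 0;
}
```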

Hi

our matrix is not sparse, and I will look into CUBLAS, but what we are doing right now is checking whether the GPU could be an alternative for us; so far it seems like it's not.

About the Teslas: 4GB of memory is better, but still far from good (for us).

Also a new question: we saw the S1070, and we saw it uses 2 PCIe slots. The question is: when the host machine runs out of PCIe slots, what has to be done? Add one more host and use something like MPI?

thanks in advance

Hi - could you please elaborate on why you don't think the GPU is good for you?

As for the large datasets - I use huge datasets (for seismic processing, they can get to tens of GBs). I run the code on GTX280, GTX295, C1060 and S1070 (the GTXs have ~1GB and the Teslas have 4GB). The idea is to break your data/algorithm/code into chunks - that is, allocate as much data as your device allows, use it, save the results on the CPU/GPU and then continue to process the other chunks. One thing to pay attention to here is that you'd want to copy each chunk only once to the GPU (because of PCI overhead): do as much calculation as possible / all calculations on that chunk, discard it and bring the next chunk to the GPU.

This solution works great for me - for all types of cards.
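(A minimal sketch of that chunking loop, assuming a simple element-wise kernel; CHUNK_ELEMS and process_chunk are illustrative placeholders, not eyal's actual code:)

```
/* Minimal sketch of the chunking idea: allocate one device buffer of
   whatever size the card allows, then stream the data through it,
   copying each chunk to the GPU exactly once. */
#include <cuda_runtime.h>

#define CHUNK_ELEMS (16 << 20)            /* 16M floats = 64 MB per chunk */

__global__ void process_chunk(float *d, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;              /* stand-in for the real kernel */
}

void run(float *host, size_t total_elems)
{
    float *d_buf;
    cudaMalloc((void **)&d_buf, CHUNK_ELEMS * sizeof(float));

    for (size_t off = 0; off < total_elems; off += CHUNK_ELEMS) {
        size_t n = total_elems - off;
        if (n > CHUNK_ELEMS) n = CHUNK_ELEMS;

        /* copy chunk in once, do all the work on it, copy result out */
        cudaMemcpy(d_buf, host + off, n * sizeof(float),
                   cudaMemcpyHostToDevice);
        process_chunk<<<(unsigned)((n + 511) / 512), 512>>>(d_buf, n);
        cudaMemcpy(host + off, d_buf, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(d_buf);
}
```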

As for the S1070: yes, you need 2 PCIe slots. You can use the HP DL360, which has two x16 slots, meaning each GPU (out of the 4 in the S1070) will get x4 bandwidth. You can use two HP DL160 host machines and connect each server to 2 GPUs (no need for MPI or anything like that - just have each host use only 2 GPUs). SuperMicro has a board with 4 x16 PCIe slots, so you can connect that to an S1070 and get full x16 per GPU card. There is also a DHIC card from nVidia that halves the bandwidth, so you can put 2 DHICs in an HP DL360 and connect 2 S1070s to it… a lot of fun :)

You'll first need to understand how much you pay for the PCI overhead - if that's not an issue, or something you can live with, you could use the DHIC or use x4/x8 per card. Most likely you'll have to experiment with it a bit…

hope that helps

eyal

Hi there and thanks

our data is also seismic, so I think your ideas might fit into our workflow

Well… our concern with GPUs is memory. Right now we split the data into ~16GB chunks and send each chunk to a compute node, so each node processes 16GB of data. Even that way it takes a while to solve (each compute node is a very nice machine). So with GPUs we would have to split that into, let's say, ~2-3GB chunks (the boss is willing to buy some Teslas), and as you mention I could send those to 8 GPUs (2 S1070s).

Sorry, but I'm still confused about the S1070s. Let's say I want 32 GPUs (I can deal with the PCI overhead) without messing with MPI (we want to avoid it as much as we can). You are saying I would need a setup like this:

HOSTa–(2)s1070(2)–HOSTb–(2)s1070(2)–HOST–(2)s1070(2)–HOST–(2)s1070(2)–HOST–(2)s1070(2)–HOSTn…

It does not make sense in my head, but I'm new to this, so…

How does HOSTa tell a GPU connected to HOSTn to do something?

thanks in advance

That's why seismic is so GPU-friendly. You can break the data set (at least in my experience) into chunks as small as you'd like and process it even with a 1GB GTX. You broke the data into 16GB chunks for the CPU - why can't you break it into 4GB or 1GB chunks for the GPU? You probably can.

Moreover, since the computation is indeed long (as it is in my case), you probably won't be disk/network or even PCI bound - only computation bound. This is good news, since computation is what the GPU is about. It also means that if you measure your PCI overhead, you'll probably see that it accounts for 5-10% of overall computation time. Therefore you should be able to play a bit with the configuration and let the PCI overhead rise to, let's say, 20% in order to connect 2 S1070s to 1 host server instead of 1 S1070 - thus increasing your density and reducing the amount of money you need to spend on host machines.
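(One way to measure that PCI share is with CUDA events around the copy and the kernel; a minimal sketch with a placeholder buffer size and kernel:)

```
/* Minimal sketch: compare PCIe copy time against kernel time. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void my_kernel(float *d, size_t n)   /* stand-in workload */
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        d[i] += 1.0f;
}

int main(void)
{
    const size_t n = 32 << 20;                  /* 32M floats = 128 MB */
    float *h = (float *)calloc(n, sizeof(float));
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0, 0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    my_kernel<<<128, 256>>>(d, n);
    cudaEventRecord(t2, 0);
    cudaEventSynchronize(t2);

    float copy_ms = 0, kern_ms = 0;
    cudaEventElapsedTime(&copy_ms, t0, t1);
    cudaEventElapsedTime(&kern_ms, t1, t2);
    printf("copy %.1f ms, kernel %.1f ms, PCI share %.0f%%\n",
           copy_ms, kern_ms, 100.0f * copy_ms / (copy_ms + kern_ms));

    cudaFree(d); free(h);
    return 0;
}
```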

Your chart is wrong, as far as I understand what you tried to show :)

The simple scenario would be to connect one host (a quad-core, let's say) to one S1070, so to get to 32 GPUs your system would look like this (using the HP DL380 for example):

[Master node to distribute jobs (as you're probably doing now with CPUs)]
     |            |                        |
[host-#1]    [host-#2]   … (up to #8) … [host-#8]
     |            |                        |
[S1070-#1]   [S1070-#2]   …    …    …  [S1070-#8]

Now let's say you can connect two S1070s to one host (with 8 CPU cores); your system would look like this:

[Master node to distribute jobs (as you're probably doing now with CPUs)]
     |            |            |            |
[host-#1]    [host-#2]    [host-#3]    [host-#4]
     |            |            |            |
[S1070-#1]   [S1070-#3]   [S1070-#5]   [S1070-#7]
[S1070-#2]   [S1070-#4]   [S1070-#6]   [S1070-#8]

As you can see, in both cases you need 8 S1070s (to get a total of 32 GPUs); the difference is in the number of host servers (8 in the first and 4 in the second).

I guess the reasons not to go with option 2 are that it's more risky - you need to be sure the PCI penalty won't kill your performance - and that you need to find a good host that can drive 8 GPUs.

hope that helped a bit more :)

BTW - I started with two workstations with 3 GTX295s in each (they are independent and have no connection between them), and only for production, after I had enough tests and numbers, did I start to look at the Teslas.

eyal

Hi there

Yup, you have helped us a lot, but I'm afraid I'll have to bother you again :)

More splitting = More overhead, that’s my fear

So our confusion is this (please bear in mind that we are new to CUDA in general): how do you use all the GPUs from the same program? That is:

How can a piece of code running on [host-#1] tell a GPU in [S1070-#8] (on [host-#8]) to load a chunk of the matrix?

In MPI this makes sense to me, but not here.

I'm guessing that if I do a cudaGetDeviceCount() on [host-#1], I'll get 7 - or will I get 31?

If the answer is 7, then you need something like MPI to send a message to [host-#2…n] and tell them to use their GPUs.

that’s our big question now

again thank you very much

With a single S1070 per host, you will get 4. With two S1070s per host you will get 8. Each S1070 contains 4 Tesla processors, and each is completely local to the host it is connected to. If you want to dynamically distribute a single large task across Teslas on different hosts, then you will need some sort of interprocess communication mechanism like MPI. If the task is really "embarrassingly parallel", then you might be able to pre-divide the task into sub-tasks and use middleware like Sun Grid Engine or some other grid scheduling software to schedule and run the sub-tasks as standalone processes on each host.

That actually shouldn't be an issue. If you manage to break your 16GB in half, you'll be able to break it into 10, 20, … smaller pieces.

Let me describe what I do. I have a TCP master which listens and waits for job requests from the slaves. Each slave (a distinct host server) can use 1…8 GPU cards. This is how it can be 1…8:

1 GPU per host: a single GTX280 or a single C1060

2 GPUs per host: 2 GTX280s, 2 C1060s, 1 GTX295 (which is dual), or one half of an S1070

3 GPUs per host: 3 GTX280s, 3 C1060s, or 1 GTX280 or 1 C1060 plus one GTX295

4 GPUs per host: 1 S1070, 4 GTX280s, 4 C1060s, or 2 GTX295s

8 GPUs per host: 2 S1070s or maybe 4 GTX295s (Colfax just released an 8x C1060 case)

When you run deviceQuery you see only the GPUs PHYSICALLY connected to your host machine. Each GPU is independent (both within the context of the host and certainly within the context of the 4 or 8 or whatever number of host nodes you have).

Let's get back to the master/slave paradigm. Each slave opens up X CPU threads (pthreads on Linux, for example), where X is the number of physical GPUs that slave/host has. It then requests the master node to send the data needed for processing (for seismic, each GPU might for example process a different trace, a part of a trace, or one velocity for one trace…). It then copies the data over PCI to its dedicated GPU (by using cudaSetDevice you attach a different GPU to a different CPU thread - google for GPUWorker by Mr. Anderson here in the newsgroups and you can have a working implementation of this).

Then each slave/host thread starts a kernel to process all the data. Once the GPU is done, copy the results back to the CPU and send them to the master. Get a new job :)
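(A minimal sketch of that one-CPU-thread-per-GPU pattern with pthreads and cudaSetDevice; the worker body is a placeholder for the slave's job loop - see GPUWorker for a full implementation:)

```
/* Minimal sketch: one CPU thread per local GPU, each bound with
   cudaSetDevice() before any other CUDA call in that thread. */
#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>

static void *worker(void *arg)
{
    int dev = (int)(long)arg;
    cudaSetDevice(dev);              /* attach this thread to GPU 'dev' */
    /* ... request job from master, copy chunk in, run kernel,
           copy result out, send result to master, repeat ... */
    printf("thread for GPU %d done\n", dev);
    return NULL;
}

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);          /* only the GPUs in THIS host */

    pthread_t tid[16];
    for (int i = 0; i < n && i < 16; ++i)
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < n && i < 16; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}
```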

How do you distribute your work now to the CPU nodes? It should be done in exactly the same way: much like CPU node 1 shouldn't communicate with CPU node 8 in your current CPU-based cluster, GPU #1 shouldn't talk to GPU #8, nor should the CPU nodes communicate with each other.

I guess the main concept is that each GPU should be atomic and independent in its work: it gets its portion of the work that should be done, processes it and sends it back to the calling CPU thread…

feel free to ask anything else :)

eyal

Interesting discussion.

I have to say I don't have experience dealing with such huge problems, hence the [theoretical] interest.

As we know, current CUDA cards are best utilized for single (as opposed to double) precision arithmetic. I anticipate that the larger the problem, the more of an issue the loss of accuracy due to single precision becomes. Do you run into issues of this nature in your codes at all?

Thanks!

Well, for me (and as far as I understand, in seismic in general) single is enough. Moreover, it's probably not that much of an issue if one sample in time (out of 2500-4000 samples) in one small portion of the overall result is 102346.2883 or 102345.99202 :)
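(A quick way to see the single-precision granularity in question: at magnitudes around 10^5, a float's resolution is roughly 0.008, so the first example value above cannot even be stored exactly:)

```
/* Minimal sketch: single-precision rounding at seismic-like magnitudes. */
#include <stdio.h>

int main(void)
{
    float a = 102346.2883f;
    printf("%.4f\n", a);   /* prints 102346.2891: the nearest float (ulp ~0.0078) */
    return 0;
}
```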

Plus - you can't really know if the seismic result is correct till you drill there ;) it's a bit flexible ;)

eyal