Semi-unrelated to CUDA: using ATLAS in conjunction with a .cu file

I was just wondering if anyone on here has experience using ATLAS (math-atlas.sourceforge.net). I am having trouble installing it :argh: and the support forum on the SourceForge page is not frequently looked at. I am installing it so that I can run some code on my GPU and then, after getting the results, finish the computation with ATLAS, at least until I can develop a GPU version of the code that ATLAS calls, where applicable. If you have any experience installing it on a Windows platform, or might be able to help (it’s probably a problem with my lack of Unix skills), please post here or email me at cava0093@umn.edu . Any help is much appreciated :thumbsup:

Joe

I understand your pain. Compiling ATLAS is not for the faint of heart. IIRC the author recommends prebuilt binaries specific to your system. Since the tuning of the BLAS is done at compile time, you pretty much shouldn’t use your computer during the compile, so as not to skew the timing results. Plus, what is usually desired is a LAPACK library that uses ATLAS for the underlying BLAS. Unfortunately, I’ve never compiled ATLAS for Windows, so I probably can’t help you any more than Google can.

What I can do is warn you not to use cudaMallocHost when allocating host memory that will be passed into ATLAS. Memory allocated with cudaMallocHost is only guaranteed to be valid in the host thread associated with the CUDA context that made the allocation, and since ATLAS can be built multi-threaded, it is quite easy to spend two weeks tracking down intermittent seg faults when using cudaMallocHost, trust me. Good luck.

Can you give me a bit of info/a reference on this? This actually sounds a lot like something I really want to do, and it could be very inconvenient… The only reason I can think of for this to be a problem is virtual memory funkiness, but all threads share the same page table, so I doubt that’s the reason.

So I ended up just downloading Intel MKL (Math Kernel Library), which at least gives me a general baseline for my comparisons. I shouldn’t need to use it after the evaluation period is up, so that isn’t a problem and I am well within the terms of the agreement. It was pretty easy to set up with VS 2005 Express Edition; I just had to grab one file (uuid.lib or something) from the SDK and then it worked flawlessly. I tried getting support on the SourceForge forum for ATLAS, but it seems kind of dead, which is why I just used MKL.

I actually just spent a fair amount of time searching the forums for the reference that talked about memory allocated with cudaMallocHost being specific to the CUDA context associated with the CPU thread that allocated the memory…but without any luck, so I may have misspoken.

The problem I alluded to in my previous post occurred in an application that has multiple GPU worker threads performing essentially the same algorithm in parallel. One of the steps requires a fairly small matrix inversion, so we copy the data down to the host, use some LAPACK routines (built using ATLAS), and then copy the data back to the device. The behavior we observed is that when the host memory was allocated with cudaMallocHost, the application would crash in this routine on maybe 4% or fewer of the calls. Simply changing the allocation to malloc improved stability greatly.
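For anyone hitting the same thing, the per-thread step looks roughly like this. This is only a minimal sketch, not our actual code: the matrix size handling, error checking, and the Fortran-style LAPACK symbols `dgetrf_`/`dgetri_` (with trailing underscore, as most ATLAS/LAPACK builds export them) are assumptions. The key point is the plain malloc for the staging buffer that gets handed to LAPACK:

```c
#include <stdlib.h>
#include <cuda_runtime.h>

/* Fortran LAPACK entry points (names assumed; most builds export
 * them with a trailing underscore and pass-by-pointer arguments). */
extern void dgetrf_(int *m, int *n, double *a, int *lda, int *ipiv, int *info);
extern void dgetri_(int *n, double *a, int *lda, int *ipiv,
                    double *work, int *lwork, int *info);

/* Invert an n x n matrix that lives on the device, in place.
 * Note the plain malloc for the staging buffers: this memory is
 * handed to a (possibly multi-threaded) LAPACK/ATLAS, so we
 * deliberately avoid cudaMallocHost here. */
int invert_on_host(double *d_A, int n)
{
    int info = 0, lwork = n * n;
    double *h_A  = malloc((size_t)n * n * sizeof *h_A);  /* not cudaMallocHost */
    double *work = malloc((size_t)lwork * sizeof *work);
    int *ipiv    = malloc((size_t)n * sizeof *ipiv);

    /* copy the matrix down to the host */
    cudaMemcpy(h_A, d_A, (size_t)n * n * sizeof *h_A, cudaMemcpyDeviceToHost);

    dgetrf_(&n, &n, h_A, &n, ipiv, &info);               /* LU factorization */
    if (info == 0)
        dgetri_(&n, h_A, &n, ipiv, work, &lwork, &info); /* inverse from LU */

    /* copy the result back to the device */
    if (info == 0)
        cudaMemcpy(d_A, h_A, (size_t)n * n * sizeof *h_A,
                   cudaMemcpyHostToDevice);

    free(ipiv); free(work); free(h_A);
    return info;  /* 0 on success, LAPACK info code otherwise */
}
```

Each worker thread runs its own copy of this with its own buffers, so there is no sharing between threads on our side; whatever goes wrong happens inside the library.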

Another twist is that we recently (like yesterday) started having some stability problems again. Our current theory is that our LAPACK library may not be completely thread-safe, and we may be seeing race conditions when multiple threads are all inside LAPACK at once. We have since switched to Intel MKL and haven’t had any issues since (~20,000 calls and counting). So in short, perhaps cudaMallocHost wasn’t the issue at all, and switching to malloc just tickled the race condition inside LAPACK differently. Sound confusing? Needless to say, debugging a crash that occurs intermittently once every few hours has been unpleasant.