cuda-x86 documentation

Is there any detailed documentation for CUDA-x86? What features of CUDA are not supported? How are threads and thread blocks mapped to CPU cores and threads? Will a CUDA program that also uses POSIX threads compile correctly under CUDA-x86, or will the pthreads calls interfere with the thread mapping done by CUDA-x86?

I am compiling a major (>1 million lines) CUDA code under CUDA-x86 and find that there apparently is a problem with pthreaded CUDA codes. I removed the pthreading in the CUDA code and was able to compile and run without seg faults, but the results are incorrect. I need to figure out first how to compile and run the code correctly, but after that I will also need to understand how to make it perform well. So I need to understand the CUDA-x86 implementation much better than what is given in the documentation I have found so far (PGIInsider articles, etc.).

Finally, I see that some form of textures is supported (since the simpleTexture example works), but I need to know the limits of this: our CUDA code uses textures heavily, and I am wondering if that is part of the problem.

Hi Todd,

The low-level implementation is not well documented at this time. Probably the best way to understand the approach we took is to look at the file pgi_cuda_x86.h, which is provided in, and used from, the release include directory.

When a kernel can be optimized, we basically run an entire CUDA thread block on a single x86 thread, and work-share the blocks within an OMP parallel region, collapsed over the gridDim dimensions.

When a kernel cannot be optimized, we run OMP_NUM_THREADS x86 threads, which spawn blockDim tasks, and we extend our OMP runtime to allow the tasks to synchronize, in support of CUDA __syncthreads() calls.

The compiler's -Minfo option will inform you which CUDA functions could be optimized and which could not, and hopefully why not.

The main thrust of the work on CUDA-x86 was completed around the CUDA 3.2 timeframe. We have kept the base functionality working as new CUDA versions have been released, but have not added any of the new CUDA features as of this time. It is on our roadmap to update CUDA-x86 once CUDA 5.0 comes out.

The texture support should be solid. We run a suite of tests that compare our results against CUDA's for normalized, wrapped, clamped, and linear-filtered lookups in 1, 2, and 3 dimensions, with 1, 2, and 4 return values. It is possible we are missing some cases, but let us know what you find.

Yes, explicit pthread calls and our OpenMP runtime can conflict and cause failures. This is a known issue. The safest approach, for now, is to port the pthreads portions to OpenMP.

We’d be happy to work with you to get your code ported and make CUDA-x86 a better product.

OK, this is helpful; I will get back to you with further questions. I did get everything working last night (for single precision; not sure why, but I still have trouble with double precision). However, now it tells me:

Warning: Number of emulated threads (1) is less than available cpus (8)

Is there a way to tell it to use more emulated threads? Or is this a consequence of the grid/block/thread structure I used? (I set it to 1 thread per block to make everything work.)


We print that warning in case you are debugging with pgdbg: when there are fewer emulated threads than CPUs, not every x86 thread will hit a breakpoint, and that is usually a frustrating experience.

It is probably a result of your launch configuration. You can dynamically adjust the number of x86 threads that emulate the CUDA threads by explicitly calling omp_set_num_threads, if you want. By default, when we compile CUDA-x86, it is as if you used -mp=allcores, and we default to running a number of x86 threads equal to the number of cores.

What is the expected performance of CUDA-x86? The code in question has also been ported to OpenCL (no changes to the code structure, just adding all the OCL headers, etc.). In that case, the performance on an Intel Xeon was 15x slower than what is observed on an NVIDIA GPU. But with CUDA-x86, we are seeing performance that is almost 1000x slower than on a single GPU (the same Fermi-class GPU as above). What is the range of performance figures you see in real-world codes (e.g., more than 50 CUDA kernels)? Is this kind of performance hit believable, or does it suggest there are some simple optimizations?

We have seen huge performance swings between optimized and non-optimized kernels as defined above. If the kernels are small, there are a large number of CUDA threads in the thread blocks, and the kernel is not optimized, then the x86 tasks basically spend all of their time context switching, and you get lousy performance.

If the kernels are optimized (which is enabled starting with -O2 optimization, though we recommend you use -fast), then you should see performance comparable to a decent OpenMP implementation of the same algorithm on that hardware.