Basic Questions: Implementing Parallel Computing

I’m very new to parallel computing. I’m starting off by trying to write a Jacobi method code for CUDA. While writing it, I have some very basic questions whose answers will help me program.

How does a processor relate to a thread, block, and grid?

What is pinned memory, pageable memory, and page-locked memory? Advantages?

With respect to my Jacobi method, how do I assign, for example, an equation to a thread? Or how do I know how many lines of my matrix will be computed by each thread or block?

How will I get my threads to wait until all are computed before going to the next iteration of Jacobi?

I was looking through the “basics of CUDA” slides and I don’t understand the use of Memset. What is it for?
Malloc - set aside memory on the device for my matrix. ** If it is a matrix, how does it know how much to allocate? Technically it only knows that the first element is a float or int.
Memcpy - copy matrix to device or to host or to other device

Thanks. These questions seem basic but that is because I have no experience with parallel computing.

  • James

It’s all in the CUDA programming guide.
You can skip the driver API and the texture functions on the first pass just to get a quickstart.


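The thread/block/grid paradigm is really just an abstraction away from the details of the underlying processor. Roughly, each block is scheduled onto one of the GPU’s multiprocessors, its threads run on that multiprocessor’s cores, and the grid is the full set of blocks in one kernel launch. This makes it somewhat easier to write CUDA code than it otherwise would be, since you can more easily describe the parallelism in your algorithms.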

Pinned/pageable memory thread:

For your Jacobi method, you’d simply figure out what each thread is going to compute, and that would be your kernel. For example, a matrix addition kernel may simply take three pointers (two for the input matrices and one for the output) plus the matrix dimensions, then take one element from each input matrix, add them, and store the result at the output pointer. The block and thread indices are used inside the kernel to calculate a unique index for the currently executing thread, which is how you calculate the offsets into your arrays, e.g. myArray[blockIdx.x * blockDim.x + threadIdx.x] (threadIdx.x alone is only unique within one block). If you wanted to, each thread could compute several elements of the output; you’ll have to do some trial and error to see what works best.
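As a rough sketch of that matrix addition idea, with one thread per element: the names matAdd and addMatrices are made up for illustration, the matrices are assumed to be stored as flat row-major arrays of equal size, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds exactly one element: c[i] = a[i] + b[i].
__global__ void matAdd(const float* a, const float* b, float* c, int n)
{
    // Unique global index built from the block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may contain threads past the end
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: adds two equally sized, flattened matrices.
std::vector<float> addMatrices(const std::vector<float>& a,
                               const std::vector<float>& b)
{
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    std::vector<float> c(n);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // threads per block (a guess, tune it)
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements
    matAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Calling addMatrices(a, b) on two std::vector<float> of the same length returns their element-wise sum. The block size of 256 is only a starting point for the trial and error.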

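To do repeated iterations, you’ll simply call the kernel repeatedly until some kind of condition is met. A kernel launch is not complete until all of its threads have finished, and launches in the same stream run in order, so the boundary between two launches acts as the barrier between Jacobi iterations. Normally, people will copy a bit of data back to host memory, check the condition, then call the kernel again if necessary.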
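A minimal sketch of that loop for Jacobi, assuming one thread per equation (row of A) and a dense row-major matrix with a nonzero diagonal. The names jacobiStep and jacobiSolve are hypothetical, error checking is omitted, and for simplicity the whole solution vector is copied back each iteration rather than just a small residual value.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <utility>
#include <vector>

// One thread per equation i: computes the next estimate of x[i]
// from the previous iterate xOld.
__global__ void jacobiStep(const float* A, const float* b,
                           const float* xOld, float* xNew, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    for (int j = 0; j < n; ++j) {
        if (j != i) sum += A[i * n + j] * xOld[j];
    }
    xNew[i] = (b[i] - sum) / A[i * n + i];
}

// Hypothetical host driver: relaunches the kernel until the largest
// change in any x[i] drops below tol, or maxIters is reached.
std::vector<float> jacobiSolve(const std::vector<float>& A,
                               const std::vector<float>& b,
                               int maxIters, float tol)
{
    int n = static_cast<int>(b.size());
    size_t vecBytes = n * sizeof(float);
    size_t matBytes = n * vecBytes;

    float *dA = nullptr, *dB = nullptr, *dOld = nullptr, *dNew = nullptr;
    cudaMalloc((void**)&dA, matBytes);
    cudaMalloc((void**)&dB, vecBytes);
    cudaMalloc((void**)&dOld, vecBytes);
    cudaMalloc((void**)&dNew, vecBytes);
    cudaMemcpy(dA, A.data(), matBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), vecBytes, cudaMemcpyHostToDevice);
    cudaMemset(dOld, 0, vecBytes);  // initial guess x = 0

    std::vector<float> xOld(n, 0.0f), xNew(n);
    int threads = 128;
    int blocks = (n + threads - 1) / threads;

    for (int it = 0; it < maxIters; ++it) {
        jacobiStep<<<blocks, threads>>>(dA, dB, dOld, dNew, n);
        // The blocking copy returns only after every thread has finished.
        cudaMemcpy(xNew.data(), dNew, vecBytes, cudaMemcpyDeviceToHost);

        float maxDiff = 0.0f;
        for (int i = 0; i < n; ++i)
            maxDiff = std::fmax(maxDiff, std::fabs(xNew[i] - xOld[i]));

        std::swap(dOld, dNew);  // new iterate becomes input of the next launch
        xOld = xNew;
        if (maxDiff < tol) break;
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOld);
    cudaFree(dNew);
    return xOld;
}
```

Keeping two device buffers and swapping the pointers means no thread ever reads a value that another thread is overwriting in the same iteration.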

When you call cudaMalloc() or cuMemAlloc(), you need to pass it the size of your whole matrix in bytes (for example rows * cols * sizeof(float)), not just the size of one element. The memcpy functions (e.g. cudaMemcpy() or cuMemcpy2D()) are used to transfer data between the host and device.
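To make the allocation and copy steps concrete, here is a small sketch using the runtime API. The helper names zeroedMatrix and roundTrip are invented for the example, the matrix is assumed to be a flat row-major array of float, and error checking is omitted. It also shows one common use of cudaMemset, which fills device memory with a byte value (zeroing a buffer is the usual case).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: allocates a rows x cols float matrix on the device,
// zeroes it with cudaMemset, and copies it back to the host.
std::vector<float> zeroedMatrix(int rows, int cols)
{
    // Total size in bytes: element count times the size of one element.
    size_t bytes = static_cast<size_t>(rows) * cols * sizeof(float);

    float* dMat = nullptr;
    cudaMalloc((void**)&dMat, bytes);
    cudaMemset(dMat, 0, bytes);  // sets every byte to 0, so every float is 0.0f

    std::vector<float> host(static_cast<size_t>(rows) * cols, -1.0f);
    cudaMemcpy(host.data(), dMat, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dMat);
    return host;
}

// Hypothetical helper: copies a matrix to the device and straight back.
std::vector<float> roundTrip(const std::vector<float>& in, int rows, int cols)
{
    size_t bytes = static_cast<size_t>(rows) * cols * sizeof(float);

    float* dMat = nullptr;
    cudaMalloc((void**)&dMat, bytes);
    cudaMemcpy(dMat, in.data(), bytes, cudaMemcpyHostToDevice);  // host -> device

    std::vector<float> out(static_cast<size_t>(rows) * cols);
    cudaMemcpy(out.data(), dMat, bytes, cudaMemcpyDeviceToHost); // device -> host
    cudaFree(dMat);
    return out;
}
```

Note that cudaMemset works on bytes, so it is handy for zeroing but will not set every float to an arbitrary value such as 1.0f.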