3D arrays - where to start?

I am completely new to CUDA, and my reason for wanting to try it out is to see if I can take a C++ script that calculates some large NX x NY x NZ 3D array, and see if it can compute more quickly on my Titan. Right now it’s using OpenMP to parallelize the main calculation which is nice, but I’d like to see if I can obtain more of a speedup using my GPU.

Looking over CUDA now, I understand the very, very basics so far : allocate memory on host and device -> copy variables to device -> run calculations on device -> copy back to host. This didn’t seem too bad at all with basic 1D array calculations. But moving on to the 3D arrays, I am completely lost. What I’m reying to do is the following…

  1. Define a float 3D array, called T, and fill it with a default value of 20.0f
  2. Copy this float T, to the device.
  3. Run the calculation, but calculate a new float 3D array called Tn
for (int i=1; i<Nx; i++) {
    for (int j=1; j<Ny; j++) {
         for (int k=1; k<Nz; k++) {
              Tn[i][j][k] = T[i][j][k] + ...
  1. Obtain the final float Tn 3D array from device back to host

I didn’t think finding a similar example to follow would be tough, but apparently I can’t seem to find anything I can really follow along with. If anyone knows of any good tutorials (section 3.2.2 didn’t help at all to me from Nvidia’s website) on how to pass, calculate, and obtain 3D arrays using CUDA, please let me know. Thanks.

Well, for a first try, you can use the following: A 3D array with the size vector (Nx, Ny, Nz) is not much different to a 1D array with a size of NxNyNz. Therefore, to start working with CUDA I would recommend to just use one-dimensional arrays.

This is not optimal in many cases in terms of memory layout and therefore, there are ways to use 2D and 3D arrays in CUDA as well, e.g. cudaMalloc3D, cudaMemcpy3D and so on.

As Markus commented, you want to look at flattening your arrays, as traditional 2D or 3D arrays will be very slow in CUDA. Here is a similar thread about 4D storage:


You could also use one of the CUDA built-in structures, like int3, float3, etc.

See also the last response in: http://stackoverflow.com/questions/7322810/3d-array-representation-cuda

I went with flattening a 3D array. Unfortunately, my calculation via GPU is very, very slow compared to the CPU. The GPU takes about ~2.8 seconds to complete the calculation. The CPU (3770K 4.6ghz 8-threads using OpenMP) takes ~0.75 seconds. This isn’t the final calculation I’m looking for, but it would be similar, and looped over and over again. I have no idea what to do with “number of blocks” and “number of threads per block” right now. I’ve been trying all different combinations (complete guesses to be honest) but it’s not doing much for calculation time.

But either way, for now I’m just happy I just get output, and it matches CPU output.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <vector>
#include <iostream>
#include <time.h>

using namespace std;

//Transform 3D coordinates into 1D coordinate
int getIndex(int x, int y, int z, int height, int width) {
	return x + y*height + z*height*width;

__device__ int arrayIndex(int x, int y, int z, int height, int width) {
	return x + y*height + z*height*width;

__global__ void calcTn(float dx, float dy, float dz, float dt, float alpha, int xLen, int yLen, int zLen, float *T, float *Tn)
	for(int i=1; i<xLen-1; i++) {
		for (int j=1; j<yLen-1; j++) {
			for (int k=0; k<zLen; k++) {
				int i_j_k = arrayIndex(i, j, k, yLen, xLen);
				int bi_j_k = arrayIndex(i-1, j, k, yLen, xLen);
				int fi_j_k = arrayIndex(i+1, j, k, yLen, xLen);
				int i_bj_k = arrayIndex(i, j-1, k, yLen, xLen);
				int i_fj_k = arrayIndex(i, j+1, k, yLen, xLen);
				int i_j_bk = arrayIndex(i, j, k-1, yLen, xLen);
				int i_j_fk = arrayIndex(i, j, k+1, yLen, xLen);

				Tn[i_j_k] = T[i_j_k] + (dt * (alpha/(dx*dx) * (T[fi_j_k] - 2*T[i_j_k] + T[bi_j_k]))) + (dt * (alpha/(dy*dy) * (T[i_fj_k] - 2*T[i_j_k] + T[i_bj_k]))) + (dt * (alpha/(dz*dz) * (T[i_j_fk] - 2*T[i_j_k] + T[i_j_bk])));

int main()

	clock_t init, final;

	int xNodes = 128;
	int yNodes = 128;
	int zNodes = 32;
	int N = xNodes*yNodes*zNodes;

	float dx, dy, dz, dt;
	dx = 0.1;
	dy = 0.1;
	dz = 0.1;
	dt = 0.0001;

	float cond, cp, rho, alpha;
	cond = 400.0f; // (W/(mK))
	cp = 385.0f; // J/(kgK)
	rho = 8940.0f; // kg/m^3
	alpha = cond*cp/rho;
	int Tini = 20.0f;

	//Initializing variable sized temperature arrays
	std::vector<float> T(N, Tini);
	std::vector<float> Tn(N, Tini);

	//Initializing lower wall to be T = 100
	for(int i=0; i<xNodes; i++)
		for(int j=0; j<yNodes; j++)
			int ikx = getIndex(i, j, 0, yNodes, xNodes);
			T[ikx] = 100.0f;
			Tn[ikx] = 100.0f;

	//Convert from std::vector to float to get working with CUDA
	float *Q = &T[0];
	float *Qn = &Tn[0];

	float *d_T, *d_Tn, *oTn;

	oTn = (float*)malloc(N*sizeof(float));

	cudaMalloc(&d_T, N*sizeof(float));
	cudaMalloc(&d_Tn, N*sizeof(float));
	cudaMemcpy(d_T, Q, N*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(d_Tn, Qn, N*sizeof(float), cudaMemcpyHostToDevice);

	calcTn<<<1,1024>>>(dx, dy, dz, dt, alpha, xNodes, yNodes, zNodes, d_T, d_Tn);

	cudaMemcpy(oTn, d_Tn, N*sizeof(float), cudaMemcpyDeviceToHost);

	//Final output
	cout << "Calculation Time : " << (double)final / ((double)CLOCKS_PER_SEC) << " seconds" << endl;

	int arrayInt = getIndex(50, 50, 1, yNodes, xNodes);
	cout << "T[50][50][1] = " << oTn[arrayInt] << endl;

Well, in your CUDA code, you are not using threads to distribute work but every thread computes every index. This cannot be faster than a CPU.

You have to use threadIdx.x, threadIdx.y and threadIdx.z and you further have to make use of block and grid dimensions. Please read the CUDA C Programming Guide how to do it.

when you submit a program one would use kernel<<<blocks,threads>>>(…).

Where blocks and threads aare defined as:
dim3 blocks,threads;

You give values as

In the kernel you have intrisic variables :

blockIdx.x,blockIdx.y,blockIdx.z and threadIdx.x,threadIdx.y,threadIdx.z. Which you can map to an array up to 6 dimensions. If you really wanat to keep a 3D array you can declare a global array as:

device dev_a[nx][ny][nz]; // this array is now visible to all kernels, but if you want to copy the data from gpu to cpu you have to get the address first.

I used something like this for testing.

I actually was going to ask how CUDA “knew” how to run the code in parallel, but forgot. Been looking over some slides and pdfs about blocks, threads, and grids, so I’ll be giving that a shot later. But it’s good to know though that what I just did was rightfully slow.

I got it working now threaded, and I’m seeing a huge performance boost. I still have a few more questions…

  1. When calling func<<<blocks,threads>>> is there any reason not to max out threads if N = blocks*1024? In my case, it’s 1024 threads for Titan.

  2. When each thread runs through the kernel, are variables that are calculated within the kernel, exclusive to that thread? For example :

__global__ void calcTn(float dx, float dy, float dz, float dt, float alpha, int xLen, int yLen, int zLen, float *T, float *Tn)
	int index = blockIdx.x * blockDim.x + threadIdx.x;

	if(index < xLen*yLen*zLen) {
		//Get ijk indices from each index
		int k = index/(xLen*yLen); 
		index -= k*xLen*yLen; 
		int j = index/xLen; //maybe here yLen 
		index -= j*xLen;
		int i = index/1;

                //rest of the code omitted

When I am calculating in i, j, k. Does this present a possible thread racing issue? Can for example, thread 55 i,j,k accidentally be used in thread 2? It seems like my code is not having the issue, but I just want to be sure.

Also, a quick initial benchmark…

512x512x64 array time stepped 100 times (100 loops)

CPU : 3770K 4.6ghz 8-threads via OpenMP = 459 seconds
Titan : stock = 6 seconds!!!

I think in your case the i,jk indeces are defined so that are unique to each thread. Check that the results from the gpu are giving the right results (check against some known result).

I agree with pasoleatis. Have you verified the results are correct? Your example code has no error checking of CUDA calls, so if anything fails (which is quite possible if your launch configuration is wrong or there is a memory access violation), you won’t know it. It looks like your algorithm will be memory bandwidth bound on the GPU, so I would have only expected a performance improvement of a factor of 10-20x over the CPU, rather than a factor of 76x.

Also, a quick read of your code makes me worried that you might be exceeding your array bounds in the z-dimension because the loop over “k” is not restricted to run between 1 and zLen-1 (like the x and y-dimensions), yet you still access the array at k-1 and k+1 in your nearest neighbor lookup.

That code is dated now. I am no longer calculating 3D inside of the kernel. I’ve taken the previous advice of flattening the array. I have two functions that will figure 3D indices to 1D index, or vice versa. But, you were right about the z-dimension. It should have been zLen-1.

For very basic error checking, is simply adding the following sufficient just to see if any error at all occurred?

cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
	cout << "CUDA error occured :" << cudaGetErrorString(cudaStatus) << endl;

The reason I ask is because I have my kernel calls setup like this in the host thread

for(int t=0; t<100; t++) {
	cudaMalloc(&d_T, N*sizeof(float));
	cudaMalloc(&d_Tn, N*sizeof(float));
	cudaMemcpy(d_T, Qn, N*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(d_Tn, Qn, N*sizeof(float), cudaMemcpyHostToDevice);

	calcTn<<<numBlocks,numThreads>>>(dx, dy, dz, dt, alpha, xNodes, yNodes, zNodes, d_T, d_Tn);
	boundaryTn<<<numBlocks,numThreads>>>(xNodes, yNodes, zNodes, d_T, d_Tn);

	cudaMemcpy(Qn, d_Tn, N*sizeof(float), cudaMemcpyDeviceToHost);

I have to loop the calculation over and over, with the next calculation being based off the previous. So I’m always copying data back and forth between the device and host. My concern for now is error checking slowing down the calculation. So instead of checking every single time I use a cuda function, check say every X amount of loops, and then dig further for the error if one shows up.

Edit : Nevermind this reply.

Yes, you can wait to check for errors outside of your calculation loop. Some people also make a C preprocessor macro (like “checkCudaErrors” in the CUDA sample programs) that they #define to a no-op when they are not debugging.