Help With Some Simple Code

Hello. I’m currently engaged in a third year project in computer science at the University of Manchester, UK. I’m attempting to simulate the image artifacts that occur when imaging objects with nanoscale structure (using techniques such as atomic force microscopy).

I’m attempting to work out the distance from a point on the tip (represented as a 3D volume) to another point in space (also represented as a 3D volume), and then working out the force at that point (based on this distance value). The code I have working in java currently is as follows:

[codebox] //tip position is just an int {l, m, n}

		for (int i = 0; i < returnMatrix.slices(); i++) {

			for (int j = 0; j < returnMatrix.rows(); j++) {

				for (int k = 0; k < returnMatrix.columns(); k++) {

					int[] gridPosition = {i, j, k};

					double force = calculatePotential(gridPosition, tipPosition);

					double currentValue = returnMatrix.getQuick(i, j, k);

					returnMatrix.setQuick(i, j, k, currentValue + force);

				}//for k

			}//for j

		}//for i[/codebox]

I made an attempt at writing this in CUDA, and I think it ought to work, but it doesn’t. Most of the values come out wildly wrong, or NAN.

[codebox]global void calculateForceMatrix(int N, float *result, float tipGridSpacing, float objectGridSpacing, float surfaceCentre, float tipCentre, int surfaceMatrixHeight, int l, int m, int n){

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

int k = blockIdx.z * blockDim.z + threadIdx.z;

if(i < N && j < N && k < N){

	//result[i][j][k] = dist(i,j,k);

	float xDimension = ((tipCentre-i) * tipGridSpacing) + ((surfaceCentre-l) * objectGridSpacing);

	float yDimension = ((tipCentre-j) * tipGridSpacing) + ((surfaceCentre-m) * objectGridSpacing);

	float zDimension = ((surfaceMatrixHeight - k) * objectGridSpacing) + (n * tipGridSpacing);

	float distance = fabs(xDimension) + fabs(yDimension) + fabs(zDimension);

	float sixthPow = pow(1/distance, 6);

	float closeTerm = sixthPow * sixthPow;

	float farTerm = sixthPow;

	float lennardJonesForce = -24*(2*closeTerm - farTerm);


	int test = i*N*N + j*N + k;

	result[test] = lennardJonesForce;

	//int index = ((i*N*N)+{j*N)+k);

	//result[index] = lennardJonesForce;



[codebox] float *a_h, *a_d; // Pointer to host & device arrays

size_t size = N * N * N * sizeof(float);

a_h = (float *)malloc(size);        // Allocate array on host

cudaMalloc((void **) &a_d, size);   // Allocate array on device


// Do calculation on device:

dim3 dimBlock(N, N, N);

calculateForceMatrix<<<1, dimBlock>>>(N, a_d, tipGridSpacing, objectGridSpacing, surfaceCentre, tipCentre, surfaceMatrixHeight, l, m, n);

// Retrieve result from device and store it in host array

cudaMemcpy(a_h, a_d, sizeof(float)*N*N*N, cudaMemcpyDeviceToHost);

for (int i=0; i<N*N*N; i++) printf("%d %f\n", i, a_h[i]);[/codebox]

I’m sure i’m doing something blindingly obviously wrong, but I just can’t see it. Any help would be much appreciated.



I scanned the code and nothing looks “wrong” although a few things I did notice:

  1. Distance is the sum of absolute values of x, y, and z, which is a bit odd. I would have expected it to be the square, but I don’t know the mathematics of your problem. I’ll assume absolute value is what you want.

  2. For a constant integer power, it’s generally better to use multiplication instead of pow(). Also, pow() is a double-precision function and I’m not sure how it behaves without compute capability 1.3. Prefer powf() if you’re using float data. EDIT: Nevermind, on 1.2 and lower, double-precision functions are mapped to their single-precision equivalents, so it won’t give bad values. But you should still prefer multiplication for constant integer powers.

  3. A thread block that’s N x N x N will only work if N is 8 or less. I assume this is the case or it wouldn’t run. And I also assume you know that running a single block will underutilize your GPU.

None of these things is really wrong, but that’s all I saw.

  1. Yep, my mistake :) I was tired…

  2. Yeah, of course. I typically switch it to that once I know it’s working (it’s easier to read with pow).

  3. I actually didn’t know that. I thought you could have any number of threads per block. Realistically N is going to be about 60 :S I also didn’t know that running a single block would under utilize the GPU. Clearly I’ve still got a lot to read up on.