Converting a square matrix to upper triangular form in CUDA.

Dear CUDA programmers,
I’ve just started developing in CUDA and I’m stuck on a simple task.
Task: write CUDA code that converts a square matrix into upper triangular form. The same task in OpenMP went really smoothly (thanks to my teammate):

#include <stdlib.h>
#include <omp.h>
#include "utp-2.h"

/* The second version parallelizes both of the inner loops */
void upper_triangle_2 ( double **A, int N, int num_of_threads )
{
    int i, j, k;
    double ratio;

    for ( i = 0; i < N-1; i++ ) {
        #pragma omp parallel for private(k, ratio) shared(N)
        for ( j = i+1; j < N; j++ ) {
            ratio = A[j][i] / A[i][i];
            #pragma omp parallel for
            for ( k = i; k < N; k++ )
                A[j][k] -= ratio * A[i][k];
        }
    }
}
But I’m stuck on the CUDA version of this task.

#include "../../lib/book.h"
#define N (100)

__global__ void matrixtransform( double **MatA ){
    int i, j, k;
    double ratio = 1.0;

    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;

    for (i = 0; i < N-1; i++){
        for (j = i+1; j < N; j++){
            ratio = MatA[j+ix][i*iy] / MatA[i*iy][i*iy];
            for (k = i; k < N; k++){
                MatA[j][k] -= ratio * MatA[i][k];
            }
        }
    }
}
int main (int argc, char **argv){
    double a[N][N];
    double **dev_a;
    float elapsedTime;

    printf("--- ADAS Upper Triangular Matrix Converter - CUDA Version ---\n");
    // allocate the memory on the CPU

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * N * sizeof(double) ) );

    // fill the array 'a' on the CPU
    //mat_rand(a, N, N);
    for (int i=1; i<N; i++){
        for(int k=1; k<N; k++ )
            a[i][k] = k*1.0;
    }

    // copy the array 'a' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * N * sizeof(double), cudaMemcpyHostToDevice ) );

    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    cudaEventRecord(start, 0);

    matrixtransform<<<N, N>>>(dev_a);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("elapsed time = %g msec\n", elapsedTime);

    // free the memory allocated on the GPU
    HANDLE_ERROR( cudaFree( dev_a ) );
    //HANDLE_ERROR( cudaFree( dev_b ) );

    return 0;
}
When I compile the code with nvcc (nvcc -o filename filename.cu -lpthread),
I get the following error: Cannot tell what pointer points to, assuming global memory space

Any ideas what I should do?

Thank you very much in advance

No ideas? :)

That code has multiple issues. [font=“Courier New”]double **A[/font] and [font=“Courier New”]double A[N][N][/font] are two very different things in C. The latter is a two-dimensional array, the former is a pointer to a pointer to double.


The update line

[font=“Courier New”]MatA[j][k] -= ratio * MatA[i][k];[/font]

can’t work in CUDA if [font=“Courier New”]j[/font] and [font=“Courier New”]k[/font] are the same for all threads (it doesn’t work even if they are the same for just two threads). You need to replace it with an atomic operation (which unfortunately doesn’t exist for double precision, but you can find a software implementation in appendix B.17 of the CUDA C Programming Guide):

atomicAdd(&(MatA[j][k]), -ratio*MatA[i][k]);
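For reference, the software double-precision atomicAdd from that appendix of the programming guide is built on the 64-bit atomicCAS (so it requires compute capability 1.3 or later); renamed here to avoid clashing with the built-in name:

```cuda
// Double-precision atomic add via 64-bit compare-and-swap, as given
// in the CUDA C Programming Guide. Loops until no other thread has
// modified the value between the read and the swap.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull =
        (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);   // value before the add
}
```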

Of course this is not efficient. It would be far better to reorganize your code to have one thread per output, so that no two threads write to the same matrix element and atomic operations are not needed.
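A sketch of that reorganization (illustrative and untested; kernel names, launch configuration, and the flat N*N storage of the matrix are my assumptions, not part of the original code): the host loops over pivots, and each elimination step is split into two launches so that no thread writes an element another thread still needs to read.

```cuda
// Step 1: one thread per row below the pivot computes its ratio.
// Precomputing ratios avoids a race on column i, which the second
// kernel overwrites.
__global__ void compute_ratios(const double *A, double *ratio,
                               int n, int i)
{
    int j = i + 1 + blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n)
        ratio[j] = A[j * n + i] / A[i * n + i];
}

// Step 2: one thread per element of the remaining submatrix.
// No two threads write the same location, so no atomics are needed.
__global__ void eliminate_step(double *A, const double *ratio,
                               int n, int i)
{
    int k = i + blockIdx.x * blockDim.x + threadIdx.x;      // column
    int j = i + 1 + blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (j < n && k < n)
        A[j * n + k] -= ratio[j] * A[i * n + k];
}

// Host-side loop (dev_A and dev_ratio are device allocations):
// for (int i = 0; i < n - 1; i++) {
//     dim3 block(16, 16);
//     dim3 grid((n - i + 15) / 16, (n - i - 1 + 15) / 16);
//     compute_ratios<<<(n - i - 1 + 255) / 256, 256>>>(dev_A, dev_ratio, n, i);
//     eliminate_step<<<grid, block>>>(dev_A, dev_ratio, n, i);
// }
```

Note that the host loop over pivots stays sequential, because each elimination step depends on the results of the previous one; only the work within a step is parallelized.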

Thanks tera

Could you please tell me more about reorganizing the code to have one thread per output?

thanks again.