Troubles converting float array to float2 array

Hello,

Trying to convert a float array to an array of float2 type by transferring values of original float array to the real part of the float2 array (f2_array.x). The input float array is 2D.

The kernel I created to do this seems pretty straight forward, but there could be an error I’m not seeing. Is there a problem with the way the variable f2_array is declared?

Declaring the variable as “float2 *f2_array;” gives a segmentation fault when trying to output the values of the array.

Declaring as"float2 *f2_array;" gives a segmentation fault when the variable is declared.

Declaring as “static float2 f2_array;” outputs the following:

Cuda Array test:
21 21 21 21 21
26 21 16 21 12
21 21 40 23 14
16 40 26 24 13
21 23 24 21 18

Cuda Float2 Array test:
-256 -256 -256 -256 -256
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0

These output values should be the same in both cases I believe.

#include <cufft.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <cuda_device_runtime_api.h>

#include <iostream>
#include <stdlib.h>
#include <stdio.h>

#include "header.h"

#define height 2048
#define width 2448
#define size 5013504



__global__ void datatransfer(float2 *f2, float *f)
{

int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;

if (x>0 && x<width && y>0 && y< height) {
f2[width*y+x].x =  f[width*y+x];
f2[width*y+x].y = 0;


}

}


int fastft(float *array)
{

//Create Variables
float2 *f2_array;
//or static float2 f2_array;
dim3 threadsPerBlock(153,128);
dim3 numBlocks(16,16);

//Display input Array values
std::cout<<"Cuda Array test: " <<std::endl;
for (int i=0;i<5;i++) {
	std::cout<<array[10*i] <<"  "<< array[20*i] <<"  "<< array[30*i] <<"  "<< array[40*i] <<"  "<< array[50*i] <<std::endl;
	} 

//Allocate memory on GPU
cudaMalloc((void **)&array,sizeof(float)*size);
cudaMalloc((void **) &f2_array, sizeof(float2)*size);

//Copy variables from host to device
cudaMemcpy(array,array,sizeof(float)*size,cudaMemcpyHostToDevice);
cudaMemcpy(f2_array,f2_array,sizeof(float2)*size,cudaMemcpyHostToDevice);

//Execute kernel that performs conversion from float --> float2
datatransfer<<<numBlocks,threadsPerBlock>>>(f2_array,array);

//Copy 
cudaMemcpy(f2_array,f2_array,sizeof(float)*size,cudaMemcpyDeviceToHost);

//Display results of conversion
std::cout<<" "<<std::endl;
std::cout<<"Cuda Float2 Array test: " <<std::endl;
for (int i=0;i<5;i++) {
	std::cout<<f2_array[10*i].x <<"  "<< f2_array[20*i].x <<"  "<< f2_array[30*i].x <<"  "<< f2_array[40*i].x <<"  "<< f2_array[50*i].x <<std::endl;
	} 

cudaFree(array);
cudaFree(f2_array);
return 0;

}

First of all, you should always implement proper cuda error checking, any time you are having trouble with a cuda code. Not sure what that is? Google “proper cuda error checking” and take the first hit, and read it, and apply it to your code.

Second, when having trouble, run your code with cuda-memcheck

I don’t know what all problems may exist in your code, since you haven’t provided a complete example that I could compile and test. If you really want help, and want to make it easy for others to help you, provide a complete code so I can copy, paste, compile and run, with minimum effort, and see the trouble you are having.

One problem is here:

dim3 threadsPerBlock(153,128);

This is illegal in CUDA. CUDA threadblocks are limited to a maximum of 1024 threads, which is the product of the individual dimensions. This is a very basic CUDA coding principle.

If you had used proper error checking, you would have gotten an immediate indication that the kernel launch is failing for an invalid parameter. This is a clue to check your launch configuration.

I see now thank you for the correction. I was under the impression threadblocks could withstand more than 1024 threads from reading that the max dimensions of a threadblock were (1024x1024x64). I use cuda error checking in my full program for memory allocation, memory copy, executing FFT, etc… but I did not think to try error checking for the kernel. I will do so and if there is still trouble I will provide a complete code able to be compiled.

I don’t see any evidence in what you have posted for error checking on memory copy and memory allocation either. I can only judge based on what I see, and it is such a commonly overlooked thing that I generally point it out when responding to such a post.

These constructs are not correct either:

cudaMemcpy(array,array,sizeof(float)*size,cudaMemcpyHostToDevice);
cudaMemcpy(f2_array,f2_array,sizeof(float2)*size,cudaMemcpyHostToDevice);

in each case, your source and destination pointers are the same. If you were actually using proper cuda error checking on those lines as you state, they should have thrown errors. You can copy from source to dest when the pointer is the same, but you cannot specify cudaMemcpyHostToDevice for such a transaction, when the allocation in question is an ordinary device (cudaMalloc) allocation.

So if those lines are your actual code, and you claim in your actual code that you are doing error checking, something doesn’t add up.

If those are not your actual code, retyping code into the browser that you’ve never actually compiled, run, or tested, is arguably wasting the time of those trying to help you, pointing out things that are problems but not in your “real code”.

Popular websites try to avoid such annoyances, recommending that posters who are asking for help debugging a code provide something like this:

http://sscce.org/

https://stackoverflow.com/help/mcve

Thanks for the pointers Bob I’ll keep these in mind. C++/CUDA and everything associated with these two is all completely new to me having only began a few weeks ago. I work in a biomedical imaging lab and have used Matlab the entire time. For the project I am working on we are using a TX2, which has been a very advanced feat for myself thus far, so I appreciate all the help you have offered.

Your code had various other issues. I’m not going to try and list all the changes I made, but here is a fully worked example built around the code you have shown, that runs without error for me, and seems to give a reasonable result:

$ cat t347.cu
#include <cufft.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <cuda_device_runtime_api.h>

#include <iostream>
#include <stdlib.h>
#include <stdio.h>

//#include "header.h"

#define height (2048ULL)
#define width (2448ULL)
#define size (height*width)



__global__ void datatransfer(float2 *f2, float *f)
{

int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;

if (x>=0 && x<width && y>=0 && y< height) {
f2[width*y+x].x =  f[width*y+x];
f2[width*y+x].y = 0;


}

}


int fastft(float *array)
{

//Create Variables
float2 *f2_array, *h_f2_array;
float *d_array;
//or static float2 f2_array;
dim3 threadsPerBlock(32,32);
dim3 numBlocks((width+threadsPerBlock.x-1)/threadsPerBlock.x, (height+threadsPerBlock.y-1)/threadsPerBlock.y);

//Display input Array values
std::cout<<"Cuda Array test: " <<std::endl;
for (int i=0;i<5;i++) {
        std::cout<<array[10*i] <<"  "<< array[20*i] <<"  "<< array[30*i] <<"  "<< array[40*i] <<"  "<< array[50*i] <<std::endl;
        }

//Allocate memory on GPU
cudaMalloc((void **)&d_array,sizeof(float)*size);
cudaMalloc((void **) &f2_array, sizeof(float2)*size);

//Copy variables from host to device
cudaMemcpy(d_array,array,sizeof(float)*size,cudaMemcpyHostToDevice);

//Execute kernel that performs conversion from float --> float2
datatransfer<<<numBlocks,threadsPerBlock>>>(f2_array,d_array);

//Copy
h_f2_array=(float2 *)malloc(sizeof(float2)*size);
cudaMemcpy(h_f2_array,f2_array,sizeof(float)*size,cudaMemcpyDeviceToHost);

//Display results of conversion
std::cout<<" "<<std::endl;
std::cout<<"Cuda Float2 Array test: " <<std::endl;
for (int i=0;i<5;i++) {
        std::cout<<h_f2_array[10*i].x <<"  "<< h_f2_array[20*i].x <<"  "<< h_f2_array[30*i].x <<"  "<< h_f2_array[40*i].x <<"  "<< h_f2_array[50*i].x <<std::endl;
        }

cudaFree(d_array);
cudaFree(f2_array);
return 0;

}

int main(){

  float *my_array=(float *)malloc(size*sizeof(float));
  for (int i = 0; i < size; i++) my_array[i] = i;
  fastft(my_array);
}
$ nvcc -arch=sm_30 -o t347 t347.cu
$ cuda-memcheck ./t347
========= CUDA-MEMCHECK
Cuda Array test:
0  0  0  0  0
10  20  30  40  50
20  40  60  80  100
30  60  90  120  150
40  80  120  160  200

Cuda Float2 Array test:
0  0  0  0  0
10  20  30  40  50
20  40  60  80  100
30  60  90  120  150
40  80  120  160  200
========= ERROR SUMMARY: 0 errors
$

Revised code. Still receiving troubles.

$ cuda-memcheck ./cuda_test ========= CUDA-MEMCHECK
Time passed CV_8UC1 --> CV_32FC1: 0.0132917
Image Type: 5

Array Test:
21 21 21 21 21
26 21 16 21 12
21 21 40 23 14
16 40 26 24 13
21 23 24 21 18

Cuda Array test:
21 21 21 21 21
26 21 16 21 12
21 21 40 23 14
16 40 26 24 13
21 23 24 21 18

Cuda Float2 Array test:
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
========= ERROR SUMMARY: 0 errors

g++ -std=c++11 main.cpp -I/usr/include -lopencv_core -lopencv_gpu -lopencv_highgui -lopencv_imgproc -lpthread -c -o cuda_test_C2C.o

nvcc f_to_f2.cu -I/usr/local/cuda-8.0/include -I/usr/include -lopencv_core -lopencv_gpu -lopencv_highgui -lopencv_imgproc -L/usr/local/cuda-8.0/lib64 -lcufft -lcudart -lcufftw -c -o f_to_f2.o

g++ main.o f_to_f2.o -I/usr/include -I/usr/local/cuda-8.0/include -lopencv_core -lopencv_highgui -lopencv_imgproc -L/usr/local/cuda-8.0/lib64 -lcufft -lcudart -lcufftw -lpthread -o cuda_test

main.cpp
#include <opencv2/core/core.hpp>
#include <opencv2/opencv.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/gpu/gpumat.hpp>
#include <iostream>

#include <thread>
#include "header.h"

#define width 2448
#define height 2048
#define size 5013504


float *array=(float *)malloc(size*sizeof(float));


void thread1(cv::Mat img) 
{
	for (int i=0;i<height/4;i++)
		for (int j=0;j<width/4;j++) {
		array[width*i+j] = img.at<float>(i,j);
	}
}

void thread2(cv::Mat img) 
{
	for (int i=height/4+1;i<height/2;i++)
		for (int j=width/4+1;j<width/2;j++) {
		array[width*i+j] = img.at<float>(i,j);
	}
}

void thread3(cv::Mat img) 
{
	for (int i=height/2+1;i<height*.75;i++)
		for (int j=width/2+1;j<width*.75;j++) {
		array[width*i+j] = img.at<float>(i,j);
	}
}

void thread4(cv::Mat img) 
{
	for (int i=height*.75+1;i<height;i++)
		for (int j=width*.75+1;j<width;j++) {
		array[width*i+j] = img.at<float>(i,j);
	}
}


int main()
{

	
	cv::Mat image =cv::imread("im2_48.jpg",CV_LOAD_IMAGE_GRAYSCALE);
	
	
	double t = (double)cv::getTickCount();
	image.convertTo(image,CV_32FC1);
	t = ((double)cv::getTickCount() - t)/cv::getTickFrequency();

	
	std::cout << "Time passed CV_8UC1 --> CV_32FC1: " << t << std::endl;	
	std:: cout << "Image Type: " << image.type() << std::endl;
	
	
	std::thread t1(thread1,image);
	std::thread t2(thread2,image);
	std::thread t3(thread3,image);
	std::thread t4(thread4,image);
	t1.join();
	t2.join();
	t3.join();
	t4.join();
	std::cout<<" "<<std::endl;
	std::cout<<"Array Test: "<<std::endl;
	for (int i=0;i<5;i++) {
	std::cout<<array[10*i] <<"  "<< array[20*i] <<"  "<< array[30*i] <<"  "<< array[40*i] <<"  "<< array[50*i] <<std::endl;
	} 
	
	fastft(array);
	
return 0;

}
f_to_f2.cu
#include <cufft.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <cuda_device_runtime_api.h>

#include <iostream>
#include <stdlib.h>
#include <stdio.h>

#include <opencv2/core/core.hpp>
#include <opencv2/opencv.hpp>

#include "header.h"

#define height (2048ULL)
#define width (2448ULL)
#define size (height*width)



__global__ void datatransfer(float2 *f2, float *f)
{

int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;

if (x>0 && x<width && y>0 && y< height) {
f2[width*y+x].x =  f[width*y+x];
f2[width*y+x].y = 0;


}

}


int fastft(float *array)
{

//Conversion from float to float2

float2 *f2_array,*dev_f2_array;
float *dev_array;

std::cout<<" "<<std::endl;
std::cout<<"Cuda Array test: " <<std::endl;
for (int i=0;i<5;i++) {
	std::cout<<array[10*i] <<"  "<< array[20*i] <<"  "<< array[30*i] <<"  "<< array[40*i] <<"  "<< array[50*i] <<std::endl;
	} 

if (cudaMalloc((void **)&dev_array,sizeof(float)*size) != cudaSuccess)
	{
	fprintf(stderr,"Cuda Error: failed to allocate dev_array");
	}
if (cudaMalloc((void **) &dev_f2_array, sizeof(float2)*size) != cudaSuccess)
	{
	fprintf(stderr,"Cuda Error: failed to allocate dev_array");
	}

cudaMemcpy(dev_array,array,sizeof(float)*size,cudaMemcpyHostToDevice);
if (cudaGetLastError() != cudaSuccess) 
	{
	fprintf(stderr, "Cuda Error: Failed to copy to dev_array\n");	
	}


dim3 threadsPerBlock(32,32);
dim3 numBlocks((width+threadsPerBlock.x-1)/threadsPerBlock.x,(height+threadsPerBlock.y-1)/threadsPerBlock.y);

datatransfer<<<numBlocks,threadsPerBlock>>>(dev_f2_array,dev_array);

if (cudaGetLastError() != cudaSuccess) 
	{
	fprintf(stderr, "Cuda Error: Failed to execute kernel\n");	
	}

f2_array=(float2 *)malloc(sizeof(float2)*size);

cudaMemcpy(f2_array,dev_f2_array,sizeof(float)*size,cudaMemcpyDeviceToHost);
if (cudaGetLastError() != cudaSuccess) 
	{
	fprintf(stderr, "Cuda Error: Failed to copy to f2_array\n");	
	}

std::cout<<" "<<std::endl;
std::cout<<"Cuda Float2 Array test: " <<std::endl;
for (int i=0;i<5;i++) {
	std::cout<<f2_array[10*i].x <<"  "<< f2_array[20*i].x <<"  "<< f2_array[30*i].x <<"  "<< f2_array[40*i].x <<"  "<< f2_array[50*i].x <<std::endl;
	} 

if (cudaFree(dev_array) != cudaSuccess) 
	{
	fprintf(stderr,"Failed to free dev_array\n");
	}

if (cudaFree(dev_f2_array) != cudaSuccess) 
	{
	fprintf(stderr,"Failed to free dev_array\n");
	}

return 0;

}
// header.h

#ifndef _HEADER_H
#define _HEADER_H

#include <iostream>


int fastft(float *array);

#endif

I don’t believe there is anything seriously wrong with your code in f_to_f2.cu I used it more or less verbatim and built a sample code around it and it works fine (omitting OpenCV stuff, which is basically irrelevant for that file.)

I can’t explain why you are getting zeroes. My best guess is that there is something wrong with your TX2 board, or the CUDA install on it. Are you able to successfully run the CUDA sample codes vectorAdd, deviceQuery, and bandwidthTest on it?

If so I am almost out of ideas. One other extreme stretch idea is that you are not linking against what you just built. This command:

g++ … -c -o cuda_test_C2C.o

produces a linkable object called cuda_test_C2C.o

but your final link phase:

g++ main.o f_to_f2.o … -o cuda_test

does not link against that object. It links against main.o, which does not show up anywhere in the previous build (compile) steps you have indicated. I initially assumed this was just another typo, but if you have a main.o sitting around from some previous thing, it may be confusing the issue. Having said all that, I cannot actually put together a theory around this that would explain the output you have posted.

(There is also something I have never witnessed in all my years of C programming in your file main.cpp, but the construct appears to work, so I guess it must be valid syntax. Anyway I don’t think it is relevant to whatever you are currently dealing with.)

I feel a bit bad about sending you on this duck hunt with a rather blase comment about using C2C rather than R2C. Unfortunately you’re struggling quite a bit with what I thought was a trivial suggestion. But perhaps the joy is the journey, not the destination. If you’re frustrated, it’s not a ridiculous idea to go back to using R2C. It’s quite sensible. But something seems to be amiss with your TX2 anyway.

That silly typo for compilation might be the problem. I will check Monday.

You must be talking about the threads I created to transfer the data from the OpenCV Mat to an array. I thought it was interesting idea to speed up data transfer between array type variables. With one ‘for’ loop the data transfer took 20ms, way slower than a typical FFT from what I’ve seen. So I tried creating threads to fill the variable and reduced it down to ~4ms. There might be better or way more obvious way to do this, but my background is in engineering, not programming so this seemed like an easy way to speed things up. In the actual program the image will come from a camera in a FlyCap2 Image data type, which needs to be transferred to a C++ array to do the FFT. I am not sure how to pass data from FlyCap2 Image, or OpenCV Mat data into a cuda kernel otherwise that would probably be the best option.

I was referring to this, in main.cpp:

float *array=(float )malloc(sizesizeof(float));

It turned out to be a difference in C vs. C++

What I previously thought was this:

https://stackoverflow.com/questions/6742820/malloc-function-dynamic-memory-allocation-resulting-in-an-error-when-it-is-use

And your code seems to violate that, but there are no compile errors and I don’t see any runtime or functional issues when I test it. After further study this appears to be a difference between C and C++ that I was not aware of. It is apparently illegal in C but legal in C++. The compiler will complain about that code in C but not C++.

I’m aware of course that there are plenty of opportunities in C++ to initialize things prior to main, but I would have thought such things would need to be in constructors or the like, ie. function code should be wrapped in curly braces, somehow. But apparently not.

txbob, in C++ i usually use constructions like that to call a function at initialize-time:

static int nonused = CallMe();

probably in C++ arbitrary expression can be used for initialization, and afaik initialization expressions are topologically sorted so “a=b+1” is executed prior to “c=a+2”

I have a new post simplifying this problem at this link. https://devtalk.nvidia.com/default/topic/1016690/cuda-programming-and-performance/very-simple-cuda-program-bad-output-/