Very big Bottleneck when reading files

My program uses CUDA + OpenGL to dynamically deform meshes in real time, and it works great with no problems.

But I'm hitting a big bottleneck when importing these meshes from their files, because I have to read every single line of the file and copy that data to the host as an intermediate step, then copy it to the device as one block of memory.
Some meshes contain up to 2 million vertices, and a 160MB file takes more than 2 minutes to load. This is too much time.
Can I use CUDA to speed this up?

What is the exact format of your mesh files? You cannot use CUDA for reading from disk, but you could probably do it more efficiently by improving your parser, using memory-mapping of your mesh files, etc.

the software supports these formats:

off
matlab (txt)
nx, ny, nz (txt)
comsol_1

but the most used in my case is the “OFF” format; if I can speed that up to start with something, it would be great.
I am not very experienced with file reading; I have always used the classic method of reading line by line.

Someone did once use CUDA to speed up parsing of ASCII numeric data… you may search this forum for the topic.
The speedup was pretty good because the transfer time wasn’t the bottleneck; the sscanf() overhead of a single CPU core was.

If you’re taking 2 minutes to load 160MB, that must be parsing speed, since a modern hard drive can deliver that much data in about 2-3 seconds (assuming one big contiguous file), and sending it over the PCIe bus would take milliseconds.

If possible, you should prefer binary over text formats. Binary formats have problems of their own (like endianness), but still, as mentioned by SPWorley above, parsing time is dominating this task, and you don’t have to parse much with binary formats.

But if you have to deal with textual mesh formats, then you should definitely re-examine your parsing code; there is no way parsing 160MB could take 2 minutes on a contemporary machine. I looked quickly into the OFF file format and came up with some quick tests; see the attached files (I know the code is rather ugly in many respects, but it should do for a quick test). The file foo.c generates 4M vertices in a ~140MB OFF file, and then the files bar1.c and bar2.cpp parse this format, the first one directly from disk and the latter by loading the whole file into memory and then parsing from there. Timings on my machine are, for the first version of the reader:

[user@host tmp]$ gcc -o bar1 -Wall bar1.c && time ./bar1 foo.txt

real	0m3.347s
user	0m3.246s
sys	0m0.100s

and for second version of reader:

[user@host tmp]$ g++ -o bar2 -Wall bar2.cpp && time ./bar2 foo.txt 

real	0m0.632s
user	0m0.309s
sys	0m0.321s

I know the OFF format is actually more complicated than these simple parsers can handle, but it is still impossible for it to take minutes to load, and you can also see that everything is better than scanning the file line by line. Of course, the bar2.cpp code is really naive, and there exist many better ways to improve the parser; as pointed out by SPWorley, I was wrong in my previous message, and you could even try to employ CUDA for this (albeit, as the OFF format is somewhat irregular, I don’t think this would help much here; but I guess this approach could be a big help on those record-based textual file formats usually created by some old Fortran codes…).

EDIT: Hmm, it seems I’m really unable to make attachments work on this forum, so here are the mentioned files in-line:

foo.c

#include <stdio.h>

#define NVERT 4000000

int
main(int argc, char **argv)
{
	FILE *file;
	int i;

	if (argc < 2)
		return 0;
	file = fopen(argv[1], "w");
	fprintf(file, "OFF\n");
	fprintf(file, "%d %d %d\n", NVERT, NVERT / 3, 0);
	for (i = 0; i < NVERT; ++i)
		fprintf(file, "%f %f %f\n", 0.123456f, 0.123456f, 0.123456f);
	for (i = 0; i < NVERT / 3; ++i)
		fprintf(file, "%d %d %d %d\n", 3, 3 * i, 3 * i + 1, 3 * i + 2);
	fclose(file);
	return 0;
}

bar1.c

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	FILE *file;
	char header[4];
	int nvert;
	float *pvert;
	int ntri;
	int *ntrivert;
	float **ptrivert;
	int i;
	int j;

	if (argc < 2)
		return 0;
	file = fopen(argv[1], "r");
	fscanf(file, "%3s", header);
	fscanf(file, "%d%d%*d", &nvert, &ntri);	/* third field (edge count) skipped */
	pvert = (float *) malloc(nvert * 3 * sizeof(float));
	for (i = 0; i < nvert; ++i)
		fscanf(file, "%f%f%f", &pvert[3 * i], &pvert[3 * i + 1],
		       &pvert[3 * i + 2]);
	ntrivert = (int *) malloc(ntri * sizeof(int));
	ptrivert = (float **) malloc(ntri * sizeof(float *));
	for (i = 0; i < ntri; ++i) {
		fscanf(file, "%d", &ntrivert[i]);
		ptrivert[i] = (float *) malloc(ntrivert[i] * sizeof(float));
		for (j = 0; j < ntrivert[i]; ++j)
			fscanf(file, "%f", &ptrivert[i][j]);
	}
	fclose(file);

	/* here, do something with the mesh */

	free(pvert);
	free(ntrivert);
	for (i = 0; i < ntri; ++i)
		free(ptrivert[i]);
	free(ptrivert);
	return 0;
}

bar2.cpp

#include <fstream>
#include <sstream>
#include <string>

int
main(int argc, char **argv)
{
	if (argc < 2)
		return 0;
	std::ifstream file(argv[1]);
	file.seekg(0, std::ios::end);
	size_t size = file.tellg();
	file.seekg(0, std::ios::beg);
	char *buffer = new char[size + 1];
	file.read(buffer, size);
	buffer[size] = 0;
	std::istringstream stream(buffer);
	std::string header;
	stream >> header;
	int nvert;
	int ntri;
	int nedg;	/* edge count in the OFF header, read but unused */
	stream >> nvert >> ntri >> nedg;
	float *pvert = new float[3 * nvert];
	for (int i = 0; i < 3; ++i)
		stream >> pvert[3 * i] >> pvert[3 * i + 1] >> pvert[3 * i + 2];
	int *ntrivert = new int[ntri];
	float **ptrivert = new float *[ntri];
	for (int i = 0; i < ntri; ++i) {
		stream >> ntrivert[i];
		ptrivert[i] = new float[ntrivert[i]];
		for (int j = 0; j < ntrivert[i]; ++j)
			stream >> ptrivert[i][j];
	}
	delete[] buffer;
	/* here, do something with the mesh */
	delete[] pvert;
	delete[] ntrivert;
	for (int i = 0; i < ntri; ++i)
		delete[] ptrivert[i];
	delete[] ptrivert;
	return 0;
}

thanks!

I’m going to check this deeply and will answer back within the week, I hope.

The software that was given to me has serious memory leaks; I’m going to have to repair them as first priority, then keep developing.

I tested your examples:

  1. generated the mesh with the mesh-generator (140MB file)

  2. ran bar1

neoideo@neoideo-desktop:~/Galaxia/loadMesh$ time ./bar1 mallaGrande.off

real	0m3.906s
user	0m3.816s
sys	0m0.088s

–OK

  3. ran bar2 on the same mesh
neoideo@neoideo-desktop:~/Galaxia/loadMesh$ g++ -o bar2 -Wall bar2.cpp && time ./bar2 mallaGrande.off 

141555570 bytestransfered
Parsing Mesh.....
OK

real	0m6.864s
user	0m6.560s
sys	0m0.300s

note: I had to fix a line in your bar2.cpp code; in the first for statement I guess you meant

for (int i = 0; i < nvert; ++i)

(nvert instead of the constant “3”). That’s why your time shows so low, I think.

I don’t get why parsing from memory (bar2) is taking longer than parsing from disk (bar1): almost double the time?

Test the bar2 code I provide here and let me know your times :)

bar2.cpp

#include <fstream>
#include <sstream>
#include <string>
#include <iostream>

int
main(int argc, char **argv)
{
	if (argc < 2)
		return 0;
	std::ifstream file(argv[1]);
	file.seekg(0, std::ios::end);
	size_t size = file.tellg();
	file.seekg(0, std::ios::beg);
	char *buffer = new char[size + 1];
	file.read(buffer, size);
	std::cout << size << " bytes" << "transfered" << std::endl;
	buffer[size] = 0;
	std::istringstream stream(buffer);
	std::string header;
	stream >> header;
	int nvert;
	int ntri;
	int nedg;
	stream >> nvert >> ntri >> nedg;
	//std::cout << "verts " << nvert << std::endl << "faces " << ntri << std::endl << "edges " << nedg << std::endl << std::endl;
	std::cout << "Parsing Mesh....." << std::endl;
	float *pvert = new float[3 * nvert];
	for (int i = 0; i < nvert; ++i) {
		stream >> pvert[3 * i] >> pvert[3 * i + 1] >> pvert[3 * i + 2];
	}
	int *ntrivert = new int[ntri];
	float **ptrivert = new float *[ntri];
	for (int i = 0; i < ntri; ++i) {
		stream >> ntrivert[i];
		ptrivert[i] = new float[ntrivert[i]];
		for (int j = 0; j < ntrivert[i]; ++j)
			stream >> ptrivert[i][j];
	}
	std::cout << "OK" << std::endl;
	delete[] buffer;
	/* here, do something with the mesh */
	delete[] pvert;
	delete[] ntrivert;
	for (int i = 0; i < ntri; ++i)
		delete[] ptrivert[i];
	delete[] ptrivert;
	return 0;
}

I tried to find that topic, but I just found a pair of unsolved file-reading questions and some image-processing topics.

Do you have the link at hand?

I would really like to try parsing a big string into an array of numbers using CUDA.

One thing I don’t think has been mentioned yet: make sure you read the entire file into memory with one single fread operation before doing any parsing. Do NOT under any circumstances use the C++ STL iostream or text stream classes for parsing, as they are complete dogcrap for performance. I wrote a tool that works on meshes with tens of millions of vertices and polygons from OBJ text files as well, and I sped up loading by about 30x by switching from STL text streams to fread’ing the entire file as a giant buffer of bytes and doing a byte-by-byte parser loop.

A solution, if the bottleneck is due to hard disk read/save (and not the parsing algorithm), is to make a ramdisk partition and use it to save/read data during execution (do not forget to save this partition before turning off your computer).
more info

I’m transferring the whole file into memory first, but then I’m not sure how to parse correctly, since sscanf does not move the pointer like fscanf does.

Is strtok too slow for this purpose?

These are my results loading a 160MB mesh composed of

2 million vertices

4 million faces

Parsing is included; the numbers are stored in numerical arrays.

I used two methods; these are my results, please comment on them.

method 1: classic mesh reading with fscanf for every parse command.

TIME for parsing Vertexes = 2.390 secs
TIME for parsing Faces = 3.742 secs

method 2: passing the whole mesh to a memory buffer using fread, then parsing with strtok_r.

TIME Vertexes: 2.867 secs
TIME Faces: 4.727 secs

Both times are much better now compared to the beginning of this topic; I could use either of these and get what I was looking for. I simplified the parsing down to just storing the numbers in arrays, since I want them on the GPU as fast as possible.

I’m just not sure which method to use. I was almost going to pick method 2 without checking method 1, but the surprise is that method 1 is faster, which I didn’t expect…

What other parsing options are recommended for method 2? strtok is not slow, but not fast either :(

help me decide :)