What is the best way to parse PDF files into text files for later NLP applications? So far I have found few parsers that handle PDFs from different sources well. Some PDF documents can be parsed perfectly with open-source Python tools, but others cannot: the output for these problematic files can have many formatting issues, for example, text appearing in a different order than in the PDF view.
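To make the ordering issue concrete, here is a minimal sketch of the kind of fix I have in mind. It uses only the standard library and made-up data: the `Fragment` type, the coordinates, and the `line_tolerance` value are all hypothetical and do not come from any particular parser. It assumes the extractor can report a position for each piece of text, groups pieces into lines by vertical position, and then sorts each line left to right.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical piece of extracted text with its position on the page."""
    text: str
    x: float  # left edge of the fragment
    y: float  # distance from the top of the page


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by y, then sort each line left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: f.y):
        # Compare against the first fragment of the current line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    )


# Made-up fragments, listed in the scrambled order a parser might emit them.
fragments = [
    Fragment("world", 60.0, 10.0),
    Fragment("Second", 10.0, 30.5),
    Fragment("Hello", 10.0, 10.0),
    Fragment("line", 70.0, 30.0),
]

raw = " ".join(f.text for f in fragments)  # order as emitted
fixed = reading_order(fragments)           # order as seen on the page

print(raw)
print(fixed)
```

This only covers the simple single-column case. It would not handle multi-column layouts, tables, or rotated text, which is part of why I am asking.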
Please let me know whether there are any good solutions.
Thanks in advance!