Pydf2Txt

Pydf2Txt is a scientific tool that converts PDF documents, scientific or regular, to plain text. It uses Natural Language Processing (NLP), Machine Learning (ML), and some heuristics to turn the PDF content into clean text. The general idea is that Pydf2Txt extracts and parses the useful information so that it can serve as input to other AI tools, which perform best when given practical, clean content.

Pros:
  • Detect sections
  • Detect two-column text
  • Clean headers and footers
  • Fix hyphenations
  • Fix line breaks, i.e. wrap lines sensibly
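
To make the hyphenation and line-break fixes more concrete, here is a minimal sketch of how such clean-up could work with plain regular expressions. It is not Pydf2Txt's actual code: the function names fix_hyphenation and fix_line_breaks are hypothetical, and the sketch assumes that paragraphs are separated by blank lines and that a word split across lines continues with a lowercase letter.

```python
import re


def fix_hyphenation(text: str) -> str:
    """Hypothetical helper: join words split across lines, e.g. "process-\\ning".

    Assumption: only a lowercase continuation is joined, so a break such as
    "non-\\nEuclidean" is left alone. Genuine compounds like "well-\\nknown"
    would still be merged incorrectly by this simple rule.
    """
    return re.sub(r"(\w)-\n([a-z])", r"\1\2", text)


def fix_line_breaks(text: str) -> str:
    """Hypothetical helper: unwrap hard line breaks inside each paragraph.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    unwrapped = (
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    )
    return "\n\n".join(unwrapped)


raw = (
    "Natural language process-\n"
    "ing helps to clean\n"
    "the extracted text.\n"
    "\n"
    "A second para-\n"
    "graph follows."
)
clean = fix_line_breaks(fix_hyphenation(raw))
print(clean)
```

A purely rule-based approach like this cannot tell a line-break hyphen from a real one, which is presumably where a dictionary or the NLP and ML components come in.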
ToDo:
  • Detect images
  • Parse tables
Results aren't perfect, but they are better than those of other similar tools.
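
As an illustration of the header and footer cleaning listed among the features, the sketch below shows one common heuristic: a line that appears as the first or last line on most pages is treated as a running header or footer and dropped. This is an assumed approach, not necessarily the one Pydf2Txt uses; the names normalize and strip_repeated_edges and the min_share parameter are hypothetical, and the input is assumed to be a list of pages, each already split into lines.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    # Replace digit runs so that "Page 3" and "Page 12" compare as equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_edges(pages, min_share=0.6):
    """Hypothetical helper: drop first/last lines repeated on most pages.

    pages: list of pages, each a list of text lines (assumed input shape).
    min_share: fraction of pages on which an edge line must recur
               before it is considered a header or footer (assumed default).
    """
    counts = Counter()
    for lines in pages:
        if lines:
            # A set is used so a one-line page is not counted twice.
            counts.update({normalize(lines[0]), normalize(lines[-1])})

    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and normalize(body[0]) in repeated:
            body = body[1:]
        if body and normalize(body[-1]) in repeated:
            body = body[:-1]
        cleaned.append(body)
    return cleaned


pages = [
    ["Journal of Examples", "Intro text.", "Page 1"],
    ["Journal of Examples", "More text.", "Page 2"],
    ["Journal of Examples", "Final text.", "Page 3"],
]
for body in strip_repeated_edges(pages):
    print(body)
```

Only the outermost line on each edge is checked here; multi-line headers would need the same test applied to a few lines at the top and bottom of each page.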