
Need help with regex for making ePub files from archive.org books


epubl


I am in the process of making ePub files from out-of-copyright scanned books from Archive.org.
Some problems with the scanned text files are:
1. There are stray spaces inside words. For example, "impossible" appears as "im possible".
2. There is no table of contents.
So I would like to create some kind of program or regex that automatically rejoins words that have been split and automatically creates a table of contents.
How might I do this?

It takes a mind that can understand context to figure out which words are meant to be fused and which are not. An algorithm would have to understand the language, its grammatical structures and, in some cases, the meaning of the words to tell whether it makes sense to join them or keep them separate.
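
That said, a rough dictionary-based heuristic can catch the easy cases without understanding the language. The following is only a sketch, not a full solution: `rejoin_split_words` and `KNOWN_WORDS` are made-up names for illustration, and the tiny word set stands in for a real dictionary file that you would have to supply. It joins two neighbouring tokens only when the joined form is a known word and at least one of the pieces is not a word on its own, so pairs like "in deed" (where both readings are valid) are left alone for a human to review.

```python
# Hypothetical, tiny word list for illustration only.
# In practice you would load a large dictionary file for the book's language.
KNOWN_WORDS = {
    "it", "was", "to", "see", "the",
    "possible", "impossible", "in", "deed", "indeed",
}


def rejoin_split_words(text, known_words):
    """Join adjacent tokens when the joined form is a known word and
    at least one of the two fragments is not a known word by itself."""
    # Note: split() collapses all whitespace, including line breaks.
    tokens = text.split()
    result = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            left, right = tokens[i], tokens[i + 1]
            joined = (left + right).lower()
            both_are_words = (
                left.lower() in known_words and right.lower() in known_words
            )
            # Only purely alphabetic tokens are considered, so fragments
            # with attached punctuation are skipped in this sketch.
            if (
                left.isalpha()
                and right.isalpha()
                and joined in known_words
                and not both_are_words
            ):
                result.append(left + right)
                i += 2
                continue
        result.append(tokens[i])
        i += 1
    return " ".join(result)


print(rejoin_split_words("It was im possible to see", KNOWN_WORDS))
print(rejoin_split_words("in deed", KNOWN_WORDS))
```

With the sample word list, the first call rejoins "im possible" and the second leaves "in deed" untouched, because both "in" and "deed" are valid words by themselves. Those ambiguous cases are exactly where context is needed, so it is safer to log them for manual checking than to merge them automatically.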
