Needing help with regex in making ePub files out of archive.org books

epubl · February 18, 2014

I am in the process of making ePub files from out-of-copyright scanned books from Archive.org.

The scanned text files have some problems:

1. Words are often split by stray spaces. For example, "impossible" appears as "im possible".

2. There is no table of contents.

3.

So I would like to create some kind of program or regex that automatically joins words that have been split and automatically creates a table of contents.

How might I do this?

epubl · February 24, 2014

bump

Ingolme · February 24, 2014

It takes a mind that can understand context to figure out which words are meant to be fused and which are not. It would take an algorithm that understands a language, its grammatical structures and, in some cases, the meaning of words to see whether it makes sense to have them joined or separated.
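For the unambiguous cases, a word list can stand in for some of that context. The following is only a rough sketch in Python, not a complete solution: the function name `join_split_words` and the tiny `KNOWN_WORDS` set are made up for illustration, and a real run would need a full dictionary file. It joins two neighbouring fragments only when the joined form is a known word and at least one fragment is not a word by itself. Pairs such as "to day", where both halves are valid words, are left alone, since those are exactly the cases that need a human reader.

```python
import string

# Hypothetical, tiny word list for illustration only.
# A real run would load a full dictionary file instead.
KNOWN_WORDS = {"it", "is", "impossible", "possible", "to", "say", "day", "today"}


def _core(token):
    """Lowercase a token and strip surrounding punctuation for lookup."""
    return token.strip(string.punctuation).lower()


def join_split_words(line, known_words=KNOWN_WORDS):
    """Join adjacent fragments on one line when only the joined form is a word.

    Runs of whitespace inside the line are collapsed to single spaces.
    """
    tokens = line.split()
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            left, right = tokens[i], tokens[i + 1]
            merged_is_word = _core(left + right) in known_words
            both_are_words = (_core(left) in known_words
                              and _core(right) in known_words)
            # Merge only the unambiguous case: the joined form is a word,
            # and at least one fragment is not a word by itself.
            if merged_is_word and not both_are_words:
                out.append(left + right)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)


print(join_split_words("It is im possible to say."))
```

This handles only words split into two pieces, and its results are only as good as the word list, so the output would still need proofreading.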

epubl · March 1, 2014

Could a regex be written so that heading 2 or 3 is applied to any line that is separated from other blocks of text (paragraphs) and begins with "chapter"?
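Something along these lines might work as a starting point. It is a sketch in Python under a few assumptions: blocks are separated by blank lines, a chapter heading is a single-line block starting with the word "chapter" in any letter case, and the output is simple HTML of the kind an ePub uses. The function name `mark_chapters` and the `ch1`, `ch2` anchor ids are invented for the example. The same pass collects the headings, which gives the entries for a table of contents.

```python
import html
import re

# A single-line block that starts with the word "chapter" (any case).
# \b keeps ordinary sentences starting with "Chapters ..." from matching.
CHAPTER_RE = re.compile(r"^chapter\b.*$", re.IGNORECASE)


def mark_chapters(text, level=2):
    """Wrap chapter lines in <hN> tags and other blocks in <p> tags.

    Returns (html_parts, toc), where toc is a list of (anchor_id, title).
    """
    html_parts = []
    toc = []
    # Blocks are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        if "\n" not in block and CHAPTER_RE.match(block):
            anchor = f"ch{len(toc) + 1}"
            toc.append((anchor, block))
            html_parts.append(
                f'<h{level} id="{anchor}">{html.escape(block)}</h{level}>'
            )
        else:
            # Rejoin hard-wrapped lines of a paragraph into one line.
            paragraph = " ".join(block.split())
            html_parts.append(f"<p>{html.escape(paragraph)}</p>")
    return html_parts, toc


sample = (
    "CHAPTER I\n\n"
    "It was a dark night.\nThe wind blew.\n\n"
    "Chapter II. The Storm\n\n"
    "Chapters are long."
)
parts, toc = mark_chapters(sample)
print("\n".join(parts))
print(toc)
```

Passing `level=3` would produce heading 3 instead. Scanned books that mark chapters with bare Roman numerals or misread the word "chapter" would need a looser pattern.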
