epubl, posted February 18, 2014

I am in the process of making ePub files from out-of-copyright scanned books from Archive.org. The scanned text files have some problems:

1. There are a lot of stray spaces inside words. For example, "impossible" appears as "im possible".
2. There is no table of contents.

So I would like to create some kind of program or regex that would automatically rejoin words that have been split and automatically create a table of contents. How might I do this?
epubl, posted February 24, 2014

bump
Ingolme, posted February 24, 2014

It takes a mind that can understand context to figure out which words are meant to be fused and which are not. It would take an algorithm that understands a language, its grammatical structures and, in some cases, the meaning of words to decide whether it makes sense to join them or keep them separate.
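A partial workaround, short of real language understanding, is a dictionary lookup: join two neighbouring fragments only when the fused form is a known word and at least one fragment is not a word by itself. The sketch below is only an illustration of that heuristic. The function name `join_split_words` and the tiny `KNOWN_WORDS` set are hypothetical; a real run would load a full word list for the book's language.

```python
# Hypothetical, deliberately tiny word list; replace with a full dictionary.
KNOWN_WORDS = {
    "impossible", "possible", "indeed", "deed",
    "it", "is", "to", "be", "in",
}

def _core(token):
    """Lowercase a token and strip surrounding punctuation for lookup."""
    return token.strip(".,;:!?\"'()").lower()

def join_split_words(text, known_words=KNOWN_WORDS):
    """Fuse adjacent fragments such as 'im possible' into 'impossible'.

    Two neighbouring tokens are joined only if the fused form is a known
    word AND at least one of the fragments is not a known word. Note that
    runs of whitespace (including line breaks) are collapsed to one space.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            left, right = tokens[i], tokens[i + 1]
            fused = left + right
            fused_is_word = _core(fused) in known_words
            both_are_words = (_core(left) in known_words
                              and _core(right) in known_words)
            if fused_is_word and not both_are_words:
                out.append(fused)
                i += 2  # skip the fragment that was consumed
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)

print(join_split_words("It is im possible to be in deed"))
```

In this example "im possible" is fused because "im" is not in the word list, while "in deed" is left alone because both halves are words in their own right. That second case is exactly the ambiguity described above: the heuristic cannot tell whether "indeed" was intended, so it errs on the side of not joining.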
epubl, posted March 1, 2014

Could a regex be written so that a heading 2 or 3 is applied to any line that is separated from the other blocks of text (paragraphs) and begins with "chapter"?
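A minimal sketch of that idea, assuming paragraphs in the scanned text are separated by blank lines and that a chapter heading is a one-line block starting with the word "chapter" in any letter case. The function name `mark_chapters` and the `ch1`, `ch2`, ... anchor ids are made up for illustration. It returns both the marked-up HTML and a list of (anchor, title) pairs that could feed a table of contents.

```python
import re
from html import escape

# A block counts as a heading if it starts with the whole word "chapter".
CHAPTER_RE = re.compile(r"^chapter\b", re.IGNORECASE)

def mark_chapters(text, level=2):
    """Wrap standalone 'Chapter ...' lines in <hN> and other blocks in <p>.

    Returns (html, toc), where toc is a list of (anchor_id, title) tuples.
    """
    blocks = re.split(r"\n\s*\n", text.strip())  # split on blank lines
    html_parts = []
    toc = []
    for block in blocks:
        block = block.strip()
        if not block:
            continue
        if "\n" not in block and CHAPTER_RE.match(block):
            anchor = f"ch{len(toc) + 1}"
            toc.append((anchor, block))
            html_parts.append(
                f'<h{level} id="{anchor}">{escape(block)}</h{level}>'
            )
        else:
            # Re-flow a multi-line paragraph onto a single line.
            html_parts.append(f"<p>{escape(' '.join(block.split()))}</p>")
    return "\n".join(html_parts), toc

sample = "CHAPTER I\n\nIt was a dark night.\nThe wind blew.\n\nChapter II\n\nMorning came."
html, toc = mark_chapters(sample)
print(html)
print(toc)
```

Passing `level=3` would produce heading 3 instead of heading 2. Scans that number chapters without the word "chapter" (for example a bare "IV.") would need a different pattern.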