I wrote my book word by word, no AI involved. An editor helped me develop the story and a copy-editor made sure the manuscript was clean. I’ve read my book about a dozen times. Then my layout person gave me the final version of the book and I realized I had to read the whole thing again to check for new errors.
First I did it properly. My eyes were basically blind by the end. But I wanted a second sweep. The thing is, any person asked to do the job will make a mistake. They’ll overlook something. They won’t notice that one paragraph has been pasted in twice, or that a space between two sentences has been accidentally cut. What I needed was a perfect sweep. A complete comparison between my original manuscript and the final EPUB document. The kind of sweep that could only be performed by a soulless machine with an inflexible view of correct and incorrect.
When I’m not writing I’m coding, and this kind of repetitive, detail-oriented, clearly defined task is the perfect fit for a machine. In fact, it was such a perfect fit, the whole process only took an hour.
What Did The Machine Do?
First I defined my requirements. This code was written to spot exactly one type of problem: copy-and-paste mistakes made by the layout person. It’s not going to spot typos, it’s not going to spot grammar issues, and it’s certainly not going to point out plot holes. This machine is very stupid, but it performs its job to the letter.
Manuscript format: DOCX
Final Book Layout format: EPUB
Goal: Review every sentence in the EPUB and DOCX files and identify any sentence present in one file but missing from the other; this should capture any omissions, insertions, or errors in the final manuscript. Then identify whether any sentence appears in the same file more than once; this should catch any ‘duplicate chapter’ or ‘duplicate paragraph’ problems.
The complete code will be shown at the end in case you want to use it, but first I’ll walk you through the parts.
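(If you want to run it yourself, the script leans on four third-party packages: python-docx, which provides the docx import, plus EbookLib, beautifulsoup4, and nltk.)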
Step 1: Parse the DOCX Manuscript
import docx

def extract_text_from_docx(docx_path):
    doc = docx.Document(docx_path)
    full_text = []
    for para in doc.paragraphs:
        if para.text.strip():  # skip empty paragraphs
            full_text.append(para.text.strip())
    return '\n'.join(full_text)
This code is pretty straightforward: it pulls the text out of every non-empty paragraph in the .docx file and joins it all together into one big paragraphless blob.
Step 2: Parse the EPUB Book
This code is almost identical to the DOCX version, but EPUB has a lot more nuance to its data types. We have to make sure we only retrieve the actual text items, then parse them out of HTML into plain text. Then we join it all together into one big wall of book.
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def extract_text_from_epub(epub_path):
    book = epub.read_epub(epub_path)
    text_content = []
    for item in book.get_items():
        if item.get_type() == ebooklib.ITEM_DOCUMENT:
            soup = BeautifulSoup(item.get_content(), 'html.parser')
            # Remove scripts and styles
            for tag in soup(['script', 'style']):
                tag.decompose()
            text = soup.get_text(separator=' ', strip=True)
            if text:
                text_content.append(text)
    return '\n'.join(text_content)
Step 3: Split the book-blobs into sentences
This part uses a tool called the Natural Language Toolkit (NLTK). Sometimes what NLTK considers a sentence is a little funny, like it’ll join two short pieces of quoted dialogue together. But we cannot allow perfect to be the enemy of good, and as long as NLTK is responsible for splitting both files, the quirks land on both sides equally and the comparison still works.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')
nltk.download('punkt_tab')

def split_text_into_sentences(text):
    return sent_tokenize(text)
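If you’re curious about how the tokenizer behaves on dialogue, it’s easy to poke at it with a made-up line before pointing it at a whole book:

sample = 'He looked up. "Ready?" "Ready." They walked on.'
print(split_text_into_sentences(sample))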
Step 4: Data cleanup
You’ll notice some really long chains of character replacements in the next two functions. Turns out the docx parser picks up a few too many newlines and the epub parser likes directional quotes, so all of that gets normalized into nice, consistent sentences.
from typing import List

def docx_scan():
    docx_path = "FILENAME.docx"
    text = extract_text_from_docx(docx_path)
    sentences: List[str] = split_text_into_sentences(text)
    for i, val in enumerate(sentences):
        sentences[i] = val.replace('\n', ' ').replace('“', '"').replace('”', '"').replace("‘", "'").replace("’", "'").replace("\'", "'")
    return sentences
def epub_scan():
    epub_path = 'FILENAME.epub'
    text = extract_text_from_epub(epub_path)
    sentences: List[str] = split_text_into_sentences(text)
    for i, val in enumerate(sentences):
        sentences[i] = val.replace('\n', ' ').replace('“', '"').replace('”', '"').replace("‘", "'").replace("’", "'").replace("\'", "'")
    return sentences
Step 5: Crawl through the two books
This is a bit of a doozy, but the function essentially crawls through the final book looking for the next sentence from the manuscript. If it doesn’t find it within the next 10 sentences, it reports the sentence as missing and moves on.
Note: The original draft of this post had a different algorithm that failed to account for sentence order. There’s nothing a programmer does more than tinker with their code, but this function is a big improvement on the original, trust me.
def compare_books(manuscript: List[str], final_book: List[str]):
    # We sweep through final_book searching for sentences from manuscript
    book_1_pos: int = 0
    book_2_pos: int = 0
    while book_1_pos < len(manuscript):
        found: bool = False
        target_sentence: str = manuscript[book_1_pos]
        # Look for the target sentence within the next 10 sentences of the final book
        for sweep_position in range(book_2_pos, book_2_pos + 10):
            if sweep_position < len(final_book) and target_sentence == final_book[sweep_position]:
                book_1_pos += 1
                book_2_pos = sweep_position
                found = True
                break
        if not found:
            # Not found nearby: report it and move on (sentences containing ' - ' are skipped)
            book_1_pos += 1
            if ' - ' not in target_sentence:
                print(target_sentence)
And because of the way the function is written, we can actually crawl through both books the same way.
epub_sentences = epub_scan()
docx_sentences = docx_scan()
# Check the epub file for errors
compare_books(docx_sentences, epub_sentences)
# Check the docx file for errors
compare_books(epub_sentences, docx_sentences)
There are ~8,000 sentences in my book. Since the computer reads both copies twice, that’s only about 32,000 sentence reads, each with a small search window. A very cheap, less-than-one-second scan for errors.
All the differences are then written out to a file. There were a bunch of false positives. Of the 54 reported omissions, 4 sentences turned out to contain errors; the rest were quirks of the epub format. But finding real errors means it’s working! And it means my layout person did a fantastic job!
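The walkthrough version of compare_books just prints each miss to the console, so to get a report file you can either redirect the script’s output or have the function collect the misses and write them out itself. Here’s a minimal sketch of the second approach, reusing the functions above; the report_path argument is just an example name:

def compare_books_to_file(manuscript: List[str], final_book: List[str], report_path: str):
    # Same sweep as compare_books, but collect the misses instead of printing them
    misses: List[str] = []
    book_2_pos = 0
    for target_sentence in manuscript:
        window = final_book[book_2_pos:book_2_pos + 10]
        if target_sentence in window:
            book_2_pos += window.index(target_sentence)
        elif ' - ' not in target_sentence:
            misses.append(target_sentence)
    with open(report_path, 'w', encoding='utf-8') as report:
        report.write('\n'.join(misses))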
Step 6: Check for duplicates
Finally, we do a quick check of both sentence lists for duplicates. The results here reveal my laziness as an author: it turns out I have ~90 non-unique sentences in my book. Most are ‘He said’, ‘She said’, or ‘He nodded’, but the strangest one was “Alpha, Golf, Delta, Charlie.”, a list of squadrons referenced in that exact order on two different occasions.
non_unique_docx = set([x for x in docx_sentences if docx_sentences.count(x) > 1])
non_unique_epub = set([x for x in epub_sentences if epub_sentences.count(x) > 1])
print(f"Docx copies: {len(non_unique_docx)}")
print(f"Epub copies: {len(non_unique_epub)}")
I verified that the total number of non-unique sentences was identical in the DOCX and EPUB formats and moved on.
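The count-based check above is quadratic, which is fine at ~8,000 sentences, but if you want it faster, or you want to see which duplicated sentences differ between the two files rather than just comparing totals, collections.Counter from the standard library makes both easy. A quick sketch:

from collections import Counter

docx_counts = Counter(docx_sentences)
epub_counts = Counter(epub_sentences)

# Sentences appearing more than once in each file
docx_dupes = {s for s, n in docx_counts.items() if n > 1}
epub_dupes = {s for s, n in epub_counts.items() if n > 1}

# Duplicates that exist in one format but not the other
print(docx_dupes - epub_dupes)
print(epub_dupes - docx_dupes)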
Conclusion
I always felt a little uneasy about the final version of my book. Even when I had been through it myself, I couldn’t be sure I hadn’t overlooked a massive error. I still can’t be completely sure, but there’s something really reassuring about having a machine do a run-through. When precision is the aim, somehow the passionless report of a calculator is more comforting than a thumbs-up from a professional.
Complete File:
import docx
import nltk
from nltk.tokenize import sent_tokenize
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
from typing import List
nltk.download('punkt')
nltk.download('punkt_tab')
def extract_text_from_docx(docx_path):
    doc = docx.Document(docx_path)
    full_text = []
    for para in doc.paragraphs:
        if para.text.strip():  # skip empty paragraphs
            full_text.append(para.text.strip())
    return '\n'.join(full_text)
def split_text_into_sentences(text):
    return sent_tokenize(text)
def docx_scan():
    docx_path = "YOURFILE.docx"
    text = extract_text_from_docx(docx_path)
    sentences: List[str] = split_text_into_sentences(text)
    for i, val in enumerate(sentences):
        sentences[i] = val.replace('\n', ' ').replace('“', '"').replace('”', '"').replace("‘", "'").replace("’", "'").replace("\'", "'")
    return sentences
def extract_text_from_epub(epub_path):
    book = epub.read_epub(epub_path)
    text_content = []
    for item in book.get_items():
        if item.get_type() == ebooklib.ITEM_DOCUMENT:
            soup = BeautifulSoup(item.get_content(), 'html.parser')
            # Remove scripts and styles
            for tag in soup(['script', 'style']):
                tag.decompose()
            text = soup.get_text(separator=' ', strip=True)
            if text:
                text_content.append(text)
    return '\n'.join(text_content)
def epub_scan():
    epub_path = 'YOURFILE.epub'
    text = extract_text_from_epub(epub_path)
    sentences: List[str] = split_text_into_sentences(text)
    for i, val in enumerate(sentences):
        sentences[i] = val.replace('\n', ' ').replace('“', '"').replace('”', '"').replace("‘", "'").replace("’", "'").replace("\'", "'")
    return sentences
def compare_books(manuscript: List[str], final_book: List[str]):
    # We sweep through final_book searching for sentences from manuscript
    book_1_pos: int = 0
    book_2_pos: int = 0
    while book_1_pos < len(manuscript):
        found: bool = False
        target_sentence: str = manuscript[book_1_pos]
        # Look for the target sentence within the next 10 sentences of the final book
        for sweep_position in range(book_2_pos, book_2_pos + 10):
            if sweep_position < len(final_book) and target_sentence == final_book[sweep_position]:
                book_1_pos += 1
                book_2_pos = sweep_position
                found = True
                break
        if not found:
            # Not found nearby: report it and move on (sentences containing ' - ' are skipped)
            book_1_pos += 1
            if ' - ' not in target_sentence:
                print(target_sentence)
def main():
    epub_sentences = epub_scan()
    docx_sentences = docx_scan()
    # Check the epub file for errors
    compare_books(docx_sentences, epub_sentences)
    # Check the docx file for errors
    compare_books(epub_sentences, docx_sentences)
    non_unique_docx = set([x for x in docx_sentences if docx_sentences.count(x) > 1])
    non_unique_epub = set([x for x in epub_sentences if epub_sentences.count(x) > 1])
    print(f"Docx copies: {len(non_unique_docx)}")
    print(f"Epub copies: {len(non_unique_epub)}")

if __name__ == '__main__':
    main()