There is a reason business is booming for us and has been for years: automated ebook conversions suck big time. Okay, maybe "big time" is pushing it a bit too far. After all, the automation does still save countless hours of manual labor. The problem is that until PDF to Word converters adopt fairly good artificial intelligence, a human being will be needed to review the entire document and make corrections where a computer cannot. In this article, I will show you what you need to know when converting your PDF file to a Word document (.doc or .docx).
This article assumes that you already have a Word document that was created from a .pdf file. If you are not sure, read on!
Perhaps you convert PDF files to Word documents on occasion and have never had much of a problem. If this is the case, I can virtually guarantee you that the PDF files you are working with were made from editable document files (such as Word) with very few advanced layout features (e.g., callouts or wrapped images) and not from scanned images. When you save a Word doc as a PDF file, far less information is lost, so converting that PDF back to a Word document will still produce some issues, but they are not too difficult to address, and the experience is relatively painless. Creating a PDF from a scanned book, however, is like taking a photograph of each page. The software sees the page as an image and not text. To turn that image into text, OCR (optical character recognition) software must be run on it. Assuming a clean scan of the pages, even the best OCR software at 99.9% accuracy will screw up 1 out of 1,000 words. In a 100,000-word book, this means you will have 100 messed-up words! Not very professional, and quite a nightmare.
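To make that arithmetic concrete, here is a minimal Python sketch. The function name is my own invention, and the 99% and 98% rows are hypothetical accuracy figures added only for comparison; the 99.9% row is the case described above. It treats accuracy as a per-word figure, which is a simplification.

```python
def expected_ocr_errors(word_count, accuracy):
    """Expected number of misread words, given a per-word accuracy (0 to 1)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return word_count * (1.0 - accuracy)


# 0.999 is the figure from the article; 0.99 and 0.98 are hypothetical
# comparison values for less-than-ideal scans.
for accuracy in (0.999, 0.99, 0.98):
    errors = expected_ocr_errors(100_000, accuracy)
    print(f"{accuracy:.1%} accurate: about {round(errors)} bad words")
```

Even a small drop in accuracy multiplies the number of words a human has to catch, which is why the quality of the scan matters so much.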
At the time of this writing, the OCR software used to convert scans into text does not contain enough AI (artificial intelligence) to have a good contextual understanding of words. Therefore, if part of the image looks like an "iv" to the software, it will interpret it as "iv" even though in context the letter is obviously a "w," and you end up with "We ivill succeed and we will prosper!" This is not a real brain-buster for humans, not even for an 8-year-old. Yet machines struggle and usually fail. Fortunately, this is an error that any decent spell checker would pick up since "ivill" is not a recognized word. But many errors produce recognized words, or they occur in names that the spell checker ignores.
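The limits of a spell checker are easy to see in a rough Python sketch. Everything here is an assumption made for illustration: the tiny word list stands in for a real dictionary, the function name is made up, and the "modem" for "modern" mix-up is my own example of an OCR error that produces a real word.

```python
import re

# Tiny stand-in word list (hypothetical); a real spell checker would use
# a full dictionary.
KNOWN_WORDS = {
    "we", "will", "succeed", "and", "prosper",
    "the", "modern", "modem", "world",
}


def unknown_words(text, known):
    """Return the words in text that are not in the known-word set."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w.lower() not in known]


# "ivill" is not a word, so it gets flagged.
print(unknown_words("We ivill succeed and we will prosper!", KNOWN_WORDS))

# "modem" is a real word, so a misread of "modern" goes unnoticed.
print(unknown_words("the modem world", KNOWN_WORDS))
```

The first call flags "ivill" and the second flags nothing at all, which is exactly the kind of error only a human reading in context will catch.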
Machines also fail because of poor-quality scans or images, small text, unorthodox fonts, and a rather limited library of knowledge about how to recognize letters. This is where the human mind excels. This failure on the machine's part is the reason that anti-spam form software (often referred to as "Captcha") works so well. It is (usually) easy for the human eye to detect the characters but virtually impossible for machines.
Now that you have your Word document that was created from a PDF, here is what you need to do in addition to the standard formatting that you would otherwise do for a Word document before converting it to an ebook. Let me stress that you should read every word in the document to ensure it is correct. If you were scanning hundreds of books for free public access, this level of proofing would clearly be overkill, but if this is your book that you are selling online (i.e., people are paying money for it), you owe it to your readers to ensure they are buying an error-free (or virtually error-free) book.
If the document is a real mess, we often use what we call the "nuclear" option to remove all the formatting. We call it this because it's like nuking a city and starting over from scratch. What you will have is a plain text document with all of the words and none of the formatting (you still need to fix the errors with the incorrect words). Here is the process:
PDF to Word conversions do not have to be a nightmare, even if from a scanned source. They do take time, however. If you are willing to put in the time, you can have a wonderful-looking, working document ready to be converted to an ebook. If you're not willing to put in the time or deal with the many issues that can arise from a PDF to Word conversion and would rather pay someone to deal with this, well, that is why we're in business!