PDF To Word: Extremely Important Conversion Tips

by Bo Bennett, PhD, Founder of eBookIt
posted Tuesday Jun 28, 2016 12:27 PM

About Bo Bennett, PhD

I started eBookIt.com back in 2010, because, as an author, I was frustrated with the lack of options for e-publishing. We have helped thousands of clients publish their books over the years, and we are looking forward to helping thousands more.

There is a reason business is booming for us and has been for years: automated ebook conversions suck big time. Okay, maybe "big time" is pushing it a bit too far; after all, the automation still saves countless hours of manual labor. The problem is that until PDF to Word converters adopt fairly good artificial intelligence, a human being will be needed to review the entire document and make corrections where a computer cannot. In this article, I will show you what you need to know when converting your PDF file to a Word document (.doc or .docx).

This article assumes that you already have a Word document that was created from a .pdf file. If you are not sure, read on!

PDF to Word: Relatively Painless Experience... or Nightmare

Perhaps you convert PDF files to Word documents on occasion and have never had much of a problem. If this is the case, I can virtually guarantee you that the PDF files you are working with were made from editable document files (such as Word) with very few advanced layout features (e.g., callouts, wrapped images, etc.) and not from scanned images. When you save a Word doc as a PDF file, far less information is lost. Converting that PDF back to a Word document will still cause some issues, but they are not too difficult to address, and thus the experience is relatively painless. But creating a PDF from a scanned book is like taking a photograph of each page. The software sees the page as an image and not as text. To turn the image into text, OCR (optical character recognition) software must be run on it. Assuming a clean scan of the pages, even the best OCR software at 99.9% accuracy will screw up 1 out of every 1,000 words. In a 100,000-word book, this means you will have 100 messed-up words! Not very professional, and quite a nightmare.
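To make that arithmetic easy to repeat for your own book, here is a minimal Python sketch. It assumes the accuracy figure applies per word, which is how the example above treats it; the function name `expected_word_errors` is just an illustrative choice.

```python
def expected_word_errors(word_count: int, accuracy: float) -> float:
    """Expected number of misread words, assuming accuracy is measured per word."""
    return word_count * (1.0 - accuracy)


# The example from the text: a 100,000-word book at 99.9% accuracy.
print(round(expected_word_errors(100_000, 0.999)))
```

Swap in your own word count and the accuracy your OCR software claims to get a rough idea of how many errors are waiting for you.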

[Image: PDF to Word conversion example]

Why Machines Fail and Humans Are Needed

At the time of this writing, OCR software used to convert scans into text does not contain enough AI (artificial intelligence) to have a good contextual understanding of words. Therefore, if part of the image looks like "iv" to the software, it will interpret it as "iv", even though the context makes it obvious that the letter is a "w", as in "We ivill succeed and we will prosper!" This is not a real brain-buster for humans, not even for an 8-year-old. Yet machines struggle and usually fail. Fortunately, this is an error that any decent spell checker would pick up, since "ivill" is not a recognized word. But many errors produce recognized words, or occur in names that the spell checker ignores.
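The spell-checker idea can be sketched in a few lines of Python. This is only an illustration: `KNOWN_WORDS` is a tiny, hypothetical word list standing in for a real dictionary, and the function name is my own. As noted above, a check like this catches "ivill" but not an error that happens to be a real word.

```python
import re
from typing import List, Set

# Hypothetical, tiny word list for illustration only.
# A real check would load a full dictionary.
KNOWN_WORDS = {"we", "will", "succeed", "and", "prosper"}


def find_unrecognized_words(text: str, known_words: Set[str]) -> List[str]:
    """Return the words in text that are not in known_words (case-insensitive)."""
    words = re.findall(r"[A-Za-z']+", text)
    return [word for word in words if word.lower() not in known_words]


print(find_unrecognized_words("We ivill succeed and we will prosper!", KNOWN_WORDS))
```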

[Image: CAPTCHA showing how PDF to DOC conversions fail]

Another reason machines fail is poor-quality scans or images, small text, and unorthodox fonts. The software has a rather limited library of letter shapes, and it often cannot match what it sees against that library. This is where the human mind excels. This failure on the machine's part is the reason that anti-spam form tests (often referred to as "CAPTCHA") work so well. It is (usually) easy for the human eye to detect the characters but virtually impossible for machines.

Proofing Your PDF to Word Conversion

Now that you have your Word document that was created from a PDF, here is what you need to do in addition to the standard formatting that you would otherwise do for a Word document before converting it to an ebook. Let me stress that you should read every word in the document to ensure it is correct. If you were scanning hundreds of books for free public access, this level of proofing would clearly be overkill, but if this is your book that you are selling online (i.e., people are paying money for it), you owe it to your readers to ensure they are buying an error-free (or virtually error-free) book.

  • Look for incorrect words. Often OCR and even the standard PDF to Word conversion algorithms will misread two adjacent letters as a single, different letter. For example, "Li" can be seen as "U". Once you find one of these errors, it might be worth it to do a global search and replace. So you might want to replace all instances of "Ught" with "Light" (since "Ught" is not a word).
  • Fix line breaks. PDF to Word converters are notorious for not knowing where line breaks are supposed to go, and putting them in places where they don't belong. One of the best ways to detect these line breaks is by turning on the "show invisibles" option, or by changing the font size so that the text reflows and the misplaced breaks stand out.
  • Fix hyphenated words. If a word is hyphenated because it was split across two lines, the PDF to Word software generally does not know if the hyphen needs to be there or not, so it keeps it. So a word like "insti-tution" might appear on one line, which is not something you want.
  • Fix multiple spaces. You will find words separated by multiple spaces all throughout the document. To get rid of these, use find and replace. Start with finding 20 spaces and replacing them with one space, then 19, then 18, and so on down to two.
  • Restore missing formatting. OCR often misses bold and italic formatting, as well as mixed upper and lower case.
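If you are comfortable with a little scripting, some of the fixes above can be sketched in Python. Treat this as a rough illustration on plain text, not a replacement for reading every word: the function names and the sample sentence are my own, and the hyphen check only lists candidates, because a script cannot tell "insti-tution" from a word like "well-known" that really needs its hyphen.

```python
import re
from typing import List


def replace_misread(text: str, wrong: str, right: str) -> str:
    """Case-sensitive global search and replace for one confirmed misreading."""
    return text.replace(wrong, right)


def find_hyphen_candidates(text: str) -> List[str]:
    """List words with a hyphen between lowercase letters, for manual review."""
    return re.findall(r"\b[a-z]+-[a-z]+\b", text)


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


sample = "The Ught  was on at the insti-tution,     a well-known place."
cleaned = collapse_spaces(replace_misread(sample, "Ught", "Light"))
print(cleaned)
print(find_hyphen_candidates(cleaned))
```

The single regular expression in `collapse_spaces` does in one pass what the 20, 19, 18... find-and-replace routine does by hand.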

Go Nuclear

If the document is a real mess, we often use what we call the "nuclear" option to remove all the formatting. We call it this because it's like nuking a city and starting over from scratch. What you will have is a plain text document with all of the words and none of the formatting (you still need to fix the errors with the incorrect words). Here is the process:

  1. Open your Word document, choose "Select All" from the "Edit" menu, and copy the selection.
  2. Open a new plain text file using Notepad, TextEdit, or another plain text editor.
  3. Paste everything into the plain text editor.
  4. If you clearly have many line breaks where they should not be, do a global search and replace for all line breaks and replace them with a space. Depending on your OS and text editor, the way to do this will vary (google it!).
  5. Reconstruct your document using the physical book or PDF scanned source as a visual guide.
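Step 4 can also be done with a short script instead of a text editor. Here is a minimal Python sketch that works on text you have already pasted out of Word. The first function does exactly what step 4 describes. The second is a variation of my own, not part of the process above: it assumes blank lines mark real paragraph breaks and keeps them.

```python
import re


def unwrap_line_breaks(text: str) -> str:
    """Replace every line break (Windows, old Mac, or Unix style) with a space."""
    return re.sub(r"\r\n|\r|\n", " ", text)


def unwrap_keeping_paragraphs(text: str) -> str:
    """Variation (assumption): treat blank lines as paragraph breaks and keep them.

    Line breaks inside a paragraph become single spaces; runs of spaces
    inside a paragraph are collapsed as a side effect.
    """
    normalized = re.sub(r"\r\n|\r", "\n", text)
    paragraphs = re.split(r"\n\s*\n", normalized)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


pasted = "First line\nsecond line.\n\nNext\nparagraph."
print(unwrap_line_breaks(pasted))
print(unwrap_keeping_paragraphs(pasted))
```

Either way, you still need step 5: the script only removes breaks, it does not know what the page is supposed to look like.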

PDF to Word conversions do not have to be a nightmare, even if from a scanned source. It does take time, however. If you are willing to put in the time, you can have a wonderful looking and working document ready to be converted to an ebook. If you're not willing to put in the time or deal with the many issues that can arise from a PDF to Word conversion and would rather pay someone to deal with this, well, that is why we're in business!

