Google Uses OCR Technology To Index Scanned Documents

"In the past, scanned documents were rarely included in search results as we couldn't be sure of their content. We had occasional clues from references to the document—so you might get a search result with a title but no snippet highlighting your query. Today that changes," wrote Evin Levey, a product manager at Google, in a blog post on the company's Web site.

Scanned documents are more difficult to index than documents saved as PDFs, according to Google, because scans might include the ring of a coffee cup, ink smudges or fold creases in the paper. "This technology lets us convert a picture (of a thousand words) into a thousand words—words that can be searched and indexed so that these valuable documents are more easily found," Levey wrote.

"To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter 'O', just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process."
