Last updated: 25.10.2006   Articles / Post-processing


A Syntax Directed Level Building Algorithm for Large Vocabulary Handwritten Word Recognition Authors: Alessandro L. Koerich, Robert Sabourin, Ching Y. Suen, Abdenaim El-Yacoubi
Organization: CEDAR, Montreal, Canada; CENPARMI, Montreal, Canada; Pontificia Universidade Catolica do Parana, Brazil
Date: 2000-2003
Pages: 12
This paper describes a large vocabulary handwritten word recognition system based on a syntax-directed level building algorithm (SDLBA) that incorporates contextual information. The sequences of observations extracted from the input images are matched against the entries of a tree-structured lexicon, where each node is represented by a 10-state character HMM. The search proceeds breadth-first and each node is decoded by the SDLBA. Contextual information about writing styles and case transitions is injected between the levels of the SDLBA. An implementation of the SDLBA together with a 36,100-entry lexicon is described. In terms of recognition speed, the results show that the SDLBA together with a tree-structured lexicon outperforms a baseline system that uses a Viterbi-flat-lexicon scheme while maintaining the same accuracy and consuming a reasonable amount of memory.
 Download file (176 KB)
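
As a rough illustration of why a tree-structured lexicon can be cheaper to search than a flat one, the sketch below builds a prefix tree over a toy word list and compares how many character nodes each layout contains. It is not the SDLBA or the authors' system: the word list and the function names `build_lexicon_tree` and `count_character_nodes` are assumptions made for this example, and plain node counts stand in for the character HMMs that would be decoded.

```python
def build_lexicon_tree(words):
    """Build a prefix tree as nested dicts; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def count_character_nodes(node):
    """Count character nodes, i.e. the character models a tree search would visit."""
    return sum(
        1 + count_character_nodes(child)
        for key, child in node.items()
        if key != "$"
    )


# Toy lexicon, invented for this example (not taken from the paper).
words = ["montreal", "montrose", "monaco"]
tree = build_lexicon_tree(words)

# Flat lexicon: every character of every word is matched separately.
flat_cost = sum(len(word) for word in words)
# Tree lexicon: characters in a shared prefix are matched only once.
tree_cost = count_character_nodes(tree)

print(flat_cost, tree_cost)
```

The more prefixes the entries share, the larger the gap between the two counts, which is the effect a large city-name or word lexicon would be expected to benefit from.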

Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction Authors: Thomas A. Lasko, Susan E. Hauser
Organization: Gundersen Lutheran Medical Center, La Crosse; Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda
Date: 2001-2003
Pages: 9
Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary, were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%), and had the advantage of producing scores that have a useful and practical interpretation.
 Download file (31 KB)
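
One of the methods named in this abstract, the generic edit distance, can be sketched as follows. This is a minimal illustration with unit costs, not the authors' implementation; the dictionary contents, the sample OCR word and the helper name `best_match` are assumptions made for the example. The probabilistic variant mentioned in the abstract would replace the unit substitution cost with values from a substitution matrix.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def best_match(ocr_word, dictionary):
    """Return the dictionary entry with the smallest edit distance to ocr_word."""
    return min(dictionary, key=lambda entry: edit_distance(ocr_word, entry))


# Hypothetical reference dictionary and OCR output, invented for this example.
dictionary = ["university", "department", "medicine"]
print(best_match("un1versitv", dictionary))
```

With a small, closed vocabulary, comparing the OCR word against every entry is affordable; larger dictionaries would usually need pruning before the distance is computed.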

Correcting OCR Text by Association with Historical Datasets Authors: Susan Hauser, Jonathan Schlaifer, Tehseen Sabir, Dina Demner-Fushman, George Thoma
Organization: Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda
Date: 2003-2005
Pages: 10
The Medical Article Records System (MARS) developed by the Lister Hill National Center for Biomedical Communications uses scanning, OCR and automated recognition and reformatting algorithms to generate electronic bibliographic citation data from paper biomedical journal articles. The multi-engine OCR server incorporated in MARS performs well in general, but fares less well with text printed in small or italic fonts. Affiliations are often printed in small italic fonts in the journals processed by MARS. Consequently, although the automatic processes generate much of the citation data correctly, the affiliation field frequently contains incorrect data, which must be manually corrected by verification operators. In contrast, author names are usually printed in large, normal fonts that are correctly converted to text by the OCR server. The National Library of Medicine’s MEDLINE® database contains 11 million indexed citations for biomedical journal articles. This paper documents our effort to use the historical author, affiliation relationships from this large dataset to find potential correct affiliations for MARS articles based on the author and the affiliation in the OCR output. Preliminary tests using a table of about 400,000 author/affiliation pairs extracted from the corrected data from MARS indicated that about 44% of the author/affiliation pairs were repeats and that about 47% of newly converted author names would be found in this set. A text-matching algorithm was developed to determine the likelihood that an affiliation found in the table corresponding to the OCR text of the first author was the current, correct affiliation. This matching algorithm compares an affiliation found in the author/affiliation table (found with the OCR text of the first author) to the OCR output affiliation, and calculates a score indicating the similarity of the affiliation found in the table to the OCR affiliation. 
Using a ground truth set of 519 OCR author/OCR affiliation/correct affiliation triples, the matching algorithm is able to select a correct affiliation for the author 43% of the time with a false positive rate of 6%, a true negative rate of 44% and a false negative rate of 7%. MEDLINE citations with United States affiliations typically include the zip code. In addition to using author names as clues to correct affiliations, we are investigating the value of the OCR text of zip codes as clues to correct USA affiliations. Current work includes generation of an author/affiliation/zipcode table from the entire MEDLINE database and development of a daemon module to implement affiliation selection and matching for the MARS system using both author names and zip codes. Preliminary results from the initial version of the daemon module and the partially filled author/affiliation/zipcode table are encouraging.
 Download file (194 KB)

Greek Alphabet Recognition Technique for Biomedical Documents Authors: Daniel X. Le, Scott R. Straughan, George R. Thoma
Organization: National Library of Medicine, Bethesda
Date: 2001-2003
Pages: 7
Most current commercial optical character recognition (OCR) systems can accurately recognize the text in documents written in a single language. However, when dealing with Greek characters embedded in predominantly English text, these systems do not perform well, and most OCR systems do not recognize the characters as belonging to the Greek alphabet. As a result, the degree of manual review required to validate and correct OCR errors is high. To handle this problem, we propose a new technique based on features calculated from the output of multiple OCR systems, and combined with string pattern matching and document content analysis to improve the recognition of both Greek characters and regular text. Our proposed technique uses two passes of a document page image through OCR systems that use different recognition languages. Experiments carried out on a sample of medical journals show the feasibility of using the proposed technique for Greek character recognition. Preliminary evaluation conducted on a sample of medical journal page images shows that our approach improves the recognition of Greek characters embedded within predominantly English language text.
 Download file (50 KB)

OCR Correction Using Historical Relationships from Verified Text in Biomedical Citations Authors: Susan Hauser, Tehseen Sabir, George Thoma
Organization: Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda
Date: 2004-2005
Pages: 7
The Lister Hill National Center for Biomedical Communications has developed a system that incorporates OCR and automated recognition and reformatting algorithms to extract bibliographic citation data from scanned biomedical journal articles to populate the NLM’s MEDLINE® database. The multi-engine OCR server incorporated in the system performs well in general, but fares less well with text printed in the small or italic fonts often used to print institutional affiliations. Because of poor OCR and other reasons, the resulting affiliation field frequently requires a disproportionate amount of time to manually correct and verify. In contrast, author names are usually printed in large, normal fonts that are correctly recognized by the OCR system. We describe techniques to exploit the more successful OCR conversion of author names to help find the correct affiliations from MEDLINE data.
 Download file (195 KB)

Organization: Department of Electrical Engineering, University of Missouri – Columbia; Department of Computer Engineering and Computer Science, University of Missouri – Columbia
Date: 2000
Pages: 10
Word level training refers to the process of learning the parameters of a word recognition system based on word level criteria functions. Previously, researchers trained lexicon-driven handwritten word recognition systems at the character level individually. These systems generally use statistical or neural based character recognizers to produce character level confidence scores. In the case of neural networks, the objective functions used in training involve minimizing the difference between some desired outputs and the actual outputs of the network. Desired outputs are generally not directly tied to word recognition performance. In this paper, we describe methods to optimize the parameters of these networks using word level optimization criteria. Experimental results show that word level discriminative training without desired outputs not only outperforms character level training but also eliminates the difficulty of choosing desired outputs. The method can also be applied to all segmentation based handwritten word recognition systems.
 Download file (42 KB)

Word Lexicon Reduction by Character Spotting Authors: Didier Guillevic, Daisuke Nishiwaki, Keiji Yamada
Organization: Computer And Communication Media Research, NEC Corporation, Japan
Date: 2000
Pages: 10
We describe a system, currently under development, that dynamically reduces a lexicon of city names using only the information found in a word image. Isolated characters are 'spotted' within the word. The recognition results for those isolated characters are then used to initialize a Hidden Markov Model (HMM)-like module that dynamically reduces the lexicon.
 Download file (124 KB)

Active Handwritten Word Recognition Authors: Jaehwa Park and Venu Govindaraju
Organization: CEDAR, Montreal, Canada
Date: 2000
Pages: 10
An active word recognition paradigm using recursive recognition processing is proposed. To achieve a successful recognition result with the minimum required processing effort, a recursive system architecture is introduced that actively combines a recognition engine and a decision-making module. In the proposed model, a closed-loop connection between the recognizer and the decision maker operates recursively with successive upgrades of recognition accuracy. The recursion eventually reaches either a satisfactory terminal condition or a rejection state in which all resources have been exhausted. The proposed model is implemented in a segmentation-based, lexicon-driven word recognition application, and experiments show enhanced recognition results.
 Download file (64 KB)

Pattern Matching Techniques for Correcting Low Confidence OCR Words in a Known Context Authors: Glenn Ford, Susan Hauser, Daniel X. Le, George R. Thoma
Organization: National Library of Medicine, Bethesda
Date: 2001
Pages: 9
A commercial OCR system is a key component of a system developed at the National Library of Medicine for the automated extraction of bibliographic fields from biomedical journals. This 5-engine OCR system, while exhibiting high performance overall, does not reliably convert very small characters, especially those that are in italics. As a result, the “affiliations” field, which typically contains such characters in most journals, is not captured accurately and requires a disproportionately high manual input. To correct this problem, dictionaries have been created from words occurring in this field (e.g., university, department, street addresses, names of cities, etc.) from 230,000 articles already processed. The OCR output corresponding to the affiliation field is then matched against these dictionary entries by approximate string-matching techniques, and the ranked matches are presented to operators for verification. This paper outlines the techniques employed and the results of a comparative evaluation.
 Download file (1341 KB)
