The ability to rapidly spot named entities (NEs) such as persons, organizations, and locations in Arabic document image data is of strategic and tactical importance. An NE extraction system that performs this task faces numerous challenges. These include dealing with images representing both handwritten and character text, images where Arabic and Romanized scripts are mixed, and images of poor quality. Indeed, experiments on combined character recognition (CR) and NE extraction systems show that NE extraction performance degrades twice as fast as CR performance as more noise is introduced into the input images. The goal of this project is to develop a high-accuracy CR and NE extraction system whose input consists of images of Arabic text. Our approach is to perform CR and NE extraction in a pipeline, with the CR component passing its multiple best hypotheses to the NE extraction component. Joint inference over these hypotheses is performed using $k$-best or approximate inference methods, improving overall system accuracy.
Keywords: Information Extraction, Named Entity Recognition, Approximate Inference, Particle Filtering, Character Recognition, Handwriting Recognition, Document Image Processing, Pattern
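The pipeline idea in the abstract, in which the CR component hands its $k$ best hypotheses to the NE extraction component and the two are scored jointly, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the project's implementation: the function `joint_k_best`, the weight `ne_weight`, the gazetteer-based `toy_tagger`, and the Romanized example strings with their log-probabilities are all hypothetical. The sketch covers only exhaustive $k$-best rescoring, not the approximate inference methods.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]      # (recognized text, CR log-probability)
Tagging = Tuple[List[str], float]   # (extracted entities, NE log-score)


def joint_k_best(
    hypotheses: List[Hypothesis],
    tagger: Callable[[str], Tagging],
    k: int,
    ne_weight: float = 1.0,
) -> Tuple[str, List[str], float]:
    """Return the (text, entities, joint score) that maximizes
    CR log-probability + ne_weight * NE log-score over the k best CR hypotheses."""
    if k < 1 or not hypotheses:
        raise ValueError("need k >= 1 and at least one hypothesis")
    # Keep only the k hypotheses the CR component ranks highest.
    top_k = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:k]
    best = None
    for text, cr_logp in top_k:
        entities, ne_logp = tagger(text)
        score = cr_logp + ne_weight * ne_logp
        if best is None or score > best[2]:
            best = (text, entities, score)
    return best


def toy_tagger(text: str) -> Tagging:
    """Hypothetical stand-in for an NE model: a gazetteer lookup that
    penalizes hypotheses in which no known entity is found."""
    gazetteer = {"Cairo", "Baghdad"}
    entities = [tok for tok in text.split() if tok in gazetteer]
    return entities, (0.0 if entities else -2.0)


# Hypothetical CR output for one text line, with made-up log-probabilities.
hyps = [
    ("meeting in Cairu", -1.0),
    ("meeting in Cairo", -1.5),
    ("meating in Cairo", -4.0),
]

one_best = joint_k_best(hyps, toy_tagger, k=1)
two_best = joint_k_best(hyps, toy_tagger, k=2)
```

In this toy setting, `k=1` commits to the top CR hypothesis, which contains a recognition error and yields no entity, whereas `k=2` lets the NE score promote the second hypothesis and recover the location. Exhaustive rescoring of this kind grows with the number of hypotheses, which is one reason approximate inference may be preferred when the hypothesis space is large.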