SBIR-STTR Award

Geographic Information Retrieval for Arabic
Award last edited on: 6/13/2006

Sponsored Program
SBIR
Awarding Agency
NSF
Total Award Amount
$99,900
Award Phase
1
Solicitation Topic Code
-----

Principal Investigator
Andras Kornai

Company Information

MetaCarta Inc

12018 Sunrise Valley Drive Suite 300
Reston, VA 20191
   (937) 521-4200
   N/A
   www.metacarta.com
Location: Multiple
Congr. District: 11
County: Fairfax

Phase I

Contract Number: ----------
Start Date: ----    Completed: ----
Phase I year
2006
Phase I Amount
$99,900
This SBIR Phase I research project by MetaCarta proposes to introduce a novel annotation technique, parallel bootstrapping, that takes advantage of existing data sets to create high-quality training material for toponym extraction and resolution. Information Retrieval (IR) systems that can deal with Arabic already exist, but they perform no Geographic Information Retrieval (GIR). As the experience of MetaCarta's users shows, it is practically impossible to retrofit standard keyword-based IR systems to perform GIR at a high level, so the only way to achieve Arabic GIR capability is to start with a GIR system. The availability of a high-quality English GIR system makes it possible to address the greatest bottleneck of machine learning projects, the lack of manually truthed training data, by an innovative parallel bootstrap technique. Much disambiguation, and more generally the extraction of semantic content from text, is still performed by rule-based systems that summarize expert knowledge of a domain. In contrast, MetaCarta employs machine-learning techniques that combine Hidden Markov and Maximum Entropy methods. For Arabic, we propose to restrict the rule-based component to morphological analysis, with later stages, in particular the extraction and disambiguation of toponyms, to be performed by systems trained on truthed Arabic text. While plain (untruthed) Arabic text is now available in large quantities (see in particular the Arabic Gigaword corpus produced by the Linguistic Data Consortium, LDC), the amount of tagged material is considerably smaller, and the detailed truth values required for toponym extraction and disambiguation are extremely labor-intensive to create by manual annotation. MetaCarta will use as input the LDC 2004T17 and T18 parallel corpora, running the English side through the existing MetaCarta system to produce the in-depth toponym annotation, and projecting this annotation back onto the Arabic side.
This technology has broad appeal to customers that have an interest in extending GIR to Arabic documents. Representative customers are highly interested in activities restricted to narrow geographic confines, and many of the documents providing information about Middle Eastern areas of key strategic importance are available only in Arabic. Deploying Arabic GIR would also enable analysts to focus more rapidly on the relevant documents.
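
The projection step described in the abstract (annotating the English side of a parallel corpus and carrying the toponym annotation over to the Arabic side) can be illustrated with a minimal sketch. It is not MetaCarta's implementation. It assumes that a token-level word alignment between the two sentences is already available, which the abstract does not specify, and the function name project_spans, the span representation, and the toy sentence pair are all hypothetical.

```python
from typing import List, Tuple

# A span is a (start, end) pair of token indices, end exclusive (assumed convention).
Span = Tuple[int, int]
# An alignment link pairs an English token index with an Arabic token index.
Link = Tuple[int, int]


def project_spans(en_spans: List[Span], alignment: List[Link]) -> List[Span]:
    """Project English toponym spans onto the Arabic side via alignment links.

    Hypothetical helper: each English span is mapped to the smallest
    contiguous Arabic span covering every Arabic token linked to it.
    Spans with no links are dropped rather than guessed.
    """
    projected = []
    for start, end in en_spans:
        # Arabic tokens linked to any English token inside the span.
        targets = [ar for en, ar in alignment if start <= en < end]
        if not targets:
            continue  # unaligned toponym: nothing to project
        projected.append((min(targets), max(targets) + 1))
    return projected


# Toy sentence pair (illustrative only, not taken from the LDC corpora).
en_tokens = ["The", "delegation", "arrived", "in", "Cairo", "yesterday"]
ar_tokens = ["وصل", "الوفد", "إلى", "القاهرة", "أمس"]

# Assumed alignment, e.g. as produced by some external word aligner.
alignment = [(1, 1), (2, 0), (3, 2), (4, 3), (5, 4)]

# Assumed output of an English toponym tagger: "Cairo" at token 4.
en_toponyms = [(4, 5)]

ar_toponyms = project_spans(en_toponyms, alignment)
for start, end in ar_toponyms:
    print(" ".join(ar_tokens[start:end]))
```

In this sketch the Arabic tokens covered by the projected spans would serve as automatically generated training labels. A real pipeline would also need to handle alignment errors, Arabic clitics attached to place names, and the projection of the resolved coordinates along with the span.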

Phase II

Contract Number: ----------
Start Date: ----    Completed: ----
Phase II year
----
Phase II Amount
----