This Small Business Innovation Research (SBIR) Phase I project seeks to address the most significant and challenging software need in healthcare: cohort identification. A cohort is a group of patients with a common medical condition. Cohorts underpin modern medical care: they define treatment algorithms, measure quality improvement, support government initiatives, and form the core organizing unit of research trials. While manual techniques have been developed to identify a cohort within a healthcare organization's electronic medical record (EMR), all of them rely on a physician or coder identifying and marking every record for every applicable medical condition. This manual process is inaccurate and addresses only the most common conditions. The proposed novel approach is to apply big data techniques to the detailed unstructured narrative notes recorded for every patient at every encounter in every healthcare institution. The core technology required to extract unstructured data and make it usable in healthcare is natural language processing (NLP) combined with coded representations of clinical concepts (ontologies). This proposal brings together industry-leading teams and technologies to tackle the greatest data problem in healthcare, offering a unique opportunity to significantly influence care for decades to come.
The broader impact/commercial potential of this project includes creating the foundational infrastructure for the next generation of data-driven healthcare. Just as Google and Yahoo required advanced information extraction and search indexing techniques to make the vast amount of internet data usable, healthcare requires similar enabling technology. The healthcare challenge is even more complex, given the multitude of natural language descriptions used by physicians and the complex logic that defines potential cohorts and algorithms.
To address these issues, healthcare requires the category of technologies used by Google and Yahoo, specialized for the healthcare domain. In healthcare, quality improvement requires recognizing at-risk cohorts in a population. Failing to identify these cohorts, and consequently treating them inadequately, can increase mortality by an order of magnitude, as in the case of deep vein thrombosis (DVT) in acute care. For quality measures being implemented by the federal government, defining and identifying cohorts is always the first step in tracking and reporting. Current processes are manual, limited, and inaccurate. By bringing evidence derived from clinical documentation, which is already created in the current workflow, to real-time and population-based treatment decisions, this intervention will form a foundation for data-driven care, supporting improved outcomes, shorter hospitalizations, and reduced direct medical costs.