SBIR-STTR Award

New Methods To Reduce Bias And Mean Square Error Of Maximum Likelihood Estimators
Award last edited on: 8/8/14

Sponsored Program
SBIR
Awarding Agency
NIH : NIGMS
Total Award Amount
$1,102,852
Award Phase
2
Solicitation Topic Code
-----

Principal Investigator
Pralay Senchaudhuri

Company Information

Cytel Software Corporation (AKA: Cytel, Inc)

675 Massachusetts Avenue #3
Cambridge, MA 02139
   (617) 661-2011
   info@cytel.com
   www.cytel.com
Location: Single
Congr. District: 05
County: Middlesex

Phase I

Contract Number: 1R43RR023228-01
Start Date: 7/1/09    Completed: 12/31/09
Phase I year
2009
Phase I Amount
$106,212
Logistic regression is the most frequently used model for binary data and has widespread applicability in the health, behavioral, and physical sciences. Over two thousand research papers were published in 1999 in which "logistic regression" was in the title of the paper or among the keywords. Maximum likelihood is the nearly universal method for computing estimates of regression coefficients in logistic regression models. These estimates are reliable for problems with large samples and when the proportion of responses is neither too small nor too large. However, it has been known for several years that maximum likelihood estimates can have high bias and mean square error for small, sparse or unbalanced datasets, with the latter referring to a considerable difference between the number of responses and non-responses. Exact logistic regression is a method invented by D. R. Cox that is often useful in such situations. However, exact logistic regression is computationally intensive and is limited in practice in terms of the size of datasets and the number of covariates that it can handle before running out of memory or taking an inordinate amount of computing time. D. Firth has developed a method for reducing bias and mean square error for logistic regression as well as other generalized regression models that is not as computationally demanding. Studies in the literature have shown that the method often improves on maximum likelihood. Firth's method is not available in any commercial software package today. We propose to incorporate Firth's method into LogXact, Cytel's regression package, as well as into PROC LOGXACT, a module that runs seamlessly as a part of the SAS software system. In addition to incorporating Firth's method for logistic regression we intend to develop it to apply to conditional logistic regression, ordered and unordered polytomous regression, Poisson regression and Negative Binomial regression. 
Firth's method does not perform well over certain ranges of model parameters in moderate-sized samples in logistic regression. There are instances when it is worse than maximum likelihood. We have created a novel method that generalizes Firth's method to overcome this shortcoming. We propose to implement this method in LogXact and PROC LOGXACT. Under certain unusual conditions both maximum likelihood and Firth's method produce poor estimates for logistic regression. We have developed a diagnostic measure that identifies this situation, and we will incorporate it as part of our generalization of Firth's method. We will also investigate a Bayesian estimator and the target estimator suggested by Cabrera and Fernholz, both of which show promise of performing well in this situation.
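
The bias-reduction idea attributed to Firth in the abstract above can be illustrated with a minimal Python sketch. It is not the LogXact or PROC LOGXACT implementation, and it is not the generalized method the abstract proposes. It assumes the commonly published form of Firth's approach, in which the log-likelihood is penalized by half the log-determinant of the Fisher information, and it is restricted to a model with an intercept and a single covariate. The function name `firth_logistic_1d` and the toy data are hypothetical.

```python
import math


def firth_logistic_1d(x, y, max_iter=200, tol=1e-10):
    """Firth-type penalized fit of logit P(y=1) = b0 + b1*x (illustrative only).

    Uses Fisher scoring on the modified score
        U*(b) = sum_i (y_i - p_i + h_i * (0.5 - p_i)) * (1, x_i),
    where h_i are the diagonal elements of the weighted hat matrix.
    No step-halving safeguard is included (an assumption that the data are tame).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]

        # Fisher information X'WX for design columns (1, x): [[a, b], [b, c]]
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        c = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * c - b * b

        # Hat diagonals: h_i = w_i * (1, x_i) (X'WX)^{-1} (1, x_i)'
        h = [wi * (c - 2.0 * b * xi + a * xi * xi) / det for wi, xi in zip(w, x)]

        # Modified score components
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))

        # Scoring step: (X'WX)^{-1} U*
        s0 = (c * u0 - b * u1) / det
        s1 = (-b * u0 + a * u1) / det
        b0 += s0
        b1 += s1
        if max(abs(s0), abs(s1)) < tol:
            break
    return b0, b1


# Hypothetical toy data: the covariate separates the responses perfectly,
# so the ordinary maximum likelihood slope does not exist (it diverges).
x = [0] * 5 + [1] * 5
y = [0] * 5 + [1] * 5
b0, b1 = firth_logistic_1d(x, y)
print(round(b0, 3), round(b1, 3))
```

On this toy data the penalized fit stays finite. For a saturated two-group layout such as this one, the penalized estimate coincides with adding 1/2 to each cell count, which gives roughly (-2.398, 4.796).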

Project Terms:
Address; Algorithms; Code; Coding System; Computer Programs; Computer Programs and Programming; Computer software; Data; Data Set; Dataset; Diagnostic; Engineering; Engineerings; Epidemiologist; Health Sciences; Literature; Logistic Regressions; Markov Chains; Markov Process; Maximum Likelihood Estimate; Measures; Memory; Methods; Modeling; Monte Carlo Method; Paper; Phase; Procedures; Programs (PT); Programs [Publication Type]; Publishing; Research; Running; Sample Size; Sampling; Software; Time; Writing; behavioral health; case control; computer code; computer program; computer program/software; computer programming; design; designing; improved; novel; physical science; programs; prototype; response; software systems

Phase II

Contract Number: 9R44GM104597-02A1
Start Date: 7/1/09    Completed: 7/31/14
Phase II year
2012
(last award dollars: 2013)
Phase II Amount
$996,640

Categorical outcomes are ubiquitous in biomedical research, and generalized linear models (GLMs) represent the most widely applied methodology for testing associations between categorical variables and fixed investigative factors. Logistic regression in particular is the most frequently used model for binary data and has widespread applicability in the health, behavioral, and physical sciences. King and Ryan (2002) stated that there were 2,770 research papers published in 1999 in which "logistic regression" was in the title of the paper or among the keywords. King and Zeng (2001) referred to the use of the maximum likelihood method in logistic regression as "the nearly universal method". Maximum likelihood estimates (MLEs) for logistic regression are based on large-sample approximations that are reliable when samples are large and the proportion of responses is neither too small nor too large. However, it has been known for several years that MLEs are not reliable for small, sparse, or unbalanced datasets, with the latter referring to a considerable difference between the number of zeros and ones of the response variable. Recent research has suggested a flexible means of correcting MLE bias and improving performance using a penalized likelihood-based approach, but the underlying theory has not been fully applied and implemented for practical use. In this project, we will extend the work begun during Phase I with logistic regression by (1) implementing the bias correction approach for a variety of other GLMs, including Poisson, multinomial, and negative binomial models, as well as for censored survival data; (2) providing new diagnostic procedures that identify potential problems with near separability and MLE bias; (3) implementing and evaluating an exact target estimation approach for bias correction in logistic regression; (4) improving the computational algorithms required for Aims 1-3; and (5) implementing the procedures in a SAS PROC.
Given the ubiquity of categorical regression in public health and biomedical research, the final product of this effort will provide a critical intermediate alternative when analyzing data for which standard large-sample methods are unreliable and small-sample exact methods are infeasible.
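
The separability problem named in Aim 2 can be made concrete with a small Python sketch. It is not the diagnostic procedure proposed in this award, whose details the abstract does not give. It is only a textbook-style check for the special case of one covariate plus an intercept, where separation reduces to comparing the covariate ranges of the two response groups. The function name `separation_status` and its return labels are hypothetical.

```python
def separation_status(x, y):
    """Classify a binary-response dataset with one covariate (illustrative only).

    Returns "complete", "quasi-complete", or "overlap". Under complete or
    quasi-complete separation the maximum likelihood slope is not finite.
    Assumes y contains only 0 and 1 and x takes at least two distinct values.
    """
    if min(x) == max(x):
        raise ValueError("covariate is constant; slope is not identifiable")

    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]

    # All responses identical: the intercept alone separates the data.
    if not x0 or not x1:
        return "complete"

    # A threshold on x splits the two groups with no ties at the boundary.
    if max(x0) < min(x1) or max(x1) < min(x0):
        return "complete"

    # The groups touch only at a single boundary value of x.
    if max(x0) <= min(x1) or max(x1) <= min(x0):
        return "quasi-complete"

    return "overlap"


# Hypothetical examples
print(separation_status([0, 0, 1, 1], [0, 0, 1, 1]))
print(separation_status([0, 1, 1, 2], [0, 0, 1, 1]))
print(separation_status([0, 1, 2, 3], [0, 1, 0, 1]))
```

With several covariates the same question becomes whether a separating hyperplane exists, which is usually posed as a linear programming problem. The one-dimensional version here is only meant to show what "separability" refers to.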

Public Health Relevance:
Generalized linear models (such as logistic regression) for categorical data have widespread applicability in the health sciences. Maximum likelihood, the nearly universal method for computing estimates in generalized linear regression models, has been known to have high bias and mean square error for small, sparse or unbalanced datasets. We propose to develop commercial software that incorporates several new methods that have lower bias and mean square error in logistic regression and other generalized linear models and Cox proportional hazard models.

Project Terms:
Algorithms; base; Behavioral Sciences; Biomedical Research; Computational algorithm; Computer software; Cox Proportional Hazards Models; Data; Data Analyses; Data Set; Development; Diagnostic Procedure; flexibility; Health Sciences; improved; Industry; interest; Linear Models; Linear Regressions; Link; Logistic Regressions; Logistics; Maximum Likelihood Estimate; Measures; Methodology; Methods; Modeling; novel diagnostics; Outcome; Paper; Performance; Phase; physical science; Probability; Procedures; Proportional Hazards Models; public health medicine (field); Publishing; Research; Research Personnel; response; Sample Size; Sampling; Small Business Innovation Research Grant; Statistical Models; Technology; Testing; theories; tool; Work