Survival analysis

Original by P.D. Allison, 2012, 12 pages

  • Survival models are very similar to linear or logistic regression models, except that the dependent variable is a measure of the timing or rate of event occurrence
  • Most methods of survival analysis require that the event time be measured with respect to some origin time
  • Ideally the origin time is the same as the time at which observation begins, and most software programs assume that this is the case
  • If observation begins after the origin time, the analysis may need to take late entry (left truncation) into account
  • Censoring is endemic to survival data
  • Any report of survival analysis should discuss the type, cause and treatment of censoring
  • The most common type of censoring is right censoring, which occurs when observation is terminated before an individual experiences an event
  • Censoring can be informative if it occurs at varying times because individuals drop out of the study
  • A slightly less common type is interval censoring, when the exact event time is not known, only that it falls between two points in time
  • If you know the exact time at which an event occurs, use methods that treat time as continuous
  • If not (for example, when you only know the month or the year of the event), use discrete-time methods
  • For discrete-time methods you must choose between a logit model and a complementary log-log model, but in practice the choice is usually not consequential
  • The logit model is more appropriate for truly discrete events
  • The most popular method for regression analysis of survival data is Cox regression
  • Cox regression is semiparametric
  • However, parametric methods are much better at handling left censoring or interval censoring, and they can generate predicted times to events
  • One major difference between survival regression and conventional linear regression is the possibility of time dependent covariates
  • If the data contain information on more than one event for each individual then special methods are needed to take advantage of the additional information
  • Repeated events provide more statistical power
  • However, there is likely to be statistical dependence among repeated observations on the same individual
  • There are four methods to correct for repeated events: 1) robust standard errors (Huber-White or sandwich estimates), 2) generalized estimating equations (GEE), 3) random effects (mixed) models, 4) fixed effects methods
  • Stata will estimate random effects models for Cox regression, but SAS won't
  • If event times are discrete, maximum likelihood estimation requires that the models for the different event types be estimated simultaneously using the generalized logit model (there is no equivalent for the complementary log-log model)
  • Conventional wisdom has it that there should be at least 5 (some say 10) events for each parameter in the model in order for maximum likelihood estimates to have reasonably good properties
  • Missing data can be handled by multiple imputation: impute values by random draws from the predictive distribution of the missing value, generate several datasets (5 or more), each with slightly different imputed values, then combine the results into a single set of parameter estimates
  • For survival analysis, imputation should only be done on the predictor variables. Cases with missing values on the dependent variable should simply be deleted
  • Compare non-nested models with AIC or BIC (also known as SBC)
  • Preference is given to models with the lowest values of those statistics, although no p-values can be calculated
  • Magnitudes of beta coefficients (log hazard ratios) are difficult to interpret directly
  • Hazard ratios (always positive) can be confusing because a value of 1, not 0, means no effect
  • A more straightforward number is 100(HR-1), which is the percentage change in the hazard for a one-unit increase in the predictor
  • Hazard ratios have an asymmetric sampling distribution, so standard errors are not useful for them. Report 95% confidence intervals instead
  • Another statistic worth reporting is the chi-square test for the null hypothesis that all coefficients are zero
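
To make the discrete-time notes above more concrete, here is a minimal sketch of how data could be prepared for a logit or complementary log-log model. It assumes the usual person-period layout (one row per individual per period at risk); the function names `person_period`, `inv_logit` and `inv_cloglog` and the example subjects are hypothetical and not taken from the original chapter.

```python
import math

def person_period(records):
    """Expand (subject_id, last_period, event) tuples into one row per period at risk.

    last_period is the discrete period (1, 2, ...) in which the subject either
    had the event (event=1) or was right censored (event=0).
    """
    rows = []
    for subject_id, last_period, event in records:
        for period in range(1, last_period + 1):
            # The outcome is 1 only in the final period, and only if the event occurred.
            y = 1 if (event == 1 and period == last_period) else 0
            rows.append((subject_id, period, y))
    return rows

def inv_logit(eta):
    """Discrete-time hazard implied by a logit model with linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_cloglog(eta):
    """Discrete-time hazard implied by a complementary log-log model."""
    return 1.0 - math.exp(-math.exp(eta))

# Hypothetical data: subject "a" has the event in period 3,
# subject "b" is right censored after period 2.
subjects = [("a", 3, 1), ("b", 2, 0)]
rows = person_period(subjects)
for row in rows:
    print(row)

# When the hazard per period is small, the two links give similar values,
# which is one reason the choice between them rarely matters in practice.
print(inv_logit(-3.0), inv_cloglog(-3.0))
```

The expanded rows can then be passed to any binary regression routine, with `y` as the outcome and `period` (plus covariates) as predictors.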
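
The "combine into a single set of parameter estimates" step of multiple imputation is commonly done with Rubin's rules. The sketch below assumes that approach (the note itself does not name the combining rule) and uses made-up coefficients and standard errors from five imputed datasets; `pool_estimates` is a hypothetical helper name.

```python
import math
from statistics import mean, variance

def pool_estimates(estimates, std_errors):
    """Combine one coefficient across m imputed datasets using Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the estimates (divisor m - 1).
    between = variance(estimates)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical results for a single predictor from 5 imputed datasets.
betas = [0.50, 0.55, 0.45, 0.52, 0.48]
ses = [0.10, 0.10, 0.10, 0.10, 0.10]
pooled_beta, pooled_se = pool_estimates(betas, ses)
print(pooled_beta, pooled_se)
```

The pooled standard error is larger than any single-dataset standard error because it also reflects the uncertainty introduced by the imputation itself.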
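
As a small illustration of comparing non-nested models, the sketch below computes AIC and BIC from a model's log-likelihood using the standard definitions. The log-likelihoods, parameter counts and sample size are invented for the example, and which count to use as `n_obs` with censored data (cases or events) is a choice left to the analyst here, not something the note specifies.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 log L + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_obs):
    """Bayesian (Schwarz) information criterion: -2 log L + k ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical fitted models: (log-likelihood, number of parameters).
model_a = (-520.3, 3)
model_b = (-518.9, 5)
n_obs = 200  # assumed sample size for the BIC penalty

for name, (ll, k) in [("A", model_a), ("B", model_b)]:
    print(name, aic(ll, k), bic(ll, k, n_obs))

# Lower is better: here the smaller model A wins on both criteria,
# because B's gain in log-likelihood does not offset its extra parameters.
```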
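
The interpretation points about hazard ratios can be worked through numerically. The sketch below assumes a coefficient and standard error reported on the log-hazard scale and a normal-approximation interval (z = 1.96); the helper name `hazard_ratio_summary` and the input values are hypothetical.

```python
import math

def hazard_ratio_summary(beta, se, z=1.96):
    """Convert a log-hazard coefficient into a hazard ratio, percentage change and CI."""
    hr = math.exp(beta)
    # The interval is built on the log scale and then exponentiated,
    # which is why it is not symmetric around the hazard ratio.
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    pct_change = 100.0 * (hr - 1.0)
    return {"hr": hr, "pct_change": pct_change, "lower": lower, "upper": upper}

# Hypothetical coefficient chosen so that the hazard ratio is 1.5.
summary = hazard_ratio_summary(beta=math.log(1.5), se=0.2)
print(summary)

# A hazard ratio of 1.5 means a 50% higher hazard per one-unit increase;
# the upper limit lies further from 1.5 than the lower limit does.
```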