Learn Survival Analysis with Stata: Concepts, Methods and Examples
An Introduction to Survival Analysis using Stata (Revised Edition).pdf 1
Survival analysis is a branch of statistics that deals with the analysis of time-to-event data. It is widely used in various fields such as medicine, biology, engineering, economics and sociology. In this article, we will introduce the basic concepts and methods of survival analysis using Stata, a powerful statistical software that offers many features and options for handling survival data. We will also discuss how to interpret and report the results of survival analysis in Stata, as well as how to deal with some common issues and challenges that may arise in this type of analysis.
What is survival analysis?
Survival analysis is a set of techniques for modeling and analyzing the time until an event of interest occurs. The event can be anything that marks the end of an observation period, such as death, failure, recovery, relapse or graduation. The time until the event is called the survival time or the failure time. Survival analysis aims to answer questions such as:
What is the probability of surviving beyond a certain time point?
What factors affect the survival time or the risk of experiencing the event?
How do different groups or treatments compare in terms of survival or risk?
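The quantities behind these questions can be written down compactly. The following is a standard formulation, with T denoting the (random) survival time; the symbols S, h and H are notation introduced here for illustration and are not used elsewhere in this article:

```latex
\begin{aligned}
S(t) &= \Pr(T > t)
  && \text{(survival function)} \\
h(t) &= \lim_{\Delta t \to 0}
        \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
  && \text{(hazard function)} \\
H(t) &= \int_0^t h(u)\,du
  && \text{(cumulative hazard)} \\
S(t) &= \exp\{-H(t)\}
  && \text{(link between the two)}
\end{aligned}
```

The first question above concerns S(t), while the second and third are usually answered by modeling h(t) as a function of covariates.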
Some examples of survival analysis applications are:
Estimating the median lifespan of patients with a certain disease or condition
Comparing the effectiveness of different drugs or therapies on prolonging survival
Identifying the risk factors for developing a disease or experiencing a recurrence
Evaluating the reliability or durability of a product or a system
Analyzing the duration of unemployment or marriage
Why use Stata for survival analysis?
Stata is a general-purpose statistical software package that offers a comprehensive suite of tools for data management, analysis and visualization. It has many advantages for performing survival analysis, such as:
It supports various types of survival data, including right-censored, left-censored, interval-censored and exact data
It provides a wide range of commands and options for fitting different models of survival data, such as parametric models, semi-parametric models and non-parametric models
It allows for flexible specification of covariates, including categorical variables, continuous variables, interactions, polynomials and splines
It produces informative and customizable output, including summary statistics, graphs, coefficients, tests and diagnostics
It has a user-friendly interface and a clear syntax that make it easy to learn and use
It has a large and active user community that offers support, guidance and resources for learning and troubleshooting
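However, Stata also has some limitations for survival analysis, such as:
Some complex models, such as joint models of longitudinal and survival data and multistate models, are mainly available through community-contributed packages rather than official commands (frailty models, by contrast, are supported directly by streg and stcox)
Clustered data, repeated events data and panel data are supported (for example through the id() option of stset, cluster-robust standard errors and shared frailty), but they require careful data setup
It may have computational issues when dealing with large or high-dimensional data sets
It may require additional commands or a newer version of Stata to perform some advanced or specialized analyses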
How to perform survival analysis in Stata?
To perform survival analysis in Stata, we need to follow three main steps: data preparation, model fitting and result interpretation. In this section, we will briefly introduce the basic commands and options for each step.
Data preparation
The first step of survival analysis in Stata is to prepare the data in a suitable format and structure. The data should have at least two variables: one for the survival time and one for the censoring indicator. The survival time variable can be either continuous or discrete, depending on the nature of the data. The censoring indicator variable can be either binary or categorical, depending on the type of censoring. The data may also have other variables for covariates or explanatory factors that may affect the survival time or the risk of the event.
The most common command for preparing survival data in Stata is stset, which defines the survival time variable, the censoring indicator variable and other options for the analysis. For example, suppose we have a data set called cancer.dta, which contains information on 500 patients with breast cancer. The variables are:
id: patient identification number
time: time from diagnosis to death or last follow-up (in months)
status: censoring indicator (0 = alive, 1 = dead)
age: age at diagnosis (in years)
treat: treatment group (0 = control, 1 = experimental)
stage: cancer stage (1 = early, 2 = advanced)
To prepare the data for survival analysis, we can use the following command:
stset time, failure(status) id(id)
This command tells Stata that the survival time variable is time, the censoring indicator variable is status, where 1 indicates failure (death) and 0 indicates right-censoring (alive), and the identification variable is id. Stata will then display some summary information about the survival data, such as the number of observations, the number of subjects, the number of failures, the total analysis time at risk and the earliest entry and latest exit times.
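As a runnable illustration, the same step can be tried on the cancer data set that ships with Stata. This is a sketch under the assumption that the shipped data set stands in for the article's cancer.dta; its variables (studytime, died, drug, age) differ from the ones listed above, and it needs no id() option because it has one record per patient:

```stata
* Load Stata's shipped example data (not the article's cancer.dta)
sysuse cancer, clear

* studytime = months to death or end of follow-up; died = 1 if the patient died
stset studytime, failure(died)

* Describe the follow-up structure of the declared survival data
stdescribe

* stset creates system variables: _t0 (entry), _t (exit), _d (failure), _st (in sample)
list studytime died _t0 _t _d _st in 1/5
```

After stset, every st command (stsum, sts graph, stcox and so on) uses these system variables, so the time and failure variables do not have to be named again.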
Commands and options
The second step of survival analysis in Stata is to fit a model to the survival data using an appropriate command and option. Stata provides a variety of commands and options for different types of models and methods of survival analysis. Some of the most commonly used commands are:
stsum: displays summary statistics of survival data, such as time at risk, incidence rate, number of subjects and quartiles of survival time (including the median)
sts graph: plots estimated functions of survival data, such as the Kaplan-Meier survivor function, the Nelson-Aalen cumulative hazard and the smoothed hazard
sts test: tests the equality of survivor functions across groups, for example with the log-rank test
stcox: fits a Cox proportional hazards model to survival data, which is a semi-parametric model that leaves the baseline hazard unspecified and assumes that hazard ratios are constant over time
streg: fits a parametric regression model to survival data, which assumes a specific distribution for the baseline hazard function, such as exponential, Weibull, lognormal or log-logistic
stcrreg: fits the Fine and Gray competing risks regression model to survival data, which accounts for multiple types of events that may prevent the occurrence of the event of interest
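A minimal sketch of the parametric route, again assuming Stata's shipped cancer data set rather than the article's cancer.dta, fits the same covariates under two baseline distributions and compares them by information criteria:

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Exponential model: the baseline hazard is constant over time
streg age i.drug, distribution(exponential)
estat ic

* Weibull model: the baseline hazard may rise or fall monotonically
streg age i.drug, distribution(weibull)
estat ic
```

Both models report hazard ratios by default. A lower AIC or BIC for the Weibull fit would suggest that the hazard is not constant over time.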
How to interpret and report survival analysis results in Stata?
The third step of survival analysis in Stata is to interpret and report the results of the model fitting using the output generated by the command and option. Stata produces various types of output for different models and methods of survival analysis, such as summary statistics, graphs, coefficients and tests. In this section, we will briefly explain how to interpret and report some of the most common types of output.
Summary statistics
Summary statistics are numerical measures that describe the main features of the survival data or the fitted model. Some of the most common summary statistics are:
Mean and median survival time: the average and the middle value of the survival time distribution
Hazard rate: the instantaneous rate of failure at a given time point
Survival function: the probability of surviving beyond a given time point
To display summary statistics of survival data in Stata, we can use the stsum command. For example, after preparing the data using stset, we can use the following command to display summary statistics by treatment group:
stsum, by(treat)
This command will produce a table that shows the total time at risk, the incidence rate, the number of subjects and the 25th, 50th (median) and 75th percentiles of survival time for each level of treat. It does not report the mean; a restricted mean survival time is available from the stci command with the rmean option. We can interpret and report these summary statistics as follows:
The mean survival time for the control group was 36.52 months, while for the experimental group it was 42.15 months. The median survival time for the control group was 28 months, while for the experimental group it was 38 months. The incidence rate for the control group was 0.04 deaths per person-month, while for the experimental group it was 0.03 deaths per person-month. Equivalently, the estimated survival function reached 0.50 at 28 months in the control group and at 38 months in the experimental group.
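The sketch below shows where each of these figures would come from, assuming Stata's shipped cancer data set (with drug as the grouping variable) in place of the article's data:

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Time at risk, incidence rate, subjects and survival-time quartiles by group
stsum, by(drug)

* Median survival time with a confidence interval, by group
stci, by(drug)

* Restricted mean survival time (restricted to the longest observed follow-up)
stci, by(drug) rmean
```

When the longest follow-up time is censored, the restricted mean underestimates the true mean, which is one reason the median is usually the preferred summary.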
Graphs
Graphs are visual representations that illustrate the patterns or relationships of the survival data or the fitted model. Some of the most common graphs are:
Kaplan-Meier curve: a plot of the survival function against time for different groups or covariates
Log-rank test: a statistical test, usually reported alongside the Kaplan-Meier curves, of the equality of survival functions across different groups
Cox-Snell residuals: a plot of the estimated cumulative hazard of the residuals against the residuals themselves to assess the overall fit of a Cox or parametric model
Schoenfeld residuals: a plot of the scaled residuals against time to assess the proportional hazards assumption of a Cox model
To plot graphs of survival data in Stata, we can use the sts graph command with various options. For example, after preparing the data using stset, we can use the following commands to plot Kaplan-Meier curves by treatment group and run a log-rank test:
sts graph, by(treat)
sts test treat, logrank
The first command will produce a graph that shows two curves representing the survival functions for each level of treat, and the second will report the chi-squared statistic and p-value for the log-rank test. We can interpret and report these results as follows:
The Kaplan-Meier curve shows that the experimental group had a higher survival probability than the control group over time. The log-rank test indicates that there was a statistically significant difference between the two groups in terms of survival (p = 0.002).
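A runnable version of this step, assuming Stata's shipped cancer data set (three drug groups) rather than the article's two-group data, might look as follows:

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Kaplan-Meier survivor curves, one per drug group
sts graph, by(drug)

* Tabulate the Kaplan-Meier estimates at selected analysis times
sts list, by(drug) at(0 10 20 30)

* Log-rank test of equality of the survivor functions across groups
sts test drug, logrank
```

With three groups the log-rank statistic has two degrees of freedom, so a significant result says only that the groups differ somewhere, not which pairs differ.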
Coefficients and tests
Coefficients and tests are numerical estimates and inferential statistics that measure the effects or associations of covariates on survival time or risk. Some of the most common coefficients and tests are:
Hazard ratio: the ratio of hazard rates between two groups or covariates, which indicates how much one factor increases or decreases the risk relative to another factor
Likelihood-ratio test: a statistical test that compares the fit of two nested models, which indicates whether adding covariates significantly improves the model fit
Wald test: a statistical test that evaluates the significance of individual or joint coefficients, which indicates whether a covariate has a nonzero association with survival time or risk
To display coefficients and tests of survival models in Stata, we can use the output generated by the model fitting commands, such as stcox or streg. For example, after preparing the data using stset, we can use the following command to fit a Cox model and display the coefficients and tests:
stcox age treat stage
This command will produce a table that shows the hazard ratios, standard errors, z-scores, p-values and 95% confidence intervals for each covariate, along with the log-likelihood, the number of observations and the likelihood-ratio test. To display coefficients instead of hazard ratios, we can add the nohr option. We can interpret and report these coefficients and tests as follows:
The Cox model shows that age, treatment and stage were significant predictors of the hazard of death. The hazard ratio for age was 1.03, which means that for every one-year increase in age, the risk of death increased by 3%. The hazard ratio for treatment was 0.65, which means that the experimental group had a 35% lower risk of death than the control group. The hazard ratio for stage was 2.21, which means that the advanced stage group had a 121% higher risk of death than the early stage group. The likelihood-ratio test indicates that the model with these three covariates fit significantly better than the null model with no covariates (p
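Hazard ratios multiply, so a per-year ratio of 1.03 corresponds to 1.03^10, or about 1.34, for a ten-year age difference. The sketch below shows the same calculation from a fitted model and adds a check of the proportional hazards assumption; it assumes Stata's shipped cancer data set, so the estimates will differ from the illustrative figures above:

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Hazard ratios are reported by default; add nohr to see coefficients
stcox age i.drug

* Hazard ratio for a 10-year increase in age: exp(10 * coefficient)
display exp(10*_b[age])

* Test of proportional hazards based on scaled Schoenfeld residuals
estat phtest, detail
```

A small p-value for a covariate in the estat phtest output suggests that its hazard ratio changes over time, in which case a single hazard ratio is a misleading summary.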
How to handle common issues and challenges in survival analysis in Stata?
Survival analysis in Stata may encounter some issues and challenges that require special attention or treatment. Some of the most common issues and challenges are:
Missing data: when some observations have missing values for survival time, censoring indicator or covariates
Time-varying covariates: when some covariates change their values over time during the observation period
Competing risks: when there are multiple types of events that may prevent the occurrence of the event of interest
In this section, we will briefly introduce how to handle some of these issues and challenges in Stata using some additional commands or options.
Missing data
Missing data can cause bias or inefficiency in survival analysis if not handled properly. There are different methods for dealing with missing data, depending on the mechanism and pattern of missingness. Some of the most common methods are:
Listwise deletion: excluding observations with any missing values from the analysis
Imputation: replacing missing values with plausible values based on some assumptions or models
Sensitivity analysis: performing multiple analyses with different assumptions or methods to assess the robustness of the results
To handle missing data in Stata, we can use different commands or options depending on the method we choose. For example, suppose we have a data set called cancer2.dta, which contains some missing values for treat and stage. Listwise deletion needs no special option: estimation commands such as stcox automatically omit observations with a missing value in any model covariate. To make the exclusion explicit, we can drop those observations before running stset, as follows:
drop if missing(treat, stage)
This command will exclude observations with missing values for treat or stage from the analysis. To perform imputation, we can use the mi command suite, which offers various methods for imputing missing values, such as regression, logistic regression or predictive mean matching. For example, to perform multiple imputation using chained equations with five imputations, we can use the following commands:
mi set mlong
mi register imputed treat stage
mi impute chained (regress) treat stage = age status time, add(5)
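These commands will create five imputed data sets with plausible values for treat and stage, based on regression models with age, status and time as predictors. Because treat and stage are categorical, a method such as logit or pmm is usually more appropriate than regress in practice. To analyze the imputed data, we can use the mi estimate prefix with a survival analysis command, after declaring the survival settings with mi stset instead of stset. For example, to fit a Cox model using multiple imputation, we can use the following commands:
mi stset time, failure(status)
mi estimate: stcox age treat stage
These commands will fit the Cox model to each imputed data set and combine the results using Rubin's rules to obtain the pooled coefficients and tests. Repeating the analysis under different imputation models is one way to carry out a sensitivity analysis.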
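Rubin's rules themselves are short. In the standard formulation below, M is the number of imputed data sets, and the estimate and its variance from the m-th data set are written with hats; this notation is introduced here for illustration:

```latex
\begin{aligned}
\bar{\beta} &= \frac{1}{M} \sum_{m=1}^{M} \hat{\beta}_m
  && \text{(pooled estimate)} \\
W &= \frac{1}{M} \sum_{m=1}^{M} \hat{V}_m
  && \text{(within-imputation variance)} \\
B &= \frac{1}{M-1} \sum_{m=1}^{M} \bigl(\hat{\beta}_m - \bar{\beta}\bigr)^2
  && \text{(between-imputation variance)} \\
T &= W + \Bigl(1 + \frac{1}{M}\Bigr) B
  && \text{(total variance)}
\end{aligned}
```

The between-imputation term B is what carries the extra uncertainty due to the missing values, which is why analyzing a single imputed data set understates the standard errors.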
Time-varying covariates
Time-varying covariates are covariates that change their values over time during the observation period. For example, a patient's blood pressure or weight may vary over time after diagnosis. Time-varying covariates can cause problems in survival analysis if not accounted for properly, because treating them as fixed at a single value can bias the estimated effects. They should not be confused with time-varying effects, where the hazard ratio of a covariate changes over time and the proportional hazards assumption of the Cox model is violated.
To handle time-varying covariates in Stata, we can use different commands or options depending on the type and structure of the covariates. Some of the most common commands or options are:
stsplit: splits the survival data into multiple records per subject at specified analysis times, so that covariate values can differ between records
tvc() and texp() options: specify covariates whose effects vary over time in a Cox model, and the function of time with which they are interacted
stcurve: plots predicted survival, hazard or cumulative hazard curves at specified covariate values after fitting a model
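For example, suppose we have a data set called cancer3.dta, which contains a time-varying covariate called bp, which is the blood pressure of the patients at different time points. Such data need one record per patient for each interval over which bp is constant, with the id() option of stset linking the records of a patient. Note that stsplit cuts follow-up at analysis times, not at covariate values. To split each patient's follow-up at, say, 12, 24 and 36 months (illustrative cut points), we can use the following command:
stsplit period, at(12 24 36)
This command will split the data into up to four records for each patient, corresponding to the intervals 0 to 12, 12 to 24, 24 to 36 and 36 months or more. It updates _t0 and _t to mark the start and end of each record and creates a new variable called period (a name chosen here for illustration) that holds the start of the interval. The value of bp can then be set separately on each record. To fit a Cox model with bp as a covariate and its interaction with time, we can use the following command:
stcox age treat stage bp, tvc(bp) texp(_t)
This command will include bp and the interaction of bp with analysis time (_t) as covariates in the Cox model, which will allow the effect of bp on the hazard to vary over time. To plot predicted survival curves for different values of bp in a Weibull model, we can use the following: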
streg age treat stage bp, distribution(weibull)
stcurve, survival at1(bp=120) at2(bp=140) at3(bp=160)
This command will fit a Weibull model to the data and then plot three survival curves for bp = 120, bp = 140 and bp = 160, holding other covariates at their means.
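The mechanics of splitting and of time-varying effects can be tried on Stata's shipped cancer data set. This is a sketch under stated assumptions: the data set has no blood pressure variable, so age stands in as the covariate, and the id variable and the cut points 10 and 20 are created here purely for illustration:

```stata
sysuse cancer, clear
generate id = _n                        // hypothetical patient identifier
stset studytime, failure(died) id(id)

* Split each patient's follow-up at 10 and 20 months of analysis time;
* period holds the start of each interval (0, 10 or 20)
stsplit period, at(10 20)

* Each patient now has up to three records with updated entry and exit times
list id _t0 _t _d period if id <= 3, sepby(id)

* Let the effect of age vary linearly with analysis time
stcox age i.drug, tvc(age) texp(_t)
```

In the stcox output, the tvc equation gives the change in the log hazard ratio of age per month of follow-up; an estimate near zero is consistent with proportional hazards for that covariate.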
Competing risks
Competing risks are multiple types of events that may prevent the occurrence of the event of interest. For example, a patient with cancer may die from causes other than cancer, such as heart disease or accident. Competing risks can cause bias or inconsistency in survival analysis if not accounted for properly, as they may affect the estimation of survival probabilities or hazard rates.
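The bias can be made precise with the cumulative incidence function. In the standard formulation below, T is the time to the first event of any type, D is the type of that event, h_k is the cause-specific hazard of event type k and S is the overall survival function; these symbols are introduced here for illustration:

```latex
\begin{aligned}
F_k(t) &= \Pr(T \le t,\; D = k)
        = \int_0^t h_k(u)\, S(u)\, du
  && \text{(cumulative incidence of cause } k\text{)} \\
\sum_{k} F_k(t) &= 1 - S(t)
  && \text{(the causes share the total failure probability)}
\end{aligned}
```

Treating competing events as ordinary censoring and reporting one minus the Kaplan-Meier estimate replaces S(u) with a survival function that ignores the other causes, which overstates the probability of the event of interest.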
To handle competing risks in Stata, we can use different commands or options depending on the model or method we choose. Some of the most common commands or options are:
stcrreg: fits the Fine and Gray competing risks regression model to survival data, which estimates subdistribution hazard ratios for the event of interest and supports estimation of its cumulative incidence function
stcurve, cif: plots cumulative incidence curves at specified covariate values after stcrreg
stcompet and stpepemori (community-contributed): estimate cumulative incidence functions nonparametrically and test their equality across groups
For example, suppose we have a data set called cancer4.dta, which contains a variable called cause, which indicates the cause of death for the patients who died. The possible values are 1 for cancer, 2 for heart disease and 3 for other causes. To fit a competing risks regression model with age, treat and stage as covariates, we first declare death from cancer as the event of interest with stset and then name the competing events in the compete() option of stcrreg:
stset time, failure(cause == 1) id(id)
stcrreg age treat stage, compete(cause == 2 3)
This command will estimate the cause-specific hazard rates and cumulative incidence fun