Survival analysis deals with predicting the time until an event of interest occurs. It models whether and when the event occurs within a specified observation period. When the event is not observed for a subject during that period, the observation is censored, i.e. incomplete.

The biological sciences are a major application area of survival analysis, where it is used to predict the time until events in organisms, e.g. when a population will multiply to a given size.

**Methods used to do survival analysis:**

Two methods commonly used to perform survival analysis in the R programming language are:

- Kaplan-Meier method
- Cox proportional hazards model

#### Kaplan-Meier Method

The Kaplan-Meier method estimates the survival distribution from truncated or censored data using the Kaplan-Meier estimator. It is a non-parametric statistic: it estimates the survival function without assuming an underlying probability distribution. The Kaplan-Meier estimates are based on the number of patients (each patient is a row of data), out of the total number of patients, who survive for a certain time after treatment. The event of interest here is death.

We represent the Kaplan-Meier function by the formula:

$$S(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

Here **S(t)** is the probability that life is longer than **t**, **ti** is a time at which at least one event happened, **di** is the number of events (e.g. deaths) that happened at time **ti**, and **ni** is the number of individuals known to have survived (neither dead nor censored) up to time **ti**.
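The product-limit calculation described above can be checked by hand. The following is a minimal sketch on a small made-up dataset (the `time` and `event` vectors are hypothetical, not taken from any real study): it computes the running product of `1 - d_i / n_i` directly and compares it with the result of `survfit()`.

```r
library(survival)

# Hypothetical toy data: follow-up time in days and event flag
# (1 = event observed, 0 = censored)
time  <- c(5, 8, 8, 12, 15, 20)
event <- c(1, 1, 0, 1, 0, 1)

# Distinct times at which at least one event occurred
event_times <- sort(unique(time[event == 1]))

# d_i: number of events at t_i; n_i: number still at risk just before t_i
d <- sapply(event_times, function(ti) sum(time == ti & event == 1))
n <- sapply(event_times, function(ti) sum(time >= ti))

# Product-limit estimate: cumulative product of (1 - d_i / n_i)
km_manual <- cumprod(1 - d / n)

# Same estimate from survfit(), evaluated at the event times
fit <- survfit(Surv(time, event) ~ 1)
km_survfit <- summary(fit, times = event_times)$surv

data.frame(time = event_times, n_risk = n, n_event = d,
           manual = km_manual, survfit = km_survfit)
```

The two columns `manual` and `survfit` should agree, which shows that censored subjects only affect the estimate by leaving the risk set.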

**Example:**

We will use the survival package for the analysis, with the **lung** dataset that ships with it. The dataset contains data on 228 patients with advanced lung cancer from the North Central Cancer Treatment Group, described by 10 variables. It contains missing values, so missing-value treatment is presumed to be done on your side before building the model.
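As a rough sketch of the missing-value step mentioned above, the snippet below counts the missing values per column and then drops incomplete rows. Dropping rows with `na.omit()` is only one possible treatment and is an assumption made here for illustration; imputation may be more appropriate for your data.

```r
library(survival)

# Count missing values in each column of the lung dataset
na_counts <- colSums(is.na(lung))
na_counts

# One simple treatment (an assumption for this sketch): keep complete rows only
lung_complete <- na.omit(lung)

nrow(lung)           # rows before
nrow(lung_complete)  # rows after dropping incomplete rows
```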

```r
# Installing package
install.packages("survival")

# Loading package
library(survival)

# Dataset information
?lung

# Fitting the survival model
Survival_Function <- survfit(Surv(lung$time, lung$status == 2) ~ 1)
Survival_Function

# Plotting the function
plot(Survival_Function)
```


Here, we are interested in **time** and **status**, as they play the main role in the analysis. **time** is the survival time of each patient in days. **status** records whether the patient died (2) or was censored (1), i.e. was still alive when last observed, so `lung$status == 2` marks the deaths.

The **Surv()** function takes time and status as input and creates a survival object, which serves as the input of the `survfit()` function. We pass `~ 1` in `survfit()` to tell the function to fit the model on the basis of the survival object alone, with only an intercept, i.e. a single curve for all patients.

**survfit()** creates survival curves and prints the number of observations, the number of events (deaths), the median survival time and its 95% confidence interval. The plot gives the following output:

Here, the x-axis shows the **number of days** and the y-axis shows the **probability of survival**. The dashed lines are the upper and lower confidence limits.

The confidence interval shows the expected margin of error of the estimate. For example, at 200 days the upper confidence limit is about 0.76 (76%) and the lower limit is about **0.60 (60%)**.
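Rather than reading the values off the plot, the estimate and its confidence limits at a given time can be extracted from the fitted object. This is a small sketch using `summary()` with its `times` argument; the choice of day 200 simply mirrors the example above.

```r
library(survival)

fit <- survfit(Surv(time, status == 2) ~ 1, data = lung)

# Survival estimate and 95% confidence limits at day 200
at_200 <- summary(fit, times = 200)
c(survival = at_200$surv, lower = at_200$lower, upper = at_200$upper)
```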

#### Cox Proportional Hazards Model

It is a regression model that measures the instantaneous risk of death and is a bit more difficult to illustrate than the Kaplan-Meier estimator. It is built on the hazard function **h(t)**, which describes the instantaneous rate at which the event (e.g. death) occurs at time **t**, given survival up to **t**. The hazard function takes covariates (the independent variables of the regression) into account to compare the survival of patient groups.
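For reference, the Cox model is usually written in the following standard form. The baseline hazard $h_0(t)$, the covariates $x_1, \dots, x_p$ and the coefficients $\beta_1, \dots, \beta_p$ are notation introduced here, not symbols used elsewhere in this article:

$$h(t \mid x_1, \dots, x_p) = h_0(t)\, \exp(\beta_1 x_1 + \dots + \beta_p x_p)$$

Under this form, $\exp(\beta_k)$ is the hazard ratio associated with a one-unit increase in $x_k$ while the other covariates are held fixed.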

It does not assume an underlying probability distribution, but it assumes that the ratio of the hazards of the patient groups we compare is constant over time. Because of this it is known as the “**proportional hazards model**”.
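The proportionality assumption can be checked on a fitted model. The sketch below uses `cox.zph()` from the survival package, which tests the assumption using scaled Schoenfeld residuals. The three covariates are an arbitrary subset chosen here for brevity, not a recommended model.

```r
library(survival)

# Illustrative model with a small, assumed subset of covariates
fit <- coxph(Surv(time, status == 2) ~ age + sex + ph.ecog, data = lung)

# Test of the proportional hazards assumption, per covariate and globally
ph_check <- cox.zph(fit)
ph_check

# Optional: plot(ph_check) draws the residuals against time for each covariate
```

Small p-values in the output suggest that the effect of the corresponding covariate changes over time, i.e. that the assumption may not hold for it.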

**Example:**

We will again use the survival package and its **lung** dataset (228 patients with advanced lung cancer from the North Central Cancer Treatment Group, 10 variables). As before, missing-value treatment is presumed to be done on your side before building the model. We will use the Cox proportional hazards function **coxph()** to build the model.

```r
# Installing package
install.packages("survival")

# Loading package
library(survival)

# Dataset information
?lung

# Fitting the Cox model
# Columns are named directly (not as lung$time) so that "." expands to
# the remaining columns only and does not include time and status
Cox_mod <- coxph(Surv(time, status == 2) ~ ., data = lung)

# Summarizing the model
summary(Cox_mod)

# Fitting survfit()
Cox <- survfit(Cox_mod)

# Plotting the function
plot(Cox)
```


As in the Kaplan-Meier example, **time** and **status** define the outcome: **time** is the survival time of each patient, and **status** records whether the patient died or was censored.

The **Surv()** function takes time and status as input and creates a survival object, which serves as the response in the `coxph()` model formula. Passing `~ .` together with `data = lung` tells the function to use the other columns of the dataset as covariates. Calling `survfit()` on the fitted model then produces the estimated survival curve.

The Cox_mod output is similar to that of a regression model. Some of the important features are age, sex, ph.ecog and wt.loss. The plot gives the following output:
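To make the regression-style output easier to interpret, the coefficients can be turned into hazard ratios. The sketch below refits a model on just the four features named above (an assumption made for brevity) and exponentiates the coefficients and their confidence limits.

```r
library(survival)

# Model restricted to the four features mentioned in the text
fit <- coxph(Surv(time, status == 2) ~ age + sex + ph.ecog + wt.loss,
             data = lung)

# Hazard ratios: exp(coefficient); values below 1 indicate a lower hazard
hr <- exp(coef(fit))

# 95% confidence limits on the hazard-ratio scale
ci <- exp(confint(fit))

round(cbind(HR = hr, ci), 3)
```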

Here, the x-axis shows the number of days and the y-axis shows the **probability of survival**. The dashed lines are the upper and lower confidence limits. In comparison with the Kaplan-Meier plot, the Cox curve is higher at early times and lower at later times, because the Cox model includes additional variables.
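One way to make this comparison directly is to draw both curves on the same axes. The following is a sketch under a few assumptions: the Cox model uses an arbitrary subset of covariates, confidence bands are switched off for readability, and the colours and legend labels are illustrative choices. Note that `survfit()` on a Cox model without new data gives the curve for a subject with average covariate values, and that rows with missing covariates are dropped from the Cox fit.

```r
library(survival)

# Kaplan-Meier curve for all patients
km <- survfit(Surv(time, status == 2) ~ 1, data = lung)

# Cox model on an assumed subset of covariates, then its survival curve
cox_fit <- coxph(Surv(time, status == 2) ~ age + sex + ph.ecog + wt.loss,
                 data = lung)
cox <- survfit(cox_fit)

# Overlay the two curves without confidence bands
plot(km, conf.int = FALSE, xlab = "Days", ylab = "Survival probability")
lines(cox, conf.int = FALSE, col = "red", lty = 2)
legend("topright",
       legend = c("Kaplan-Meier", "Cox (average covariates)"),
       col = c("black", "red"), lty = c(1, 2))
```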

The confidence interval again shows the expected margin of error. For example, at 200 days the upper confidence limit is about **0.82 (82%)** and the lower limit is about **0.70 (70%)**.

Note: The Cox model can give better results than Kaplan-Meier because it takes the features (covariates) in the data into account. The Cox curve is also higher at early times and drops more sharply as time increases.
