survivalROC.C: about "Survival Model Predictive Accuracy and ROC Curves"
In "Survival Model Predictive Accuracy and ROC Curves" (Heagerty and Zheng, 2005) there are two examples: the first uses the VA lung cancer data and the second the Mayo PBC data.
I want to know how the markers are defined in these two datasets and how the AUC in the paper is plotted with R. The paper says: "A model score is derived using Cox regression with Karnofsky score, age, and cell type. ROC curves are estimated using a varying-coefficient Cox model with the derived model score as the single predictor."
I am not sure what it means. Can anyone help me to figure it out?
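For anyone trying to unpack the quoted passage: it describes a two-stage recipe (fit a Cox model, use its linear predictor as a single composite marker, then estimate time-dependent ROC curves for that marker). The sketch below is not the paper's estimator (Heagerty and Zheng use a varying-coefficient Cox model and account for censoring); it only illustrates the cumulative/dynamic AUC definition on fully observed, simulated data, with all names hypothetical:

```python
import numpy as np

def cumulative_dynamic_auc(event_times, scores, t):
    """AUC at horizon t: cases have event_times <= t, controls > t.
    Ignores censoring -- a simplification of the paper's setting."""
    cases = scores[event_times <= t]
    controls = scores[event_times > t]
    if len(cases) == 0 or len(controls) == 0:
        return float("nan")
    # P(score_case > score_control), ties counted 1/2 (Mann-Whitney form)
    greater = (cases[:, None] > controls[None, :]).sum()
    ties = (cases[:, None] == controls[None, :]).sum()
    return (greater + 0.5 * ties) / (len(cases) * len(controls))

# simulated data: higher "model score" -> earlier event (Cox-type hazard)
rng = np.random.default_rng(0)
scores = rng.normal(size=200)                       # stand-in for a Cox linear predictor
event_times = rng.exponential(scale=np.exp(-scores))

auc_1 = cumulative_dynamic_auc(event_times, scores, t=1.0)
print(round(auc_1, 3))  # clearly above 0.5: the score is prognostic
```

In R, this two-stage pattern is roughly what the survivalROC package automates: the marker argument would be the Cox linear predictor from the first stage.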
See also questions close to this topic

2 y-axes dumbbell plot in ggplot2
I am quite new to R and programming in general. So please forgive my ignorance, I am trying to learn.
I have two sets of data and I would like to plot them against each other. Both have 27 rows and 3 columns; one set is called "range" and the other is called "rangePx". Column “Comp” has the different components, column “Min” is the minimum concentration in % and column “Max” is the maximum concentration in %.
I want to make a 2y axis dumbbell plot, with the y axis being the different components and x axis being the concentration.
I do manage to create a one-y-axis dumbbell plot, but I have trouble adding the second y-axis.
Here is a snap from the "range" data
head(range)
# A tibble: 6 x 3
  Comp        Min    Max
  <chr>     <dbl>  <dbl>
1 Methane  0.0100 100
2 Ethane   0.0100  65.0
3 Ethene   0.100   20.0
4 Propane  0.0100  40.0
5 Propene  0.100    6.00
6 Propadien 0.0500  2.00
and here is a snap from the "rangePx" data
head(rangePx)
# A tibble: 6 x 3
  Comp          Min    Max
  <chr>       <dbl>  <dbl>
1 Methane  50.0     100
2 Ethane    0.00800  14.0
3 Ethene    0         0
4 Propane   0.00800   8.00
5 Propene   0         0
6 Propadien 0         0
Here is the piece of code that I use:
library(ggplot2)
library(ggalt)
library(readxl)

theme_set(theme_classic())

range <- read_excel("range.xlsx")
rangePx <- read_excel("rangePx.xlsx")

p <- ggplot(range, aes(x=Max, xend=Min, y=Comp, group=Comp))
p <- p + geom_dumbbell(color="blue")
p

px <- ggplot(rangePx, aes(x=Max, xend=Min, y=Comp, group=Comp))
px <- px + geom_dumbbell(color="green")

p <- p + geom_dumbbell(aes(y=px, color="red"))
p
and here is the complaint I get when I call
p
Error: Aesthetics must be either length 1 or the same as the data (27): y, colour, x, xend, group
Note that head() above shows only a 6x3 preview; my original data are 27x3.
Can anyone help me?
Thanks in advance

Trouble trying to clean a character vector in an R data frame (UTF-8 encoding issue)
I'm having some issues cleaning up a dataset after I manually extracted the data online; I'm guessing these are encoding issues. I have an issue trying to remove the "U+00A0" in the "Athlete" column cells along with the surrounding angle brackets. I looked up the corresponding UTF-8 code point and it is "NO-BREAK SPACE". I'm also not sure how to replace the other UTF-8 characters to make the names legible, e.g. getting U+008A to display as Š.
Subset of data
head2007decathlon < structure(list(Rank = 1:6, Athlete = c("<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)", "<U+00A0>Maurice Smith<U+00A0>(JAM)", "<U+00A0>Dmitriy Karpov<U+00A0>(KAZ)", "<U+00A0>Aleksey Drozdov<U+00A0>(RUS)", "<U+00A0>Andr<e9> Niklaus<U+00A0>(GER)", "<U+00A0>Aleksey Sysoyev<U+00A0>(RUS)"), Total = c(8676L, 8644L, 8586L, 8475L, 8371L, 8357L), `100m` = c(11.04, 10.62, 10.7, 10.97, 11.12, 10.8), LJ = c(7.56, 7.5, 7.19, 7.25, 7.42, 7.01), SP = c(15.92, 17.32, 16.08, 16.49, 14.12, 16.16), HJ = c(2.12, 1.97, 2.06, 2.12, 2.06, 2.03), `400m` = c(48.8, 47.48, 47.44, 50, 49.4, 48.42), `110mh` = c(14.33, 13.91, 14.03, 14.76, 14.51, 14.59), DT = c(48.75, 52.36, 48.95, 48.62, 44.48, 49.76), PV = c(4.8, 4.8, 5, 5, 5.3, 4.9), JT = c(71.18, 53.61, 59.84, 65.51, 63.28, 57.75), `1500m` = c(275.32, 273.52, 279.68, 276.93, 272.5, 276.16), Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2007", class = "factor"), Nationality = c(NA, NA, NA, NA, NA, NA)), .Names = c("Rank", "Athlete", "Total", "100m", "LJ", "SP", "HJ", "400m", "110mh", "DT", "PV", "JT", "1500m", "Year", "Nationality"), row.names = c(NA, 6L), class = c("tbl_df", "tbl", "data.frame"))
This is what I've tried so far to no success:
1) head2007decathlon$Athlete <- gsub(pattern="\U00A0", replacement="", x=head2007decathlon$Athlete)
2) head2007decathlon$Athlete <- gsub(pattern="<U00A0>", replacement="", x=head2007decathlon$Athlete)
3) head2007decathlon$Athlete <- iconv(head2007decathlon$Athlete, from="UTF-8", to="LATIN1")
4) Encoding(head2007decathlon$Athlete) <- "UTF-8"
5) head2007decathlon$Athlete <- enc2utf8(head2007decathlon$Athlete)

Create a plot from boxplot.stats
Someone sent me a file containing the output of boxplot.stats as a list.
I now want to reproduce and plot this boxplot from the list (I have stats, n, conf and out). How should I proceed? Can I use plotly for this purpose?

Can't use survfit on some data frames
I have a dataset I'm going to use for survival analysis, and it seems to work fine when I use the whole set. However, once I slice it into smaller data frames using
data[which(data$variable1=="somevalue")]
the thing seems to break down. Most of the resulting smaller data frames work fine, but some are a problem. In the problematic ones, I can use
summary(survfit(Surv(time, status)~variable2, data=smalldataframe))$surv
without a problem, but when I try
summary(survfit(Surv(time, status)~variable2, data=smalldataframe), time=5)$surv
it throws
Error in array(xx, dim = dd) : negative length vectors are not allowed
I've tried looking at the data to see if I have any weird values, like negative times, but there aren't any. Besides, if there were a problem with that, the full data frame should be throwing an error too, but it doesn't. All the smaller data frames are created with the same line of code, so I also don't understand why they behave differently. And mostly, I don't understand why
summary(survfit(...))$surv
works fine, as does
plot(survfit(...))
but when I want to calculate survival at a specific time, it suddenly doesn't like the data anymore. Here's one of the offending data frames:
test <- structure(list(
  time2 = c(0.15, 2.08, 2.06, 0.32, 39.45, 39.09, 2.57, 3.64, 13.57, 36.57, 36.26, 0.78, 0.1, 33.94, 3.1, NA, 1.77, 28.38, 1.24, NA, 1.87, 25.83, 2.62, 1.57, 1.6, 22.74, 21.03, 20.54, 20.03, 0.97, 19.35, 18.09, 2.61, 17.68, NA, 3.85, 3.52, 11.22, 11.52, 11.04, 10.51, 1.68, 10.4, 10.61, 9.01, 9.05, 7.8, 0.11, 4.83),
  status = c(1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, NA, 1, 1, 1, NA, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, NA, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0),
  cas_dg = c(1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9)),
  .Names = c("time2", "status", "cas_dg"),
  row.names = c(NA, 49L), class = "data.frame")
The call that is giving me trouble is
summary(survfit(Surv(time2, status)~cas_dg, data=test), time=5)$surv
and that only with some of the smaller dataframes. 
Propensity score matching using psmatch2 in Stata, then running stcox
I have ~4500 exposed people (exposed to drug A) and ~2000 non-exposed people. There is quite a bit of imbalance between the groups, so I have made a propensity score from ~30 variables covering comorbidities, other drugs, health service utilisation, demographics, etc. There is good overlap between the distributions of the PS for both groups.
Now I want to use this score to match the exposed and the non-exposed. I'm using psmatch2 in Stata. I want to do 1:1 matching, with replacement.
psmatch2 exp, outcome(primary) pscore(PS) neighbor(1) caliper(0.22)
Where "primary" is the primary outcome of stroke, "PS" is my pre-calculated propensity score, and 0.22 is 0.2 times the standard deviation of the logit of the PS.
This approach uses every single person in my dataset. But because I have so many more exposed than non-exposed, some non-exposed people are being used up to 52 times in matches. I'm not too concerned about this. The approach has given good balance on a selection of variables; see below.
My question is: I now want to run an stcox model. How do I tell stcox who the matches are? For example, for the person who is used 52 times as a match for exposed people, stcox has no way of knowing this unless I tell it I'm using a matched dataset. Does anyone know how to do this? I suspect some of the variables generated by psmatch2 may be the answer, but I am unsure how to use them.

Trying to specify predict.coxph type within map2() function
I've been scouring the web for the last few days looking at the documentation for map2. I have taken a training set, nested the data and created coxph models for it, saving those models in the nested table. Now I want to predict from those models, using type="expected" because, according to the documentation (R documentation: predict.coxph),
The survival probability for a subject is equal to exp(-expected)
I've adapted the relevant code to reproduce my issues using the mpg data set.
I have 4 examples below that do not work, after the predict call that does work. Please note that I have removed the coxph.null models from this set, so the only models are of class "coxph". This code can be used to replicate the errors.
# Needed libraries
library(ggplot2)
library(tidyverse)
library(purrr)
library(broom)
library(survival)

# Create data set
mpg_data <- mpg
mpg_data <- mpg_data %>% mutate(mpg_diff = cty - hwy)
mpg_data <- mpg_data %>% mutate(EVENT = (mpg_diff >= -8))
set.seed(1)
mpg_data <- mpg_data %>% mutate(TIME_TO_EVENT = as.integer(runif(234, 1, 100)))
mpg_nested <- mpg_data %>% group_by(manufacturer) %>% mutate(n_prot = length(model)) %>% nest()

# Stepwise regression
stepwise <- function(data) {
  response <- Surv(time = data$TIME_TO_EVENT, event = data$EVENT, type = "right")
  full <- "Surv(time = data$TIME_TO_EVENT, event = data$EVENT, type = 'right') ~ data$cyl+data$cty+data$hwy+data$displ"
  x <- factor(as.factor(data$model))
  full <- ifelse(nlevels(x) >= 2, paste(full, "as.character(data$model)", sep = "+"), full)
  x <- factor(as.factor(data$trans))
  full <- ifelse(nlevels(x) >= 2, paste(full, "as.character(data$trans)", sep = "+"), full)
  x <- factor(as.factor(data$drv))
  full <- ifelse(nlevels(x) >= 2, paste(full, "as.character(data$drv)", sep = "+"), full)
  null_model_ONE <- coxph(response ~ 1, data=data)
  full_model_ONE <- coxph(as.formula(full), data=data)
  model_ONE <- step(null_model_ONE, scope=list(lower=null_model_ONE, upper=full_model_ONE))
}

survival_mpg <- mpg_nested %>% mutate(model_fit = map(data, stepwise))

# Predicting values
# This works but is not type="expected"
survival_mpg_predict <- survival_mpg %>% mutate(mpg_predict = map2(model_fit, data, predict))

## TRY 1 ##
predict.F <- function(model_fit, data){
  predict(model_fit, newdata=data, type="expected")
}
survival_mpg_predict <- survival_mpg %>% mutate(mpg_predict = map2(model_fit, data, predict.F))
# Error in mutate_impl(.data, dots) : Evaluation error: requires numeric/complex matrix/vector arguments.

## TRY 2 ##
survival_mpg_predict <- survival_mpg %>% mutate(mpg_predict = map2(model_fit, data, predict(model_fit, newdata = data, type="expected")))
# Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'predict' applied to an object of class "list".

## TRY 3 ##
survival_mpg_predict <- survival_mpg %>% mutate(mpg_predict = map2(model_fit, data, ~ predict(.x, newdata = .y, type="expected")))
# Error in mutate_impl(.data, dots) : Evaluation error: requires numeric/complex matrix/vector arguments.

## TRY 4 ##
survival_mpg_predict <- survival_mpg %>% mutate(mpg_predict = map2(model_fit, data, function(model_fit, data) predict(model_fit, newdata=data, type="expected")))
# Error in mutate_impl(.data, dots) : Evaluation error: requires numeric/complex matrix/vector arguments.

ROC curve + statistical significance
I have 3 classifier models (A, B and C). The models were tested on held-out test data for which I have the associated ground truth. Using the predictions made by each model and the ground truth, I was able to plot the ROC curves and compute the area under each ROC curve. From the curves, I observed that model A did better than B and C.
From the predictions made by the models and the ground truth, is it possible to show that the performance of one model is significantly different from the others, in Python?
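Since the question asks for Python: the classical answer is DeLong's test for correlated ROC curves, but a simple, assumption-light alternative is a paired bootstrap of the AUC difference on the shared test set. A sketch using only numpy (all data and names here are made up for illustration):

```python
import numpy as np

def auc(y_true, scores):
    """Mann-Whitney AUC for binary labels (1 = positive)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=1000, seed=0):
    """Two-sided bootstrap p-value for AUC(A) - AUC(B) = 0, paired on the same test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, n)          # resample test cases with replacement
        yb = y_true[idx]
        if yb.min() == yb.max():             # resample must contain both classes
            continue
        diffs.append(auc(yb, scores_a[idx]) - auc(yb, scores_b[idx]))
    diffs = np.array(diffs)
    observed = auc(y_true, scores_a) - auc(y_true, scores_b)
    # two-sided p-value: how often the bootstrap difference crosses zero
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return observed, min(p, 1.0)

# toy example: model A is genuinely better than model B
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 400)
scores_a = y + rng.normal(scale=0.7, size=400)   # informative model
scores_b = rng.normal(size=400)                  # pure noise
diff, p = bootstrap_auc_diff(y, scores_a, scores_b)
```

The same resampled indices are used for both models in each iteration, which is what makes the comparison paired.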

How is the ROC curve plotted in Viola Jones face detection paper?
I am reading the paper by Viola and Jones, where they use an ROC curve to measure the accuracy of their classifier.
https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/violacvpr01.pdf
Could someone please explain how the ROC curve is plotted in the case of a binary classifier like face vs. non-face? I mean, how are the data points obtained?
(X, Y) = (false positives, correct detection rate)
Do I have to calculate these points over all the positives and negatives of my data set? My positive and negative data sets are of different sizes, so I am a bit confused.
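My reading of it (a sketch of the idea, not an official account of the paper): each point of the curve comes from sweeping the detector's final threshold. At each threshold you count false positives on the negative (non-face) set and the detection rate on the positive (face) set, so the two sets may have different sizes because each quantity is computed on its own set. A toy numpy illustration with made-up scores:

```python
import numpy as np

def roc_points(pos_scores, neg_scores):
    """(false positives, detection rate) pairs, one per candidate threshold.
    Detection rate is normalised by the positive set; false positives are
    counted on the negative set, so unequal set sizes are fine."""
    thresholds = np.unique(np.concatenate([pos_scores, neg_scores]))
    points = []
    for t in thresholds[::-1]:                       # strictest threshold first
        fp = int((neg_scores >= t).sum())            # raw count, as on the paper's x-axis
        det = float((pos_scores >= t).mean())        # fraction of faces detected
        points.append((fp, det))
    return points

pos = np.array([0.9, 0.8, 0.7, 0.4])   # detector scores on face windows
neg = np.array([0.6, 0.3, 0.2])        # scores on non-face windows (different size)
pts = roc_points(pos, neg)
```

Lowering the threshold can only add detections and false positives, which is why the resulting curve is monotone.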

How to use the PRROC package to get the AUC of the ROC & PR curves for a random forest in R
My data source: https://www.kaggle.com/mlg-ulb/creditcardfraud. The dataset is highly unbalanced; the positive class (frauds) accounts for 0.172% of all transactions.
I was using the PRROC package to get AUC of ROC curve, here is my random forest code:
rf.model <- randomForest(Class ~ ., data = training, ntree = 2000, nodesize = 20)
rf_pred <- predict(rf.model, test, type="prob")
So, as expected, rf_pred should return the probability of each class. Then, I used the following code:
fg_rf <- rf_pred[test$Class==1]
bg_rf <- rf_pred[test$Class==0]
roc_rf <- roc.curve(scores.class0 = fg_rf, scores.class1 = bg_rf, curve = T)
However, the ROC curve turned out not to be what I expected, and the same problem occurred for the PR curve. Is it because of the high class imbalance? And assuming rf_pred returns the probabilities for 0/1, how can I make fg_rf equal to the probability of class==1? Is my code
fg_rf <- rf_pred[test$Class==1]
correct?
Measure Accuracy of LSTM for text classification in R
I am using R (single-user desktop version) on Windows. I have text from Twitter which I am using for text classification. I have created a corpus of the text and developed a model using LSTM in R. Everything is fine, but I am not able to find the accuracy of the model, nor plot the ROC curve and find the AUC value. My model and inference-model code is as follows:
Train the model
model <- mx.lstm(X.train, X.val, ctx=mx.cpu(),
                 num.round=num.round, update.period=update.period,
                 num.lstm.layer=num.lstm.layer, seq.len=seq.len,
                 num.hidden=num.hidden, num.embed=num.embed,
                 num.label=vocab, batch.size=batch.size,
                 input.size=vocab, initializer=mx.init.uniform(0.01),
                 learning.rate=learning.rate, wd=wd, optimizer="sgd",
                 clip_gradient=clip_gradient)

infer.model <- mx.lstm.inference(num.lstm.layer=num.lstm.layer,
                                 input.size=vocab, num.hidden=num.hidden,
                                 num.embed=num.embed, num.label=vocab,
                                 arg.params=model$arg.params, ctx=mx.cpu())
Please advise as to how I can measure the accuracy of the model and find AUC value and plot ROC curve.
Thanks in advance Vijay Zutshi

How can I plot ROC and obtain AUC in R?
I have the actual (binary) label vector; it is a one-dimensional array. I have the test data; it is another one-dimensional array.
My classification rule is: if an element of the test data vector is > 0.5, assign label 1; otherwise assign 0.
So I get a one-dimensional binary array with the predicted labels.
How can I plot the ROC curve and compute the AUC using functions such as those in ROCR in R, as these require a prediction model? Thanks!
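One point that may help regardless of package: an ROC curve needs the continuous test values themselves, not the 0/1 labels left after applying the 0.5 cut-off; thresholding first collapses the whole curve to a single operating point (in ROCR you would likewise pass the raw values, plus the true labels, to prediction()). A small Python illustration of the difference, using hypothetical data:

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """(FPR, TPR) at each distinct score threshold (score >= t -> predict 1)."""
    pts = []
    for t in np.unique(scores)[::-1]:
        pred = scores >= t
        tpr = (pred & (y_true == 1)).sum() / (y_true == 1).sum()
        fpr = (pred & (y_true == 0)).sum() / (y_true == 0).sum()
        pts.append((fpr, tpr))
    return pts

y = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.6])

curve = roc_curve_points(y, scores)          # several points -> a real curve
labels = (scores > 0.5).astype(int)          # the 0.5-rule predictions
degenerate = roc_curve_points(y, labels)     # only two thresholds survive
```

Here `curve` has one point per distinct raw score, while `degenerate` keeps just the single operating point the 0.5 rule chose (plus the trivial all-positive corner).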

Getting a perfect ROC-AUC score for Linear SVC
I am evaluating different classifiers for my sentiment analysis model. I am looking at all available metrics, and whilst most achieve similar precision, recall, F1 and ROC-AUC scores, Linear SVM appears to get a perfect ROC-AUC score. Look at the chart below:
Abbreviations: MNB=Multinomial Naive Bayes, SGD=Stochastic Gradient Descent, LR=Logistic Regression, LSVC=Linear Support Vector Classification
Here are the rest of the performance metrics for LSVC, which are very similar to the rest of the classifiers:
             precision    recall  f1-score   support
        neg       0.83      0.90      0.87     24979
        pos       0.90      0.82      0.86     25021
avg / total       0.87      0.86      0.86     50000
As you can see the dataset is balanced for pos and neg comments.
Here is the relevant code:
def evaluate(classifier):
    predicted = classifier.predict(testing_text)
    if isinstance(classifier.steps[2][1], LinearSVC):
        probabilities = np.array(classifier.decision_function(testing_text))
        scores = probabilities
    else:
        probabilities = np.array(classifier.predict_proba(testing_text))
        scores = np.max(probabilities, axis=1)
    pos_idx = np.where(predicted == 'pos')
    predicted_true_binary = np.zeros(predicted.shape)
    predicted_true_binary[pos_idx] = 1
    fpr, tpr, thresholds = metrics.roc_curve(predicted_true_binary, scores)
    auc = metrics.roc_auc_score(predicted_true_binary, scores)
    mean_acc = np.mean(predicted == testing_category)
    report = metrics.classification_report(testing_category, predicted)
    confusion_matrix = metrics.confusion_matrix(testing_category, predicted)
    return fpr, tpr, auc, mean_acc, report, confusion_matrix
I am using predict_proba for all classifiers apart from LSVC, which uses decision_function instead (since it does not have a predict_proba method). What's going on?
EDIT: changes according to @Vivek Kumar's comments:
def evaluate(classifier):
    predicted = classifier.predict(testing_text)
    if isinstance(classifier.steps[2][1], LinearSVC):
        probabilities = np.array(classifier.decision_function(testing_text))
        scores = probabilities
    else:
        probabilities = np.array(classifier.predict_proba(testing_text))
        scores = probabilities[:, 1]  # NEW
    testing_category_array = np.array(testing_category)  # NEW
    pos_idx = np.where(testing_category_array == 'pos')
    predicted_true_binary = np.zeros(testing_category_array.shape)
    predicted_true_binary[pos_idx] = 1
    fpr, tpr, thresholds = metrics.roc_curve(predicted_true_binary, scores)
    auc = metrics.roc_auc_score(predicted_true_binary, scores)
    mean_acc = np.mean(predicted == testing_category)
    report = metrics.classification_report(testing_category, predicted)
    confusion_matrix = metrics.confusion_matrix(testing_category, predicted)
    return fpr, tpr, auc, mean_acc, report, confusion_matrix
This now yields this graph:
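For the record, the perfect score in the first version is forced by the code itself: roc_curve was given predicted_true_binary derived from the model's own predictions, and for LinearSVC those predictions are exactly determined by the sign of decision_function, so the scores rank that target perfectly by construction. A self-contained demonstration of the mechanism (synthetic numbers, not the actual pipeline):

```python
import numpy as np

def auc(y_true, scores):
    """Mann-Whitney AUC (probability a positive outranks a negative)."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
scores = rng.normal(size=300)                              # stand-in for decision_function
y_true = (scores + rng.normal(size=300) > 0).astype(int)   # noisy real labels
y_pred = (scores > 0).astype(int)                          # labels implied by the scores

auc_vs_predictions = auc(y_pred, scores)   # always 1.0 by construction
auc_vs_truth = auc(y_true, scores)         # the honest number, below 1.0
```

Every "positive" under y_pred has a score above every "negative" by definition of the sign rule, so the first AUC cannot be anything but 1.0; only scoring against the ground truth is informative.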

Violation of the PH assumption
When running a survival analysis, sometimes the p-value for a variable is statistically significant (say, with a positive association with the outcome) but the PH assumption has been violated.
What are the possible scenarios after correcting for the PH violation?
- The p-value may not be significant anymore.
- The p-value stays significant, but the size of the HR may change.
- The p-value stays significant, but the direction of the association may be reversed (i.e. a positive association may end up being negative).

How to correctly fit Cox Regression with coxphf? (Firth’s penalized maximum likelihood bias reduction method in R)
I am trying to fit a Cox regression (package: survival) and have run into a problem. When I try to fit a "regular" Cox model with my data, I receive the error message "X matrix deemed to be singular; variable 9" (and if I remove variable 9, the problem becomes variable 8). As far as I understand, this happens because too many patients with these variables have the same event (I believe in another question this was called "perfect classification").
That's why I tried to fit a Cox model with the coxphf function (from the package of the same name), as this should take care of the problem by using Firth's penalized maximum likelihood bias reduction method for the Cox regression. But this also doesn't seem to work until I increase maxit from the default 50 to 1000 and remove the "Undefined" variable from the formula. And if I instead remove the "Undefined" variable from my dataset (it affects only 1 person), the model doesn't work anymore.
So my question is: how can I solve this? Is it even appropriate/necessary to remove the whole variable (and therefore that 1 person) from the dataset? I have probably made some very obvious mistakes, but please bear with me, since I have absolutely no background in statistics. Thank you very much in advance.
I included the following sample data, as well as my attempts to fit the Cox model. This is how I managed to get the model to work, by leaving the "Undefined" variable out of the model:
# load packages
library("survival")
library("coxphf")

example <- structure(list(
  Pat.nr. = c(1L, 2L, 5L, 7L, 8L, 10L, 13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 26L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 39L, 41L, 42L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 52L, 53L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 67L, 68L, 69L),
  OS = c(1.6, 34.6, 1.5, 35.8, 7.7, 38.6, 37.6, 8.6, 0.6, 5.7, 0.6, 43.9, 25.8, 7.3, 28.1, 43.8, 12.8, 18.5, 36.1, 43.1, 15.4, 37.6, 8.6, 2.7, 10.2, 8.1, 37.3, 25.3, 3.7, 26.1, 41.2, 5.9, 15.5, 56.8, 29.5, 52.1, 5.4, 54.8, 53.5, 16.6, 49.2, 53.8, 8.5, 56, 7.4, 28, 3.3, 38, 55.7, 0.4),
  Event = c(1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L),
  Age = c(68.41, 54.9, 44.44, 64.14, 68.86, 62.93, 40.76, 31.06, 42.97, 69.16, 47.39, 60.14, 27.9, 56.57, 19.63, 47.75, 45.58, 66.22, 43.73, 45.34, 38.83, 54.46, 48.91, 70.3, 60.51, 68.55, 63.18, 55.89, 68.27, 57.25, 56.17, 60.83, 74.42, 71.3, 40.36, 50.85, 59.61, 50.14, 45.77, 19.34, 56.32, 53.38, 70.7, 55.25, 56.05, 44.06, 51.36, 69.37, 69.71, 75.44),
  Favorable = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L),
  Intermediate = c(0L, 2L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 2L, 0L, 2L, 2L, 0L, 2L, 0L, 2L, 2L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 2L, 0L, 0L, 2L, 2L, 2L, 0L, 2L, 2L, 0L, 0L, 2L, 0L, 2L, 0L, 0L, 0L, 2L, 2L, 0L, 2L, 0L, 2L, 2L),
  Adverse = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L),
  Undefined = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
  .Names = c("Pat.nr.", "OS", "Event", "Age", "Favorable", "Intermediate", "Adverse", "Undefined"),
  row.names = c(NA, 50L), class = "data.frame")

# rows & columns
n_row <- dim(example)[1]
n_col <- dim(example)[2]

# variables:
OS <- c(example[1:n_row, 2])
Event <- c(example[1:n_row, 3])
age <- c(example[1:n_row, 4])
Favorable <- c(example[1:n_row, 5])
Intermediate <- c(example[1:n_row, 6])
Adverse <- c(example[1:n_row, 7])
Undefined <- c(example[1:n_row, 8])

# dependent and independent variables
y <- Surv(OS, Event)
x <- cbind(age, Favorable, Intermediate, Adverse, Undefined)
example <- data.frame(cbind(x, y))

# coxphf with Firth's penalized likelihood -> which doesn't seem to work
cox2 <- coxphf(data=example, y~x, firth=TRUE, pl=TRUE, maxit=1000)
summary(cox2)

# coxphf with Firth's penalized likelihood (without variable "Undefined") -> this works
cox2 <- coxphf(data=example, y~age+Favorable+Intermediate+Adverse, firth=TRUE, pl=TRUE, maxit=1000)
summary(cox2)
And here, I have modified the dataset to not include the undefined variable (and the model doesn't work anymore):
example1 <- structure(list(
  Pat.nr. = c(1L, 2L, 5L, 7L, 8L, 10L, 13L, 14L, 15L, 17L, 19L, 20L, 21L, 22L, 23L, 26L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 39L, 41L, 42L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 52L, 53L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 67L, 68L, 69L, 72L),
  OS = c(1.6, 34.6, 1.5, 35.8, 7.7, 38.6, 37.6, 8.6, 0.6, 5.7, 43.9, 25.8, 7.3, 28.1, 43.8, 12.8, 18.5, 36.1, 43.1, 15.4, 37.6, 8.6, 2.7, 10.2, 8.1, 37.3, 25.3, 3.7, 26.1, 41.2, 5.9, 15.5, 56.8, 29.5, 52.1, 5.4, 54.8, 53.5, 16.6, 49.2, 53.8, 8.5, 56, 7.4, 28, 3.3, 38, 55.7, 0.4, 2.8),
  Event = c(1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L),
  Age = c(68.41, 54.9, 44.44, 64.14, 68.86, 62.93, 40.76, 31.06, 42.97, 69.16, 60.14, 27.9, 56.57, 19.63, 47.75, 45.58, 66.22, 43.73, 45.34, 38.83, 54.46, 48.91, 70.3, 60.51, 68.55, 63.18, 55.89, 68.27, 57.25, 56.17, 60.83, 74.42, 71.3, 40.36, 50.85, 59.61, 50.14, 45.77, 19.34, 56.32, 53.38, 70.7, 55.25, 56.05, 44.06, 51.36, 69.37, 69.71, 75.44, 71.05),
  Favorable = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L),
  Intermediate = c(0L, 2L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 2L, 2L, 2L, 0L, 2L, 0L, 2L, 2L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 2L, 0L, 0L, 2L, 2L, 2L, 0L, 2L, 2L, 0L, 0L, 2L, 0L, 2L, 0L, 0L, 0L, 2L, 2L, 0L, 2L, 0L, 2L, 2L, 0L),
  Adverse = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L)),
  .Names = c("Pat.nr.", "OS", "Event", "Age", "Favorable", "Intermediate", "Adverse"),
  row.names = c(NA, 50L), class = "data.frame")

# rows & columns
n_row <- dim(example1)[1]
n_col <- dim(example1)[2]

# variables:
OS <- c(example1[1:n_row, 2])
Event <- c(example1[1:n_row, 3])
age <- c(example1[1:n_row, 4])
Favorable <- c(example1[1:n_row, 5])
Intermediate <- c(example1[1:n_row, 6])
Adverse <- c(example1[1:n_row, 7])

# dependent and independent variables
y <- Surv(OS, Event)
x <- cbind(age, Favorable, Intermediate, Adverse)
example <- data.frame(cbind(x, y))

# dependent and independent variables
y <- Surv(OS, Event)
x <- cbind(age, Favorable, Intermediate, Adverse)
example1 <- data.frame(cbind(x, y))

# coxphf with Firth's penalized likelihood
cox2 <- coxphf(data=example, y~x, firth=TRUE, pl=TRUE, maxit=1000)
summary(cox2)