ML_stats
===
[00:00:00] So, hey everyone, I'm one of the surgical education fellows. Welcome back to Artificial Intelligence for the Clinician here on Behind the Knife. This episode is gonna be about statistics, and specifically how to interpret and read the results of a machine learning study.
This will be technical, but hopefully fun and relevant to modern surgery and medicine. New tools will incorporate AI, and they already do, like radiology interpreters, and we have to be able to understand the evidence to evaluate them and use them, just like we do for every other diagnostic tool and test.
So to talk about this topic, we're here today with a couple of the fantastic team members at OHSU. First, I'm gonna welcome Phil Jenkins, a general surgery resident and National Library of Medicine postdoc. Phil, welcome back.
Thanks. Glad to be back. Looking forward to this one.
Joining us for the first time is Shelby Willis, another general surgery resident at OHSU.
Thanks. Excited to be joining the series.
And [00:01:00] we also have Dr. Julie Dover, a cardiac surgeon and informatician at OHSU.
It is a pleasure to be here. I think this is one of the most misunderstood, yet most powerful topics in clinical AI. And until medical schools start consistently teaching interpretation of machine learning studies, clinicians will be ill-equipped to interpret the rapidly evolving science that is coming out.
A hundred percent agree. And lastly, we have once again, Dr. Steven Bedrick, who's a machine learning researcher and associate professor of biomedical informatics.
Happy to be here. This is a fun one.
All right. Well, Phil, Dr. Bedrick, everyone else, let's get right into it. Phil, let's just start off with: why can't machine learning models just report an odds ratio, like a good old-fashioned regression?
Why do we have to have this discussion?
So at a basic level, machine learning models allow us to look at more complex relationships between variables, and at more variables. [00:02:00] Instead of one predictor and one outcome, machine learning models can handle hundreds or thousands of variables interacting in non-linear ways.
Taking it one step further, machine learning even supports combinations of algorithms, like in stacked regressions or ensemble models. That's why, unlike logistic regression, you usually don't get a single odds ratio that neatly summarizes things.
Okay, so no odds ratios, but Dr. Bedrick, can you explain a little bit more?
Well, the reason we can get a nice, interpretable odds ratio out of logistic regression is because of some of the underlying math behind how logistic regression works.
It gets a little bit detailed, but it's important to remember that logistic regression is just another kind of machine learning model, just one that we're more familiar with. And we use it a lot in medicine because it does let us explore the relationship between independent variables, or what in machine learning we might call features, and our dependent variable of interest.
And that's really what we're doing when we get an odds ratio from logistic regression: we're exploring that [00:03:00] relationship. Now, there's a trade-off to this, though. Machine learning is always about trade-offs, and in order for logistic regression to be easy to compute, easy to work with, and easy to interpret, there are limits to how complicated a relationship it can model between the predictors and the response variables.
The other thing about logistic regression is that when we have a very large number of predictor variables, or a large number of features, logistic regression by itself can really struggle. Now, there are a lot of other machine learning models out there that make different trade-offs and have different benefits.
So, for example, a very popular family of machine learning model is called a support vector machine. And on the surface it looks sort of like fancy logistic regression, except that it can handle, you know, more robust predictions over larger numbers of features. But to use it, we have to give two things up.
We have to give up a lot of interpretability, and we also have to give up some computational simplicity. The underlying equations behind the support vector machine aren't really doing things the same way as logistic [00:04:00] regression, and so the internals don't map so nicely onto something like an odds ratio.
We can't really interpret them in the same way. People have worked out different ways to explore the relationship between the parameters of an SVM and the outcome variable, but it's not nearly as straightforward as good old logistic regression.
And other times, machine learning models involve just too many moving parts to easily disentangle how they all fit together. Models like this are sometimes referred to as black box models, since their behavior can be very opaque. This is especially the case with models that use neural networks, like large language models such as ChatGPT, or the current kinds of image processing models that we use to analyze chest X-rays.
But because this is such a big need, being able to understand the relationship between our features and our outcome, there's a whole body of research about how to interpret models that are otherwise very hard to interpret. Such methods tend to, you know, still not get us to a simple odds ratio, but they sometimes give us useful things that are helpful for decision [00:05:00] making, sometimes even more than an odds ratio, and they can be useful in their own way.
And there are a lot of different statistical methods out there that we can use to ask questions of black box models, and we'll be talking about some of those a little bit later.
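For listeners who want to see this contrast concretely, here is a minimal Python sketch, not from the episode and using a synthetic dataset rather than any real cohort: logistic regression coefficients exponentiate directly into odds ratios, while a fitted support vector machine exposes no analogous per-feature quantity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical toy cohort: 500 "patients", 5 features, one binary outcome.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X, y)
odds_ratios = np.exp(logit.coef_[0])        # exponentiated coefficients = odds ratios
print("Odds ratio per feature:", odds_ratios.round(2))

svm = SVC(kernel="rbf").fit(X, y)
# The fitted SVM has support vectors and kernel parameters, but nothing that maps
# directly onto an odds ratio for an individual feature.
print("Support vectors per class:", svm.n_support_)
```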
Yeah. And for example, like some of the stuff that I work on, there are even models where the goal isn't prediction at all. Phenotyping models group patients based on similar patterns or characteristics.
You're not trying to predict sepsis per se, but you're trying to identify subgroups of septic patients. And the evaluation metrics for phenotyping are completely different from those used in predictive modeling, and that's key: machine learning can often be hypothesis generating instead of hypothesis testing.
So its evaluation isn't the same as testing an association in traditional statistics.
Thank you, that's a great recap. Before we jump into some specifics on how we talk about model evaluation, can you give one very straightforward example of where logistic regression is just [00:06:00] not good enough for the data, maybe something easier than a chest X-ray?
When we're doing, I'd say, time series prediction, where we have repeated measurements over a period of time, right? And we need to forecast what's going to happen next, or the likelihood of something happening. You know, logistic regression is not really set up for that sort of prediction problem.
Because, in that sort of prediction problem, all of the inputs are correlated with each other. Right, because one follows after the other. That's sort of the point. And logistic regression assumes that its predictors are uncorrelated.
Yeah, no, thank you. This is all really important to understand for what we're talking about. And when we talk about model evaluation, one of the first things we see when we read these papers is that the data gets split between training sets and test sets.
And Shelby, what does that mean?
The basic setup is that you split your data into training and testing sets. The model learns on [00:07:00] one part, and then it's evaluated on the test set, which the model hasn't seen before, to see how it performs on new data. You can also use things like temporal validation, training on data from 2020 and testing it on 2021 data, to simulate real-world situations.
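Here is a minimal Python sketch of what Shelby just described, on a made-up cohort; the column names, years, and split sizes are purely illustrative assumptions. It shows a random, stratified train/test split and a temporal split that trains on 2020 patients and tests on 2021 patients.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cohort: 1,000 patients, two features, a year, and a binary outcome.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 1000),
    "lactate": rng.normal(2.0, 1.0, 1000).round(1),
    "year": rng.choice([2020, 2021], 1000),
    "outcome": rng.integers(0, 2, 1000),
})

# Random 80/20 split, stratified so the outcome rate is similar in both sets.
X, y = df[["age", "lactate"]], df["outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Temporal validation: learn on earlier patients, evaluate on later ones.
train_df = df[df["year"] == 2020]
test_df = df[df["year"] == 2021]
print(len(X_train), len(X_test), len(train_df), len(test_df))
```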
Right. Once you have a trained model, we need to evaluate its performance. That's where we get into performance metrics. Dr. Dover, maybe we should start with something familiar like sensitivity and specificity.
Yeah. So let's start with some terms. We all know sensitivity and specificity. Sensitivity refers to the likelihood of a positive test, given the true state is positive. And specificity refers to the likelihood that if the true state is negative, the test is negative as well.
Consider when we evaluate a sick patient in the emergency room and order a lactate level. Lactate is very sensitive but very non-specific; it could be elevated in a variety of conditions: [00:08:00] dehydration, cardiogenic shock, dead gut. Now let's consider the finding of free air under the diaphragm on X-ray.
Assuming it's a non-postoperative state, air seen under the diaphragm is specific to a perforated viscus, but it's not highly sensitive to where it originated. So now that we have those concepts, let's stretch into some terms that aren't deeply familiar to everybody. Dr. Bedrick, what is an area under the curve?
The area under the curve metric is one of many different ways that we can evaluate how well a binary classifier is working. It's built on top of sensitivity and specificity, but goes one level further. Binary classifiers usually involve making a choice about a threshold, right? How certain does the algorithm need to be before we accept an answer?
Depending on how we tune that threshold, we can make the model more or less sensitive, and that in turn will have an effect on its [00:09:00] specificity. So those two numbers play off of each other and depend on how we set that threshold. One way to evaluate the behavior of a classifier model is to observe how accurate it is across a wide range of thresholds and then plot that as a curve.
This is called a receiver operating characteristic curve. The way we make one is we vary the threshold from all the way turned down to cranked up to 11, and at each step we measure the sensitivity and the specificity of our classifier. That gives us a nice plotted curve, and the shape of the resulting curve tells us something about how robust the classifier is.
If the curve is super flat, like if it's sort of following the diagonal of the plot, then that indicates worse performance. The higher and tighter the curve, the closer it is to square, the better. So we can actually quantify this shape by taking the area under that curve.
That's sometimes called the AUC, or the ROC AUC, because it's the area under the receiver operating characteristic curve. [00:10:00] The closer that number is to one, the better the classifier, and the closer it is to 0.5, the closer it is to a coin-flipping, random-choice classifier. So this is a common way to quickly quantify the behavior of a classifier so as to compare it to another one evaluated on the same data.
And there's a whole range of statistical methods out there for doing things like computing confidence intervals, or comparing two AUC scores and getting a p-value out. Now, one thing to keep in mind is that ROC curves like this work best, meaning they're the most informative, when the ratio of positive and negative examples is relatively close to even.
In other words, when your data's not very imbalanced. Just like with diagnostic tests for rare conditions, we need to be more careful about what metrics we use. After all, if the prevalence of our condition of interest in the population is 5%, we can get a very accurate test.
We can get 95% accuracy by just predicting no disease every time. But that's useless, because we missed the 5% we actually care about. [00:11:00] So in practice, we also look at measures like precision and recall. Shelby, why don't you tell us about precision and recall?
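Here is a minimal Python sketch of the ROC idea Dr. Bedrick just walked through, on synthetic data rather than any real study: sweep the decision threshold, get a sensitivity/specificity pair at each step, and summarize the resulting curve as a single area.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]       # predicted probability of the positive class

# One (false positive rate, true positive rate) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_te, probs)
auc = roc_auc_score(y_te, probs)              # area under that curve
print(f"ROC AUC = {auc:.3f}  (1.0 = perfect, 0.5 = coin flip)")
```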
Right. Precision refers to the proportion of cases the model predicts as positive that are actually true positives.
In other words, it's equivalent to the positive predictive value of a diagnostic test. Recall, also known as sensitivity, measures the proportion of actual positives that the model successfully identifies. For example, imagine a model that always predicts no disease. If the disease is rare, it might still achieve 95% accuracy, but its recall would be extremely poor.
An F1 score combines precision and recall, right?
Exactly. The F1 score is the harmonic mean of precision and recall. It's the happy place if you care equally about both false positives and false negatives. It gives a single number that balances both.
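As a concrete illustration of these metrics all coming off the same two-by-two table, here is a minimal Python sketch with made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # four patients truly have the disease
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # the model flags four patients

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)

print("Precision (PPV):     ", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall (sensitivity):", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 (harmonic mean):  ", f1_score(y_true, y_pred))
```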
To get more formal, the F score that we've talked about here equally weights precision and recall, but there's a generalization that lets us tune [00:12:00] that.
So if you have a situation where, you know, it's very important that you not miss any true positive cases, and you're willing to accept a higher false positive rate, you can tune your F score to reflect that in your evaluation. And that can be useful in practice, although we don't usually do that.
It's usually an even weighting. And by now you've probably noticed that we've been throwing around a lot of metrics. I think we're up to maybe six or seven, and they're all computed from the same two-by-two table that you would've learned about in epi class. There are many more that we haven't mentioned.
The main thing to know about all of them is that there's no single metric on its own that tells you everything you need to know about how your algorithm is performing. Just like for a diagnostic test, we always talk about them in terms of sensitivity, specificity, and positive predictive value.
It's not very common to only talk about one of those metrics. We do turn to F1 because it is informative, and people really do like having a single number to argue about, so F1 is frequently used as that number. But it's important to remember that it's never giving the whole picture. And your neighborhood statistician or [00:13:00] machine learning expert almost certainly has a favorite obscure metric that no one's ever heard of that, to them, is the right metric for every use case.
Yeah, I know somebody who swears by the Matthews correlation coefficient, and he's the only person who's ever used it. So a fun party game is asking people in a room what their favorite classifier accuracy metric is, and then watching the fun. We have good parties in statistics.
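For the curious, here is a minimal Python sketch of the weighted F-score Dr. Bedrick mentioned, alongside the Matthews correlation coefficient; the labels are made up and the beta values are just illustrative, not recommendations.

```python
from sklearn.metrics import fbeta_score, matthews_corrcoef

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print("F1   (beta = 1):  ", fbeta_score(y_true, y_pred, beta=1))    # even weighting
print("F2   (beta = 2):  ", fbeta_score(y_true, y_pred, beta=2))    # favors recall
print("F0.5 (beta = 0.5):", fbeta_score(y_true, y_pred, beta=0.5))  # favors precision
print("Matthews corr.:   ", matthews_corrcoef(y_true, y_pred))
```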
Yeah, I was a math major and we had many math party tricks. Ours were probably just as exciting. I will say, even now when I do some studies, it's not uncommon that I'll write out a two-by-two table and, even then, try to figure out as I'm reading a paper what exactly they're talking about.
But thank you all for the fantastic overview, because even though these are basic terms that a lot of us are familiar with, it's important to understand them, because they come into play in every single machine learning paper that we read. And the ones that you think you don't know, you can probably get to with that two-by-two table.
And for those that don't know what we're talking about, it's the [00:14:00] true positive, true negative, false positive, false negative square. Now, these terms will get you to where you can read most of these results. But next, what we're gonna do is go into some advanced metrics and some deeper concepts, which get a little bit more complicated.
So stay tuned if you're interested in that. Phil, let's start by saying that when we look at a traditional regression, it's really easy for me to tell which variables are important. The odds ratio is very intuitive. For example, you take the odds ratio of developing lung cancer in active smokers, and you hear that it's 20; it's really not hard for most people to interpret that.
And most people can kind of think of things that way. What's the analog in machine learning, Phil? How do we think about what's the most important variable in a machine learning model?
Yeah. When building a machine learning model, one of the first things we do is decide which variables we want the model to pay attention to. These variables are called features, as we've said before. In healthcare, features might be things like age, [00:15:00] blood pressure, or lab values. Once the model is trained, we often want to understand which of these features it relies on the most to make its predictions.
That's where feature importance comes in. It tells us which variables are influencing the model's decisions the most. For simpler models like logistic regression, it might look like an odds ratio, like you said, but for more complex models we need more advanced tools like SHAP values, that's SHapley Additive exPlanations. SHAP assigns an
importance value to each feature for each individual prediction. So instead of just knowing a variable is generally important, SHAP can show us that for this specific patient, these particular variables pushed the model toward predicting condition X.
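For readers who want to see what this looks like in code, here is a minimal sketch using the open-source shap package on a tree-based model; the data are synthetic rather than a clinical cohort, and the particular model choice is just an assumption for illustration.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic "cohort": 500 patients, 8 features, one binary outcome.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)    # efficient SHAP explainer for tree ensembles
shap_values = explainer.shap_values(X)   # one importance value per feature, per prediction

# Global summary: which features push predictions the most across the whole cohort.
# (This is the "tornado"-style plot mentioned later in the episode.)
shap.summary_plot(shap_values, X)
```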
Thanks, Phil. Now, Shelby, what are some challenges that we're faced with when using real-world data collected at a lot of academic centers?
When working with machine learning, especially in medicine, one [00:16:00] challenge we often face is having a small dataset with limited data. It's easy for a model to perform well on one set of patients but fail on new ones, because it hasn't seen enough variety.
That's why techniques like k-fold cross-validation are so important. Instead of splitting the data just once into training and testing sets, k-fold cross-validation splits it into multiple parts. The model trains and tests across these different subsets, helping us get a more reliable sense of its true performance, even with fewer cases.
Another technique is bootstrap resampling. This involves taking repeated samples from your dataset with replacement to estimate things like confidence intervals for model metrics. Cross-validation and bootstrap resampling help make the most out of a small dataset by allowing you to reuse the same data in structured ways to build more trustworthy models.
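Here is a minimal Python sketch of both ideas on synthetic data; the number of folds and bootstrap resamples are just illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: five different train/test splits, five AUC estimates.
cv_aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("Cross-validated AUCs:", cv_aucs.round(3), "mean:", cv_aucs.mean().round(3))

# Bootstrap: resample the held-out set with replacement to get a 95% CI for the AUC.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probs = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

rng = np.random.default_rng(0)
boot_aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_te), len(y_te))   # sample test indices with replacement
    if len(np.unique(y_te[idx])) < 2:             # skip resamples with only one class
        continue
    boot_aucs.append(roc_auc_score(y_te[idx], probs[idx]))
lo, hi = np.percentile(boot_aucs, [2.5, 97.5])
print(f"Bootstrap 95% CI for AUC: ({lo:.3f}, {hi:.3f})")
```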
Now, another way to think about that is that it's just another set of methods for understanding how stable our model is and getting a more robust statistical understanding of how accurate it might really be, you [00:17:00] know, with real data. Another issue that can come up with datasets we work with in medicine is that their small size can be a little bit misleading, in that we might have a fairly small number of patients in a cohort, but we might have a very large amount of data about all of those patients.
So imagine if you tried to fit all of your patients' data from, you know, a heart surgery into an Excel spreadsheet. You'd have one row for each patient, and then you might have an awful lot of columns for all the medications and diagnoses and procedures and who even knows what else. Many machine learning models just don't work too well in this situation, when we've got what we would call a very wide dataset as opposed to a taller dataset.
In terms of, you know, way more variables than we have examples. One of the many ways to deal with this kind of scenario is to use statistical procedures to try to identify a subset of your variables, to get it back into a ratio of examples to variables that you can work with.
And you might do that by using a procedure to iteratively try different ones, pick the important ones, and discard the unimportant ones. [00:18:00] Another option is to get rid of all the variables and replace them with a small number of new and better ones. This is sometimes called dimensionality reduction, which refers to a whole family of techniques that look at your whole dataset with all of its predictor variables and try to find a new set of variables that keeps as much of the same information as possible
but is smaller. If you've ever used principal component analysis, or PCA, that's a simple example of this kind of algorithm. It can accurately capture certain kinds of relationships among a large number of variables and computes a new set of variables, which it calls components, and they each represent different parts of the variance in your original data.
Now, the cool thing about PCA is those components are sorted by how much of the variance they capture, which means that you can pick the first couple and now you've made an approximation of your larger dataset, but instead of having hundreds of columns, now you have four or five or ten.
This will make your spreadsheet, if you will, a lot taller than it is wide. And now that'll let you [00:19:00] get back into a world of being able to use some of the simpler models that you might've wanted to use in the first place, but that couldn't handle that situation. But remember, machine learning's all about trade-offs.
By doing this, you've given up some of the information that was in your dataset, and that might affect how well the model works. And that impact might not be evenly distributed across your input examples, right? There could be some patients that had uncommon things happening that maybe fell out from the dimensionality reduction.
So you always want to check on that kind of thing. There's a whole lot of different algorithms like this to choose from. You may have heard of t-SNE or UMAP, but there are many others. They all make different kinds of trade-offs and are used in different kinds of situations. And as a fun dimensionality reduction fact: if you've ever saved a JPEG,
listened to an MP3, or watched a YouTube video, you've seen dimensionality reduction in action. The data compression that those systems use is doing the same thing as any dimensionality reduction algorithm, in that they're taking a big, complicated dataset and identifying an informative subset that can be kept around. [00:20:00]
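Here is a minimal Python sketch of the "wide dataset" problem and PCA as one fix; the numbers (50 rows, 200 columns, 10 components) are arbitrary, and purely random data won't show the dramatic concentration of variance you would see in real, structured data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_wide = rng.normal(size=(50, 200))   # 50 "patients", 200 variables: far wider than it is tall

pca = PCA(n_components=10).fit(X_wide)
X_reduced = pca.transform(X_wide)     # now 50 rows by 10 components

print("Shape before:", X_wide.shape, "after:", X_reduced.shape)
# Components come sorted by how much of the original variance they capture:
print("Variance explained per component:", pca.explained_variance_ratio_.round(3))
```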
Thank you for that overview. And I just wanna point out that when you're reading a paper, when we were talking about feature importance and SHAP values, what that might look like in the paper is a graph that almost looks like a tornado diagram. You'll see these bars next to each feature, and you can interpret the relative size of those bars as how important that feature is.
So when you see that graph in a machine learning paper, that's what you're looking at. Now, about the dimensionality reduction: those graphs are a little bit more complicated. You may have seen a UMAP graph before, and it looks like a bunch of very fancy squares, and we won't get too much into how to interpret it.
But the point is just that what they're showing is how they're reducing these dimensions of data. Now, I think, just to wrap up with the usual question for everybody: what should we be concerned about or cautious about when it comes to reading and interpreting machine learning and AI studies, especially in a clinical [00:21:00] setting?
First, overfitting: a model might look great on training data but perform terribly on new patients. That's why external validation is so important.
Secondly would be interpretability. If you can't understand why a model is making predictions, it's harder to trust or act on it, especially in medicine.
And then third, calibration. This means whether the predicted probabilities match reality. If your model says 80% chance of readmission, it should actually happen 80% of the time.
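Here is a minimal Python sketch of a calibration check on synthetic data: bin the predicted probabilities and compare each bin's mean predicted risk to the observed event rate. In a well-calibrated model, those two track each other.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# Observed event rate vs. mean predicted probability, in 10 probability bins.
obs_rate, mean_pred = calibration_curve(y_te, probs, n_bins=10)
for p, o in zip(mean_pred, obs_rate):
    print(f"predicted ~{p:.2f} -> observed {o:.2f}")   # close agreement = well calibrated
```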
Many models fail here, so it's very important to ask questions about this issue when someone's trying to tell you that they have an amazingly accurate model. And I'll also definitely second Shelby's call for external validation. Just because it worked on their patient population doesn't mean it'll work on your patient population.
And a final critical warning is that real life has infinitely more variables than what can be captured in these models. So, as we have said numerous times in this series already, the decision to act on the study findings lies firmly in the hands of the clinicians directly working with the [00:22:00] patients.
I think one additional pitfall I'll add is that you always have to make sure you understand who the model was trained on in terms of the patient population, and you have to understand what outcomes it predicts.
My favorite example of that is radiographic machine learning models especially. They usually don't try to predict every single possible outcome, and you have to understand the few that they're actually trying to tell you about. And so if there's one part of the machine learning study you should definitely read,
it's: what did we try to predict, and who did we predict it on? Phil, what are your final comments on the take-home message?
I mean, my big takeaway is that machine learning is a really powerful tool, but like any other clinical tool, we need to be able to evaluate it carefully, and hopefully we helped you guys be able to do that today.
Exactly. And clinicians don't need to become data scientists, but we do need to understand enough to ask the right [00:23:00] questions.
All right, well, that is it for this episode of Artificial Intelligence for the Clinician. Thank you to everybody: Phil, Shelby, Dr.
Dover, and Dr. Bedrick. Next time we'll tackle ethical questions and bias in AI models, which will be a much more fun and less technical discussion. Until then, from Behind the Knife: dominate the day.