Classification Trees: What They Are and an Example from a Clinical Domain
Outline
An Example (from Winston, Artificial Intelligence, 3rd ed.)
Beach Data
Name   Hair    Height  Weight  Lotion  Result
Sarah  blonde  avg     light   no      sunburned
Dana   blonde  tall    avg     yes     none
Alex   brown   short   avg     yes     none
Annie  blonde  short   avg     no      sunburned
Emily  red     avg     heavy   no      sunburned
Pete   brown   tall    heavy   no      none
John   brown   avg     heavy   no      none
Katie  blonde  short   light   yes     none
How to Classify New Cases?
Can Conveniently Arrange Tests in a Tree Format
Classification Tree: Definition
Classification Tree: Definition, continued
Classification Trees A.K.A.
Another Possible Tree for Beach Data (No Prior Sunburn Knowledge)
Occam’s Razor, Specialized to Classification Trees
How to Construct the Smallest Classification Tree?
Understanding the Disorder Formula
Example 1
Example 2
Extending Measure of Disorder in One Set to All Sets of a Branch
Deciding Amongst Possible Tests
Split on Height
Split on Weight
Split on Lotion
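The split comparison named in the slides above can be sketched in code. This outline does not reproduce the disorder formula itself, so the sketch assumes the usual form from Winston's text: the disorder of each branch is its entropy in bits, and a test's average disorder weights each branch by the fraction of cases it receives. The names `disorder` and `average_disorder` are illustrative, and the hair split is included for comparison with height, weight, and lotion.

```python
from collections import Counter, defaultdict
from math import log2

# Beach data from the Winston example: (name, hair, height, weight, lotion, result).
BEACH = [
    ("Sarah", "blonde", "avg", "light", "no", "sunburned"),
    ("Dana", "blonde", "tall", "avg", "yes", "none"),
    ("Alex", "brown", "short", "avg", "yes", "none"),
    ("Annie", "blonde", "short", "avg", "no", "sunburned"),
    ("Emily", "red", "avg", "heavy", "no", "sunburned"),
    ("Pete", "brown", "tall", "heavy", "no", "none"),
    ("John", "brown", "avg", "heavy", "no", "none"),
    ("Katie", "blonde", "short", "light", "yes", "none"),
]
COLUMN = {"hair": 1, "height": 2, "weight": 3, "lotion": 4}


def disorder(labels):
    """Entropy, in bits, of a list of class labels (assumed disorder measure)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def average_disorder(rows, attr):
    """Disorder of each branch of a test, weighted by the branch's share of cases."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[COLUMN[attr]]].append(row[-1])
    return sum(len(b) / len(rows) * disorder(b) for b in branches.values())


scores = {attr: round(average_disorder(BEACH, attr), 2) for attr in COLUMN}
print(scores)  # {'hair': 0.5, 'height': 0.69, 'weight': 0.94, 'lotion': 0.61}
```

Under these assumptions the hair test leaves the least disorder, so a greedy procedure would choose it first.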
Greedy Selection of Tests
Repeat Partitioning for Subsets Containing More than One Class
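Greedy selection followed by repeated partitioning can be sketched as a short recursive procedure. This is a minimal illustration under stated assumptions, not the lecture's own code: it assumes entropy as the disorder measure, stops when a subset contains a single class, and uses the hypothetical names `average_disorder` and `build`.

```python
from collections import Counter, defaultdict
from math import log2

ATTRS = ["hair", "height", "weight", "lotion"]
RAW = """\
blonde avg light no sunburned
blonde tall avg yes none
brown short avg yes none
blonde short avg no sunburned
red avg heavy no sunburned
brown tall heavy no none
brown avg heavy no none
blonde short light yes none"""
# One dict per beach case, in the same order as the data table.
ROWS = [dict(zip(ATTRS + ["result"], line.split())) for line in RAW.splitlines()]


def disorder(labels):
    """Entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def average_disorder(rows, attr):
    """Size-weighted disorder over the branches produced by testing attr."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[attr]].append(row["result"])
    return sum(len(b) / len(rows) * disorder(b) for b in branches.values())


def build(rows, attrs):
    """Pick the test with the least average disorder, then recurse on each subset."""
    labels = [row["result"] for row in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = min(attrs, key=lambda a: average_disorder(rows, a))
    remaining = [a for a in attrs if a != best]
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[best]].append(row)
    return {best: {value: build(sub, remaining) for value, sub in subsets.items()}}


tree = build(ROWS, ATTRS)
print(tree)
```

On the beach data this yields a tree that tests hair first and lotion only for the blonde cases.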
Overfitting Data
Tree Simplification
Tree Pruning
Tree Pruning, continued
Pruning Example (Quinlan, C4.5: Programs for Machine Learning, p. 38)
Predicting Error Rates
Error Rate Prediction Methods
Using a New Set of Cases
Using Only the Training Set from Which the Tree was Built
From Trees to Rules
Beach Example: Trees to Rules
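Turning a tree into rules amounts to writing one rule per root-to-leaf path. The sketch below assumes the beach tree that tests hair first and lotion for blondes (the tree a least-disorder procedure gives on the beach data; the slide's own figure is not reproduced in this outline). The nested-dict encoding and the name `tree_to_rules` are illustrative choices.

```python
# Assumed beach tree: {attribute: {value: subtree or class label}}.
TREE = {
    "hair": {
        "blonde": {"lotion": {"no": "sunburned", "yes": "none"}},
        "red": "sunburned",
        "brown": "none",
    }
}


def tree_to_rules(node, conditions=()):
    """Return one (conditions, result) pair for every path from root to leaf."""
    if not isinstance(node, dict):
        return [(conditions, node)]  # leaf reached: emit the accumulated tests
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, conditions + ((attr, value),)))
    return rules


rules = tree_to_rules(TREE)
for conds, result in rules:
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conds)
    print(f"IF {tests} THEN {result}")
```

Each printed rule can then be simplified independently, for example by dropping a condition that does not change the outcome on the training cases.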
Rule Simplification
Rule Simplification, continued
Application in a Clinical Domain
Motivation
Motivation, continued
Decision Aids for Diagnosis of MI
Previous Work
Limitations of Previous Work
Goal
Roadmap
Methods: Data Collection
Patient Attributes Collected
Listing of Patient Attributes
age                   smoker                  ex-smoker
family history of MI  diabetes                high blood pressure
lipids                retrosternal pain       chest pain major symptom
left chest pain       right chest pain        back pain
left arm pain         right arm pain          pain affected by breathing
postural pain         chest wall tenderness   sharp pain
tight pain            sweating                shortness of breath
nausea                vomiting                syncope
episodic pain         worsening of pain       duration of pain
previous angina       previous MI             pain worse than prev. angina
crackles              added heart sounds      hypoperfusion
heart rhythm          left vent. hypertrophy  left bundle branch block
ST elevation          new Q waves             right bundle branch block
ST depression         T wave changes          ST or T waves abnormal
old ischemia          old MI                  sex
Final Diagnosis
Tree Building: Splitting of Data
Tree Building: Specifics
Confidence Level
Tree Comparisons
Logistic Regression Model Building
Logistic Regression Comparisons
Performance Metrics
Sensitivity
Specificity
Positive Predictive Value
Accuracy
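The four metrics listed above all derive from the same 2x2 table of predicted versus final diagnosis. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # MI cases correctly called MI
        "specificity": tn / (tn + fp),  # non-MI cases correctly called not MI
        "ppv": tp / (tp + fp),          # calls of MI that were truly MI
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=40, fp=10, tn=140, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because PPV depends on how common MI is in the test set, two models with similar sensitivity and specificity can show very different PPV on different populations.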
Receiver Operating Characteristic (ROC) Curve
ROC Curves: Details
ROC Curves, continued
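An ROC curve can be traced by sweeping a decision threshold over a model's scores and recording the false-positive and true-positive rates at each step. The following is a minimal sketch with hypothetical scores and labels (1 = MI, 0 = not MI); `roc_points` and `roc_area` are illustrative names, and the area is computed with the trapezoid rule.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from lowering the threshold through every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def roc_area(points):
    """Area under the piecewise-linear curve through the points (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical model outputs for eight cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
area = roc_area(roc_points(scores, labels))
print(area)  # 0.8125
```

The area equals the fraction of (MI, non-MI) pairs in which the MI case receives the higher score, which is 13 of 16 pairs here.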
Results
ST elevation = 1: 1 (40.7/49.0 = 83.1%)
ST elevation = 0:
| New Q waves = 1: 1 (4.1/7.0 = 58.6%)
| New Q waves = 0:
| | ST depression = 0: 0 (329.4/345.0 = 95.5%)
| | ST depression = 1:
| | | Old ischemia = 1: 0 (3.2/6.0 = 53.3%)
| | | Old ischemia = 0:
| | | | Family history of MI = 1: 1 (6.8/11.0 = 61.8%)
| | | | Family history of MI = 0:
| | | | | age <= 61 : 1 (4.0/8.0 = 50.0%)
| | | | | age > 61 :
| | | | | | Duration of pain (hours) <= 2 : 0 (14.1/22.0 = 64.1%)
| | | | | | Duration of pain (hours) > 2 :
| | | | | | | T wave changes = 1: 1 (7.0/10.0 = 70.0%)
| | | | | | | T wave changes = 0:
| | | | | | | | Right arm pain = 1: 0 (3.4/5.0 = 68.0%)
| | | | | | | | Right arm pain = 0:
| | | | | | | | | Crackles = 0: 0 (3.0/8.0 = 37.5%)
| | | | | | | | | Crackles = 1: 1 (4.9/9.0 = 54.4%)
ST elevation = 1: MI
ST elevation = 0:
| New Q waves = 1: MI
| New Q waves = 0:
| | ST depression = 0: not MI
| | ST depression = 1:
| | | Old ischemia = 1: not MI
| | | Old ischemia = 0:
| | | | Family history of MI = 1: MI
| | | | Family history of MI = 0:
| | | | | age <= 61 years: MI
| | | | | age > 61 years:
| | | | | | Duration <= 2 hours: not MI
| | | | | | Duration > 2 hours:
| | | | | | | T wave changes = 1: MI
| | | | | | | T wave changes = 0:
| | | | | | | | Right arm pain = 1: not MI
| | | | | | | | Right arm pain = 0:
| | | | | | | | | Crackles = 0: not MI
| | | | | | | | | Crackles = 1: MI
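Because every test in the tree above has one branch that ends in a leaf, classifying a new patient reduces to a chain of checks in tree order. The sketch below is a direct transcription under stated assumptions: the dictionary keys are hypothetical names, binary findings are coded 0/1, age is in years, and duration is in hours.

```python
def ft_tree_classify(p):
    """Follow the FT tree from the root and return 'MI' or 'not MI'."""
    if p["st_elevation"]:
        return "MI"
    if p["new_q_waves"]:
        return "MI"
    if not p["st_depression"]:
        return "not MI"
    if p["old_ischemia"]:
        return "not MI"
    if p["family_history_mi"]:
        return "MI"
    if p["age"] <= 61:
        return "MI"
    if p["duration_hours"] <= 2:
        return "not MI"
    if p["t_wave_changes"]:
        return "MI"
    if p["right_arm_pain"]:
        return "not MI"
    return "MI" if p["crackles"] else "not MI"


# Hypothetical patient: no positive findings, age 70, 5 hours of pain.
patient = {
    "st_elevation": 0, "new_q_waves": 0, "st_depression": 0, "old_ischemia": 0,
    "family_history_mi": 0, "age": 70, "duration_hours": 5,
    "t_wave_changes": 0, "right_arm_pain": 0, "crackles": 0,
}
print(ft_tree_classify(patient))                          # not MI
print(ft_tree_classify({**patient, "st_elevation": 1}))   # MI
```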
STelev or Qwave = 1: MI
STelev or Qwave = 0:
| Duration >= 42 hr = 1:
| | STorTwave = 1: MI
| | STorTwave = 0: notMI
| Duration >= 42 = 0:
| | Shoulder,neck,arms = 1:
| | | LocalPressure = 1: notMI
| | | LocalPressure = 0:
| | | | age >= 40 = 1:
| | | | | PrevAngina = 1:
| | | | | | Duration >= 10 = 1: MI
| | | | | | Duration >= 10 = 0: notMI
| | | | | PrevAngina = 0:
| | | | | | LeftShoulder = 1: MI
| | | | | | LeftShoulder = 0:
| | | | | | | age >= 50 = 1: MI
| | | | | | | age >= 50 = 0: notMI
| | | | age >= 40 = 0: notMI
| | Shoulder,neck,arms = 0:
| | | PainWorse = 1: MI
| | | PainWorse = 0:
| | | | Diaphoresis = 1:
| | | | | age >= 70 = 1: MI
| | | | | age >= 70 = 0: notMI
| | | | Diaphoresis = 0: notMI
STchange = +2
STchange = normal
| ncpnitro = yes
| | chpainer = no
| | chpainer = yes
| | | s1 = arm,neck,shoulders
| | | s1 = SOB
| | | s1 = stomach
| | | | sex = male
| | | | sex = female
| | | s1 = pressure,pain,discomfort in chest
| | | | sex = female
| | | | sex = male
| | | | | age > 81.5 years
| | | | | age < 81.5 years
| | | | | | age < 45.5 years
| | | | | | age > 45.5 years
| ncpnitro = no
| | chpainer = no
| | chpainer = yes
| | | twave = normal
| | | twave = -1
| | | | sex = female
| | | | sex = male
| | | | | age < 73.5 years
| | | | | age > 73.5 years
STchange = -2
| ncpnitro = yes
| ncpnitro = no
| | systolic BP > 202 mmHg
| | systolic BP < 202 mmHg
| | | qwave = asmi
| | | qwave = normal
| | | | systolic BP > 178 mmHg
| | | | systolic BP < 178 mmHg
| | | | | age > 83.5 years
| | | | | age < 83.5 years
| | | | | | heart rate < 77 bpm
| | | | | | heart rate > 77 bpm
| | | | | | | heart rate < 89 bpm
| | | | | | | heart rate > 89 bpm
STchange = -1
| s1 = stomach
| s1 = rapid,skipping heartbeats
| s1 = pain in arms,neck,shoulders
| s1 = SOB
| s1 = fainted,dizzy,lightheaded
| | age > 74 years
| | age < 74 years
| | | hxmi = yes
| | | hxmi = no
| s1 = pressure,pain,discomfort in chest
| | heart rate > 131 bpm
| | heart rate < 131 bpm
| | | systolic BP > 197 mmHg
| | | systolic BP < 197 mmHg
| | | | heart rate < 111 bpm
| | | | heart rate > 111 bpm
STchange = -0.5
| ncpnitro = yes
| ncpnitro = no
STchange = flat
| ncpnitro = yes
| ncpnitro = no
STchange = +1
| age > 87.5 years
| age < 87.5 years
| | chpainer = yes
| | chpainer = no
| | | qwave = ami
| | | qwave = normal
| | | | heart rate < 69 bpm
| | | | heart rate > 69 bpm
Snapshot from the Long Tree
| | systolic BP > 202 mmHg
| | systolic BP < 202 mmHg
| | | qwave = asmi
| | | qwave = normal
| | | | systolic BP > 178 mmHg
| | | | systolic BP < 178 mmHg
| | | | | age > 83.5 years
| | | | | age < 83.5 years
| | | | | | heart rate < 77 bpm
| | | | | | heart rate > 77 bpm
| | | | | | | heart rate < 89 bpm
| | | | | | | heart rate > 89 bpm
Tree Attributes
Goldman: ST elevation or Q waves; Duration; ST or T wave; Shoulder, neck, arm; Age; Local pressure; Previous angina; Left shoulder; Pain worse; Diaphoresis
FT: ST elevation; New Q waves; Duration; T wave; Right arm; Age; ST depression; Old ischemia; Family history; Crackles
Long: ST change; Q waves; T wave; Arm, neck, shoulder; Age; Stomach pain; Fainted, dizzy, lightheaded; Systolic BP; Heart rate; Rapid/skipping beats; Chest pain; History of MI; Nitroglycerin use; Shortness of breath; Sex
Goldman, FT, and Long Trees: Performance of Each on Its Own Test Set
             Goldman  FT Tree  Long Tree
Sensitivity  90.9%    81.4%    66.1%
Specificity  69.7%    92.1%    85.8%
PPV          35.4%    72.9%    68.3%
Accuracy     73.1%    89.9%    80.1%
Goldman Tree vs. FT Tree on Edinburgh data, p < 0.0001
Goldman Tree vs. FT Tree on Sheffield data, p < 0.01
Logistic Regression Results
FT LR Equation Coefficients:
Constant        -2.14
ST elevation     2.96
New Q waves      2.00
ST depression    1.76
Crackles         0.807
Old ischemia    -0.86
Family history   0.43
Age             -0.016
Duration        -0.0046
T wave changes   0.805
Right arm pain  -0.22
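The FT LR coefficients above define a linear score that the logistic function maps to a probability of MI. The sketch below applies them to a hypothetical patient. It assumes binary findings are coded 0/1, age is in years, and duration is in hours, as in the FT tree; the slides do not state the coding used for the regression, so treat the output as illustrative only. The key names and `mi_probability` are hypothetical.

```python
import math

# FT LR coefficients as listed on the slide.
FT_LR = {
    "constant": -2.14,
    "st_elevation": 2.96,
    "new_q_waves": 2.00,
    "st_depression": 1.76,
    "crackles": 0.807,
    "old_ischemia": -0.86,
    "family_history": 0.43,
    "age": -0.016,
    "duration": -0.0046,
    "t_wave_changes": 0.805,
    "right_arm_pain": -0.22,
}


def mi_probability(patient):
    """Logistic model: p = 1 / (1 + exp(-(constant + sum(coef * value))))."""
    logit = FT_LR["constant"] + sum(
        coef * patient.get(name, 0)  # attributes not supplied count as 0
        for name, coef in FT_LR.items()
        if name != "constant"
    )
    return 1 / (1 + math.exp(-logit))


# Hypothetical patient: ST elevation, age 65, 3 hours of pain (assumed units).
p = mi_probability({"st_elevation": 1, "age": 65, "duration": 3})
print(f"{p:.3f}")
```

Choosing a probability cutoff turns this output into an MI / not MI call, and varying the cutoff traces the model's ROC curve.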
Comparison of LR Models
                       Kennedy  FT LR    Selker LR
Constant               -3.07    -2.14    -
ST elevation           3.16     2.96     -
New Q waves            1.37     2.00     -
ST depression          1.95     1.76     -
LV Failure (Crackles)  1.54     0.807    -
Old ischemia           -        -0.86    -
Family history of MI   -        0.43     -
Age                    -        -0.016   -
Duration               -        -0.0046  -
T wave                 -        0.805    -
Right arm              -        -0.22    -
Vomiting               0.68     -        -
Hypoperfusion          0.47     -        -
Chest pain #1 Sx       -        -        0.71
Chest pain/24h         -        -        1.00
T wave nl/flat         -        -        1.13
Nitroglycerin use      -        -        0.51
Previous MI            -        -        0.42
STchange nl/flat       -        -        0.77
STchange normal        -        -        0.83
LR Results: Comparison of ROC Areas
            FT LR  Kennedy
Edinburgh:  94%    94%      p = 0.50
Sheffield:  89%    91%      p = 0.17
(ROC curve area for Selker LR model = 89%)
ROC Curves for Trees vs. LR on Edinburgh data
ROC Curves for Trees vs. LR on Sheffield data
Trees vs. Logistic Regression
Model:        Edinburgh:  Sheffield:
FT Tree       94%         90%
Goldman Tree  84%         84%
FT LR         94%         89%
Kennedy LR    94%         91%
- Differences between FT Tree and Kennedy LR not significant (p = 0.41 Edinburgh; p = 0.17 Sheffield)
Discussion
Additional Benefits of Classification Trees
Clinical Benefits
Future Work
Acknowledgments
Email: chris@medg.lcs.mit.edu
Home Page: http://medg.lcs.mit.edu/people/
Other information: Medical Computing Class 2/19/98 Lecture Slides