Olivia Hemond
March 1, 2024

library(tidyverse)   # data wrangling and plotting
library(tidymodels)  # modeling framework (parsnip, rsample, broom, etc.)
library(here)        # project-relative file paths
library(cowplot)     # arranging multiple ggplots
library(knitr)       # kable tables
library(kableExtra)  # table styling
This dataset describes the survival and growth of two species of palmetto (Serenoa repens and Sabal etonia), as measured at Archbold Biological Station in Florida. Both species are types of fan palm. The scientists studying these species measured plant height, canopy length and width, number of green leaves, and other characteristics of the plants’ growth. Data were collected annually from 1981 - 1997, and then again in 2001 and in 2017.
The purpose of this analysis is to test whether measurements of height, canopy length, canopy width, and number of green leaves can be used to classify an unknown plant as either S. repens or S. etonia. I used binary logistic regression to create the classification model.
My analytical process went as follows:

1. Obtain and tidy data
2. Explore and visualize data
3. Compare two binary logistic regression models
   a. Define the two model formulas
   b. Use ten-fold cross validation to fit the models
   c. Compare the predictive performance of the two models
4. Train selected model
5. Present classification results
   a. Generate species predictions based on best model
   b. Calculate the number of correct and incorrect predictions for each species
   c. Calculate the percent of correct predictions for each species

See the corresponding sections below for more detailed descriptions of each step.
Data source: Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017, ongoing at 5-year intervals ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/f2f96ec76fbbd4b9db431c79a770c4d5
I removed unnecessary columns and assigned factor levels to the two species ID numbers: species 1 = S. repens and species 2 = S. etonia.
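The cleaning code is not shown above; here is a minimal sketch of what it might look like, assuming (hypothetically) a raw file named palmetto.csv in a data/ folder, and keeping only the columns used later in the analysis:

# hypothetical file path; adjust to the actual data location
palmetto <- read_csv(here("data", "palmetto.csv"))

p_clean <- palmetto %>%
  select(species, height, length, width, green_lvs) %>%  # drop unneeded columns
  mutate(species = as_factor(species))                   # 1 = S. repens, 2 = S. etonia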
I am interested in whether height, width, length, and number of green leaves are useful variables for differentiating between the two species. To be good predictors, these variables should differ notably between the species. I decided to explore all four using boxplots.
height_plot <- ggplot(data = p_clean) +
  geom_boxplot(aes(x = as_factor(species), y = height), fill = "olivedrab3", color = "olivedrab") +
  scale_x_discrete(labels = c("1" = "S. repens", "2" = "S. etonia")) +
  labs(x = "", y = "Height (cm)") +
  theme_minimal() +
  theme(axis.text.x = element_text(face = "italic"))

width_plot <- ggplot(data = p_clean) +
  geom_boxplot(aes(x = as_factor(species), y = width), fill = "olivedrab3", color = "olivedrab") +
  scale_x_discrete(labels = c("1" = "S. repens", "2" = "S. etonia")) +
  labs(x = "", y = "Width (cm)") +
  theme_minimal() +
  theme(axis.text.x = element_text(face = "italic"))

length_plot <- ggplot(data = p_clean) +
  geom_boxplot(aes(x = as_factor(species), y = length), fill = "olivedrab3", color = "olivedrab") +
  scale_x_discrete(labels = c("1" = "S. repens", "2" = "S. etonia")) +
  labs(x = "", y = "Length (cm)") +
  theme_minimal() +
  theme(axis.text.x = element_text(face = "italic"))

leaves_plot <- ggplot(data = p_clean) +
  geom_boxplot(aes(x = as_factor(species), y = green_lvs), fill = "olivedrab3", color = "olivedrab") +
  scale_x_discrete(labels = c("1" = "S. repens", "2" = "S. etonia")) +
  labs(x = "", y = "Number of Green Leaves") +
  theme_minimal() +
  theme(axis.text.x = element_text(face = "italic"))
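The rendered figure is not reproduced here. Since cowplot is loaded, the four panels were presumably assembled into a single compound figure; a minimal sketch using plot_grid():

# combine the four boxplots into a 2x2 compound figure
plot_grid(height_plot, width_plot, length_plot, leaves_plot,
          ncol = 2, labels = "auto")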
Takeaways:
Plant heights between the two species look fairly similar, so height may not be a good predictor of species. There may be some difference in plant widths, and even more noticeable differences in length and number of green leaves. I expect length and green leaves to be especially important predictors, since they seem to show the largest differences between species.
I am interested in comparing two models. One model predicts species based on height, length, width, and green leaves (Model 1). The other model predicts species based only on height, width, and green leaves (Model 2). The difference between the two is whether length is used as a predictor variable.
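The two formulas referenced by the workflow code below can be written as standard R formulas (variable names follow the plotting code above):

f1 <- species ~ height + length + width + green_lvs  # Model 1: all four predictors
f2 <- species ~ height + width + green_lvs           # Model 2: length omitted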
Now that the two model formulas are defined, it is time to see which performs better at classification. I decided to use ten-fold cross validation to repeatedly fit the models to ten different subsets of the data, and then I extracted the average performance metrics from each model.
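The folded data object p_folds is not shown above; given that the metrics tables below report n = 100, the resampling was presumably ten folds repeated ten times. A minimal sketch (the seed value and stratification by species are assumptions):

set.seed(123)  # assumed seed; any fixed value gives reproducibility
p_folds <- vfold_cv(p_clean, v = 10, repeats = 10, strata = species)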
# set up model
blr_mdl <- logistic_reg() %>%
  set_engine('glm')

# create workflows
blr_wf_1 <- workflow() %>%
  add_model(blr_mdl) %>%
  add_formula(f1)

blr_wf_2 <- workflow() %>%
  add_model(blr_mdl) %>%
  add_formula(f2)

# apply the workflows to the folded data
blr_fit_folds_1 <- blr_wf_1 %>%
  fit_resamples(p_folds)

blr_fit_folds_2 <- blr_wf_2 %>%
  fit_resamples(p_folds)
# average performance metrics for Model 1
collect_metrics(blr_fit_folds_1) %>%
  select(-.config) %>%
  rename(metric = .metric,
         estimator = .estimator,
         standard_error = std_err) %>%
  kbl() %>%
  kable_styling("basic", position = "center")

# average performance metrics for Model 2
collect_metrics(blr_fit_folds_2) %>%
  select(-.config) %>%
  rename(metric = .metric,
         estimator = .estimator,
         standard_error = std_err) %>%
  kbl() %>%
  kable_styling("basic", position = "center")
Model 1:

metric | estimator | mean | n | standard_error |
---|---|---|---|---|
accuracy | binary | 0.9168905 | 100 | 0.0007788 |
roc_auc | binary | 0.9725080 | 100 | 0.0003592 |

Model 2:

metric | estimator | mean | n | standard_error |
---|---|---|---|---|
accuracy | binary | 0.8988665 | 100 | 0.0008640 |
roc_auc | binary | 0.9634580 | 100 | 0.0004167 |
Results:
Based on the results of the cross validation, I am choosing Model 1 as the better model. Model 1 has a greater area under the receiver operating characteristic (ROC) curve than Model 2 (0.9725 compared with 0.9635), and the model with the greater area under the curve is the better classifier. In addition, Model 1 had a slightly higher accuracy rate than Model 2 (91.7% compared with 89.9%). Though Model 1 won on both metrics, it is worth noting that both models performed very well.
I next trained Model 1 using the entire dataset (without any folding) to obtain my final coefficient results for each predictor variable (shown in the “estimate” column of the table below).
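A sketch of how that final fit and the coefficient table might be produced, reusing the model specification and formula defined above (tidy() comes from broom, which loads with tidymodels):

# fit Model 1 to the full dataset
blr_fit_1 <- blr_mdl %>%
  fit(f1, data = p_clean)

# extract and format the coefficient estimates
blr_fit_1 %>%
  tidy() %>%
  select(term, estimate, std.error, p.value) %>%
  kbl() %>%
  kable_styling("basic", position = "center")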
term | estimate | std.error | p.value |
---|---|---|---|
(Intercept) | 3.2266851 | 0.1420708 | 0 |
height | -0.0292173 | 0.0023061 | 0 |
length | 0.0458233 | 0.0018661 | 0 |
width | 0.0394434 | 0.0021000 | 0 |
green_lvs | -1.9084747 | 0.0388634 | 0 |
Using my finalized Model 1, I evaluated its predictive strength. I generated predicted species classifications for each observed set of plant height, width, length, and green leaves in the dataset, then compared these predictions to the actual species identities. Predictions used a 50% cutoff, meaning that a plant was classified as species 1 if its predicted probability of being species 1 was 50% or greater.
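A sketch of how those predictions and per-species counts might be computed, assuming the fitted model object from the previous step (augment() appends a .pred_class column using the default 50% probability cutoff):

# attach predicted classes to the observed data
p_predict <- blr_fit_1 %>%
  augment(new_data = p_clean)

# tally correct and incorrect predictions by true species
p_predict %>%
  group_by(species) %>%
  summarize(n_correct = sum(.pred_class == species),    # predictions matching the true species
            n_incorrect = sum(.pred_class != species),  # misclassifications
            p_correct = n_correct / (n_correct + n_incorrect))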
species | n_correct | n_incorrect | p_correct |
---|---|---|---|
1 (S. repens) | 5548 | 564 | 0.9077225 |
2 (S. etonia) | 5701 | 454 | 0.9262388 |
Conclusion:
My fitted Model 1 classified observations of species 1 (S. repens) correctly 91% of the time and observations of species 2 (S. etonia) correctly 93% of the time. While the model is not perfect, its accuracy is very high. It could potentially be improved if other strong predictor variables were found and included.