Project 1: Exploratory Data Analysis

April 4, 2021   

NAME: Ayanna Fisher EID: adf2353 DUE: 04/04/2021


INTRODUCTION
For this project I wanted to see if there was a relationship between the condition of roads/infrastructure of each state and their corresponding GDP status for each year. The infrastructure data frame used in this assignment comes from the U.S. Bureau of Transportation Statistics –> https://www.bts.gov/road-condition . It originally had 5 columns, listing the state, international roughness index, class, year, and total miles. I expect to tidy this up by taking out the IRI column and use pivot wider on the class column. The GDP dataset was sourced from the U.S. Department of Commerce –> https://apps.bea.gov/iTable/iTable.cfm?reqid=70&step=1#reqid=70&step=1 . This df has 31 variables and 484 observations. The variables it contains include, GeoName (states), years (1997-2019), region, etc. I will likely use pivot longer to combine all of the years into 1 column and delete a few columns that won’t be utilized for this project for this project.

1. TIDYING

library(tidyverse)
library(tidyr)
library(dplyr)
library(kableExtra)
library(ggplot2)

# GDP by STATE df (1997-2019)
stGDP <- read_csv("SAGDP1__ALL_AREAS_1997_2019.csv")

# TIDYING GDP DF
step1 <- stGDP %>% select(2:3, 7:8, 23:31)
step2 <- step1 %>% pivot_longer(c("2011":"2019"), names_to = "Year") %>% 
    separate(Year, into = "Year", convert = T)
step3 <- step2 %>% select(1:2, 5, 3, 6) %>% na.omit() %>% pivot_wider(names_from = "Description", 
    values_from = "value")
final_GDPdf <- step3 %>% rename(RealGDP = 4, CQI = 5, NominalGDP = 6, 
    Subsidies = 11) %>% select(1:6, 11) %>% mutate(Region = as.character(Region), 
    Year = as.character((Year)))

# tidy GDP df
final_GDPdf %>% head(5) %>% kbl(caption = "**Tidy GDP**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 1: Tidy GDP
GeoName Region Year RealGDP CQI NominalGDP Subsidies
Alabama 5 2011 185666.9 99.525 181929.9 -509.6
Alabama 5 2012 186553.9 100.000 186553.9 -484.7
Alabama 5 2013 188814.2 101.212 192166.5 -457.1
Alabama 5 2014 187568.0 100.544 195037.7 -466.9
Alabama 5 2015 189428.8 101.541 200197.5 -440.7
# Road conditions / Infrastructure df (2011-2019)
allRC_ <- read_csv("allRC_.csv")

# TIDYING ROAD CONDITIONS DF
st1 <- allRC_ %>% relocate(1, 4, 2, 5) %>% select(1:4)
final_RCdf <- st1 %>% pivot_wider(names_from = "Class", values_from = "Miles") %>% 
    rename(AcceptableMiles = "Acceptable (miles)", AcceptablePercent = "Percent Acceptable", 
        TotalMiles = "Total (miles)") %>% mutate(Year = as.character(Year))

# tidy infrastructure df
final_RCdf %>% head(5) %>% kbl(caption = "**Tidy Infrastructure**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 1: Tidy Infrastructure
State Year AcceptableMiles AcceptablePercent TotalMiles
Alabama 2019 23249.922 0.8870157 26211.399
Alaska 2019 2811.892 0.8296608 3389.207
Arizona 2019 12113.409 0.7913743 15306.800
Arkansas 2019 6050.767 0.9332849 6483.301
California 2019 26223.917 0.6484012 40443.969

Since the GDP data frame was the larger one of the 2 I’m working on, I decided to tidy that one first. The first thing that I needed to do was delete the columns that I was sure I wouldn’t use. Because the road condition df only dates back to 2011, I deleted columns 9 through 22, which corresponded with years 1997 to 2010. In total, I ended up deleting 18 columns for the first step. I retained columns that would be crucial when combined later on, like States, region, years 2011-2019, and description (the type of GDP). The next thing on the agenda was to use pivot_longer to put all of the years from 2011 to 2019 into one column. Step 3 manipulated the data by getting rid of rows that contained NA values, deleting the column that contained the “X” character that we ha previously separated from the years, and re-order the columns. This was also the step where I first used pivot_wider, to alter the Description column so that each GDP variable (Real, Nominal, etc) had its own column. Lastly was the step to renaming a few essential columns to make it easier to code later, with the function rename(). Next, I had to tidy the road conditions dataset. Firstly, I organized the variables using relocate(), then selected the variables needed for this project (state, year, class, miles). This was a relatively long df so I used to the pivot_wider on the Class variable. This would allow the data to have 3 separate columns for Acceptable Miles, Percentage, and Total Miles.


2. JOINING

# inner_join() ->> combines based on common ID
comb1 <- final_RCdf %>% inner_join(final_GDPdf, by = c(State = "GeoName", 
    "Year"))
final_combined <- comb1 %>% relocate(1, 6, 2:5, 7, 9, 8, 10)
final_combined %>% head() %>% kbl(caption = "**Joined GDP and Infrastructure Dataset**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 2: Joined GDP and Infrastructure Dataset
State Region Year AcceptableMiles AcceptablePercent TotalMiles RealGDP NominalGDP CQI Subsidies
Alabama 5 2019 23249.922 0.8870157 26211.399 200829.4 228142.6 107.652 -593.7
Alaska 8 2019 2811.892 0.8296608 3389.207 53255.2 54385.6 92.269 -113.9
Arizona 6 2019 12113.409 0.7913743 15306.800 323597.6 370119.1 120.715 -809.8
Arkansas 5 2019 6050.767 0.9332849 6483.301 117447.1 130954.1 109.031 -1223.6
California 8 2019 26223.917 0.6484012 40443.969 2800505.4 3132800.6 130.615 -9110.5
Colorado 7 2019 12655.845 0.7777576 16272.224 356280.2 392986.0 130.222 -1213.8

The function inner_join was used to merge the final GDP df and the final infrastructure df. Inner_join was implemented instead of other functions like full/right/left_join because it combines via common ID (in this case State = GeoName and Year) and deletes nonmatches. Then I manipulated the combined df with relocate() to move region next to state.


3. WRANGLING PT1: DPLYR

# lag() gives percent increase from previous row
mut1 <- final_combined %>% arrange(State, Year) %>% mutate(GDPpercentChange = (RealGDP - 
    lag(RealGDP))/lag(RealGDP))

# case_when() creates categorical varia from numeric
final_df <- mut1 %>% mutate(letterGrade = case_when(AcceptablePercent > 
    0.9 ~ "A", AcceptablePercent < 0.9 & AcceptablePercent >= 
    0.8 ~ "B", AcceptablePercent < 0.8 & AcceptablePercent >= 
    0.7 ~ "C", AcceptablePercent < 0.7 & AcceptablePercent >= 
    0.6 ~ "D", AcceptablePercent < 0.6 ~ "F")) %>% select(1:4, 
    6, 5, 12, 7:8, 10:11)

final_df %>% head(5) %>% kbl(caption = "**Final DF With 2 New Variables**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 3: Final DF With 2 New Variables
State Region Year AcceptableMiles TotalMiles AcceptablePercent letterGrade RealGDP NominalGDP Subsidies GDPpercentChange
Alabama 5 2011 20988 22943 0.9147888 A 185666.9 181929.9 -509.6 NA
Alabama 5 2012 15893 17235 0.9221352 A 186553.9 186553.9 -484.7 0.0047774
Alabama 5 2013 18307 20545 0.8910684 B 188814.2 192166.5 -457.1 0.0121161
Alabama 5 2014 21856 24254 0.9011297 A 187568.0 195037.7 -466.9 -0.0066001
Alabama 5 2015 11276 11554 0.9759391 A 189428.8 200197.5 -440.7 0.0099207

To create a percent change in time for Real GDP, the dplyr vector function, lag(), was used to manipulate and create a new variable named “GDPpercentChange”. The second new variable made was a letter grade variable that took the percentages of acceptable miles and converted them into the A-D and F; case_when was used for this part.


# average acceptable amount of miles from 2014 to 2019 when
# grouped by State state with top average
final_df %>% group_by(State) %>% filter(Year >= 2014) %>% summarize(mean_AccM = mean(AcceptableMiles, 
    na.rm = T), mean_pctM = mean(AcceptablePercent, na.rm = T)) %>% 
    arrange(desc(mean_pctM)) %>% head(1) %>% kbl(caption = "**State with Top Average Acceptable Miles**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 4: State with Top Average Acceptable Miles
State mean_AccM mean_pctM
Idaho 4416.838 0.9592973
# state with bottom average
final_df %>% group_by(State) %>% filter(Year >= 2014) %>% summarize(mean_AccM = mean(AcceptableMiles, 
    na.rm = T), mean_pctM = mean(AcceptablePercent, na.rm = T)) %>% 
    arrange(mean_pctM) %>% head(2) %>% kbl(caption = "**State with Bottom Average Acceptable Miles**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 4: State with Bottom Average Acceptable Miles
State mean_AccM mean_pctM
District of Columbia 27.39643 0.0615769
Rhode Island 732.86651 0.4710058

When grouped by State and only looking at years from 2014 to 2019, the State with the highest mean percentage of acceptable miles is Idaho, with a mean value of 95% or 4,416.84 miles, while the place with the lowest mean percentage of acceptable miles is Washington D.C. with and average of 27.39 acceptable miles (6.16%). Since D.C. isn’t technically a state, Rhode Island comes in first place as the state with a lowest value of 732.87 miles (or 47%).


# States with highest and lowest Real GDP in 2019
final_df %>% group_by(State) %>% filter(Year == 2019) %>% select(1:4, 
    7:8) %>% arrange(desc(RealGDP)) %>% head(3) %>% kbl(caption = "**States with Highest GDP in 2019**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 5: States with Highest GDP in 2019
State Region Year AcceptableMiles letterGrade RealGDP
California 8 2019 26223.92 D 2800505
Texas 6 2019 69296.62 C 1764357
New York 2 2019 20097.64 C 1490678
final_df %>% group_by(State) %>% filter(Year == 2019) %>% select(1:4, 
    7:8) %>% arrange(RealGDP) %>% head(3) %>% kbl(caption = "**States with Lowest GDP in 2019**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 5: States with Lowest GDP in 2019
State Region Year AcceptableMiles letterGrade RealGDP
Vermont 1 2019 3164.933 B 29806.2
Wyoming 7 2019 6793.627 A 39214.0
South Dakota 4 2019 12497.290 B 47559.7
# States with highest and lowest Acceptable Miles in 2019
final_df %>% group_by(State) %>% filter(Year == 2019) %>% select(1:4, 
    7:8) %>% arrange(desc(AcceptableMiles, RealGDP)) %>% head(3) %>% 
    kbl(caption = "**States with Highest Acceptable Miles in 2019**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 5: States with Highest Acceptable Miles in 2019
State Region Year AcceptableMiles letterGrade RealGDP
Texas 6 2019 69296.62 C 1764357.2
Georgia 5 2019 29499.46 A 547422.7
Minnesota 4 2019 26337.43 B 341041.4
final_df %>% group_by(State) %>% filter(Year == 2019) %>% select(1:4, 
    7:8) %>% arrange(AcceptableMiles, RealGDP) %>% head(3) %>% 
    kbl(caption = "**States with Lowest Acceptable Miles in 2019**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 5: States with Lowest Acceptable Miles in 2019
State Region Year AcceptableMiles letterGrade RealGDP
District of Columbia 2 2019 31.37858 F 123929.3
Rhode Island 1 2019 849.19907 F 53668.0
Hawaii 8 2019 889.48000 F 82471.4

The data displayed above may indicate that there is a moderately strong relationship between how much money a state has and its acceptable road conditions.


# Mean Real GDP for each region (2011-2019) and the mean
# acceptable miles (%)
final_df %>% group_by(Region) %>% summarize(meanGDP = mean(RealGDP, 
    na.rm = T), meanpctM = mean(AcceptablePercent, na.rm = T)) %>% 
    arrange(desc(meanGDP)) %>% kbl(caption = "**Mean GDP for Each Region in the U.S.**") %>% 
    kable_styling(bootstrap_options = "striped", full_width = F)
Table 6: Mean GDP for Each Region in the U.S.
Region meanGDP meanpctM
8 558101.8 0.7423472
6 531826.8 0.7965897
2 520362.3 0.6043135
3 472222.2 0.8196310
5 303902.9 0.8467925
4 159013.2 0.8776040
1 154117.5 0.6856085
7 120013.3 0.8745784

The region with the lowest mean Real GDP is region 8, of $558,101.8; also known as the Far West, this region has an average percentage of 74.23% acceptable miles. The region with the highest mean Real GDP is region 7, the Rocky Mountains, with an average percentage of 87.46% acceptable miles, it has a mean real GDP of $ 120,013.30.


3. WRANGLING PT2: SUMMARY STATISTICS

# mean, median, sd, n, quantile, min, max

final_df %>% select(4:6, 8:11) %>% summarize_all(mean, na.rm = T) %>% 
    kbl(caption = "**Mean Values**") %>% kable_styling(bootstrap_options = c("striped", 
    "hover", "condensed", "responsive"))
Table 7: Mean Values
AcceptableMiles TotalMiles AcceptablePercent RealGDP NominalGDP Subsidies GDPpercentChange
11484.96 14389.93 0.78759 338115.7 356419 -1202.347 0.1890752
final_df %>% select(4:6, 8:11) %>% summarize_all(median, na.rm = T) %>% 
    kbl(caption = "**Median Values**") %>% kable_styling(bootstrap_options = c("striped", 
    "hover", "condensed", "responsive"))
Table 7: Median Values
AcceptableMiles TotalMiles AcceptablePercent RealGDP NominalGDP Subsidies GDPpercentChange
10298 12451 0.8140805 193204.7 204454.7 -833.6 0.0162134
final_df %>% select(4:6, 8:11) %>% summarize_all(n_distinct, 
    na.rm = T) %>% kbl(caption = "**Unique Values (n)**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 7: Unique Values (n)
AcceptableMiles TotalMiles AcceptablePercent RealGDP NominalGDP Subsidies GDPpercentChange
455 450 458 459 459 454 458
final_df %>% select(4:6, 8:11) %>% summarize_all(quantile, na.rm = T) %>% 
    kbl(caption = "**Quantiles (0%, 25%, 50%, 75%, 100%)**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 7: Quantiles (0%, 25%, 50%, 75%, 100%)
AcceptableMiles TotalMiles AcceptablePercent RealGDP NominalGDP Subsidies GDPpercentChange
0.000 0.00 0.0315315 28367.9 28134.60 -9110.50 -0.9294062
4634.828 6324.00 0.7352503 81621.7 87639.45 -1336.75 0.0057819
10298.000 12451.00 0.8140805 193204.7 204454.70 -833.60 0.0162134
16792.000 20835.50 0.8944389 438981.5 441234.85 -376.30 0.0295030
69296.615 88738.07 0.9954885 2800505.4 3132800.60 -81.80 16.8023706
final_df %>% select(4:6, 8:11) %>% summarize_all(min, na.rm = T) %>% 
    kbl(caption = "**Minimum Values**") %>% kable_styling(bootstrap_options = c("striped", 
    "hover", "condensed", "responsive"))
Table 7: Minimum Values
AcceptableMiles TotalMiles AcceptablePercent RealGDP NominalGDP Subsidies GDPpercentChange
0 0 0.0315315 28367.9 28134.6 -9110.5 -0.9294062
final_df %>% select(4:6, 8:11) %>% summarize_all(max, na.rm = T) %>% 
    kbl(caption = "**Maximum Values**") %>% kable_styling(bootstrap_options = c("striped", 
    "hover", "condensed", "responsive"))
Table 7: Maximum Values
AcceptableMiles TotalMiles AcceptablePercent RealGDP NominalGDP Subsidies GDPpercentChange
69296.62 88738.07 0.9954885 2800505 3132801 -81.8 16.80237
# group_by() 1 variable median values for each year
final_df %>% group_by(Year) %>% select(4:6, 8:11) %>% summarize_all(median, 
    na.rm = T) %>% arrange(desc(Year)) %>% kbl(caption = "**Median Values for each year (2011-2019)**") %>% 
    kable_styling(bootstrap_options = "striped", full_width = F)
Table 7: Median Values for each year (2011-2019)
Year AcceptableMiles TotalMiles AcceptablePercent RealGDP NominalGDP Subsidies GDPpercentChange
2019 11200.49 13024.44 0.8190058 214933.7 247543.8 -1140.5 0.0160876
2018 10308.00 12764.00 0.8128666 209012.5 235286.9 -891.0 0.0252173
2017 11055.00 12667.00 0.8155894 202644.5 223414.0 -782.9 0.0153420
2016 10064.00 12571.00 0.8171199 196477.4 213584.6 -851.6 0.0114175
2015 11276.00 12647.00 0.8345023 191864.2 202718.6 -814.1 0.0263427
2014 11353.00 12664.00 0.8148526 186307.0 193546.4 -798.1 0.0202864
2013 8384.00 9853.00 0.7938532 179499.1 182837.0 -798.8 0.0092313
2012 10369.00 12050.00 0.8064304 176323.1 176323.1 -795.8 0.0116124
2011 10171.00 11320.00 0.7984190 174202.7 170620.6 -827.1 -0.2537898
# group_by() 2 variables median values for 2011 and 2019,
# letters A and B
final_df %>% group_by(Year, letterGrade) %>% filter((Year == 
    2011 | Year == 2019) & (letterGrade == "A" | letterGrade == 
    "B")) %>% select(4:6, 8:11) %>% summarize_all(median, na.rm = T) %>% 
    arrange(desc(Year)) %>% kbl(caption = "Median Values of States rated A and B in 2011 and 2019") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 7: Median Values of States rated A and B in 2011 and 2019
Year letterGrade AcceptableMiles TotalMiles AcceptablePercent RealGDP NominalGDP Subsidies GDPpercentChange
2019 A 6793.627 7173.702 0.9341532 173515.4 194658.1 -939.60 0.0157415
2019 B 11848.888 13598.657 0.8574128 180444.4 203171.0 -965.75 0.0168551
2011 A 10171.000 10970.000 0.9365276 173881.1 170004.7 -683.70 -0.0165036
2011 B 14719.000 17384.500 0.8791105 280414.2 274859.2 -1085.80 -0.2044551

2019 had the highest acceptable miles, with a value of 69,296.2, and it had the highest values for both Real GDP ($2,800505) and Nominal GDP ($3,132,801). That being said, 2019 had one of the smallest GDP percent change (5.21%), whereas 2011 had the largest percent change, with a staggering value of 168.02%. It seems that the median acceptable miles for states that earned an A in 2011 decreased by 3,377.37 miles by 2019.


4. VISUALIZATION: HEAT MAP

library(reshape2)
# correlation matrix
cor_df <- final_df[, c(4:6, 8:11)]
round_cor_df <- round(cor(cor_df, use = "na.or.complete"), 2)

# upper triangle of matrix
get_upper_tri <- function(round_cor_df) {
    round_cor_df[lower.tri(round_cor_df)] <- NA
    return(round_cor_df)
}
# pattern
reorder_df <- function(round_cor_df) {
    dd <- as.dist((1 - round_cor_df)/2)
    hc <- hclust(dd)
    round_cor_df <- round_cor_df[hc$order, hc$order]
}
cordf <- reorder_df(round_cor_df)
upper <- get_upper_tri(cordf)
# Melt correlation matrix
melted_cordf <- melt(upper, na.rm = TRUE)

# Correlation Heat Map
ggplot(melted_cordf, aes(Var2, Var1, fill = value)) + geom_tile(color = "white") + 
    scale_fill_gradient2(low = "steelblue3", high = "indianred3", 
        mid = "oldlace", midpoint = 0, limit = c(-1, 1), space = "Lab", 
        name = "Correlation Heatmap") + theme_minimal() + theme(axis.text.x = element_text(angle = 50, 
    vjust = 1, size = 10, hjust = 1)) + coord_fixed() + geom_text(aes(Var2, 
    Var1, label = value), color = "black", size = 3) + theme(axis.title.x = element_blank(), 
    axis.title.y = element_blank(), panel.grid.major = element_blank(), 
    panel.border = element_blank(), panel.background = element_blank(), 
    axis.ticks = element_blank(), legend.justification = c(1, 
        0), legend.position = c(0.65, 0.75), legend.direction = "horizontal") + 
    guides(fill = guide_colorbar(barwidth = 7.5, barheight = 1, 
        title.position = "top", title.hjust = 0.5))

This heat map displays the correlation between the 7 numeric variables in the final dataframe. The variables with the strongest positive correlation are [Acceptable Miles and Total Miles] where r=0.97 and [Nominal GDP and Total Miles] where r=0.65. It is an interesting conclusion that states with more miles of road tend to also have more affluence in spending. Both [Nominal and Real GDP] have strong negative correlation with [Subsidies], with r=-0.94. This makes sense due to the fact that states with a large GDP tend to not need financial assistance from the government. On the other hand, the weakest correlation between all of the variables is [Acceptable Percent and GDP percent change], where r=-0.02. This result could be due to the fact that the GDP percent change depended on the value of the previous year, whereas, as long as a state stayed consistent in maintaining road conditions, the percent of acceptable miles would barely differ.


4. VISUALIZATION: GGPLOT 1

ggplot(final_df, aes(x = RealGDP, y = AcceptableMiles, color = Region)) + 
    geom_smooth(aes(fill = Region), alpha = 0.15) + scale_fill_brewer(palette = "Paired") + 
    geom_point(alpha = 0.125, size = 1.5) + stat_summary(fun.data = final_df, 
    geom = "point") + labs(x = "GDP ($)", y = "Acceptable Miles (mi)", 
    title = "Overall Acceptable Miles for Each Region as a Function of GDP") + 
    scale_x_log10(labels = scales::number) + scale_y_log10(labels = scales::number) + 
    scale_y_continuous(breaks = seq(-15000, 70000, 15000))

The above plot is a LOESS plot. In it, you can clearly see the relationship between acceptable miles and GDP for each region. Region 6, the Southwest, is the most notable of the regions within the graph, having peaks in both the $170,000 range and $1.5 million range for real GDP. For all regions, the amount of acceptable miles increases as real GDP increases. Region 1 has the highest amount of acceptable miles with the lowest GDP, in the $30,000 range, while region 6 has the highest acceptable miles at around $1.45 million GDP.


4. VISUALIZATION: GGPLOT 2

final_df %>% ggplot(aes(x = Region, y = AcceptablePercent, fill = Region)) + 
    geom_boxplot(alpha = 0.65) + scale_fill_brewer(palette = "Paired") + 
    geom_jitter(width = 0.1, alpha = 0.25) + xlab("Regions") + 
    facet_wrap(~Year) + theme(axis.text.x = element_text(angle = 10, 
    hjust = 1), axis.text.y = element_text(angle = 25, hjust = 1)) + 
    labs(y = "Acceptable Miles (%)", title = "Acceptable Miles (%) of Each Region in the U.S. from 2011-2019")

Above is a grouped boxplot presenting the percentage of acceptable miles of each region in the United States from 2011 to 2019. From a glance, it is clear that regions 1 and 2 had the most drastic outliers, pulling both of their distributions percentage of quality roads down and increasing their variability. On the upside, Region 4 tended to have the highest values within its distribution, consistently staying above the 80% mark.


5. DIMENSIONALITY REDUCTION

# cluster data
library(cluster)


# summarize all to get rid of duplicate create an average of
# all variables over the years of 2011-2019
attempt <- final_df %>% group_by(State, Region) %>% summarize_all(mean, 
    na.rm = T) %>% ungroup(State, Region) %>% select(1:2, 4:5, 
    8:10)
# convert char variables to factor
dat2 <- attempt %>% mutate_if(is.character, as.factor) %>% column_to_rownames("State")

# compute gower distances
gower1 <- daisy(dat2, metric = "gower")

# silhouette width biggest silhouette width = number of
# clusters
sil_width <- vector()
for (i in 2:7) {
    pam_fit <- pam(gower1, diss = TRUE, k = i)
    sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:7, y = sil_width)) + scale_x_continuous(name = "k", 
    breaks = 1:7) + labs(title = "Silhouette Width")

# run pam with k = 7
pam3 <- pam(gower1, k = 7, diss = T)

# average silhouette width
pam3$silinfo$avg.width
## [1] 0.4135715
# interpret fit with silhouette plot avg sil width = 0.41 ->>
# structure is weak and possibly artificial
plot(pam3, which = 2)

# Which states are the most similar? most different?
gower1 %>% as.matrix %>% as.data.frame %>% rownames_to_column %>% 
    pivot_longer(-1, values_to = "distance") %>% filter(rowname != 
    name) %>% filter(distance %in% c(min(distance), max(distance))) %>% 
    distinct() %>% kbl(caption = "**Minimum and Maximum Distances**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 8: Minimum and Maximum Distances
rowname name distance
California Rhode Island 0.9488841
New Hampshire Vermont 0.0088848
Rhode Island California 0.9488841
Vermont New Hampshire 0.0088848
# proportion of Region per cluster
dat2 %>% mutate(cluster = factor(pam3$clustering)) %>% select(cluster, 
    Region) %>% group_by(cluster, Region) %>% summarize(n = n()) %>% 
    mutate(prop = n/sum(n, na.rm = T)) %>% pivot_wider(-n, names_from = Region, 
    values_from = prop) %>% kbl(caption = "**Proportion of Each Region per Cluster**") %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", 
        "responsive"))
Table 8: Proportion of Each Region per Cluster
cluster 5 6 2 8 4 7 1 3
1 0.9230769 0.0769231 NA NA NA NA NA NA
2 NA NA 0.1666667 0.8333333 NA NA NA NA
3 NA 0.2222222 NA NA 0.7777778 NA NA NA
4 NA 0.3333333 0.3333333 0.3333333 NA NA NA NA
5 NA NA 0.1666667 NA NA 0.8333333 NA NA
6 NA NA 0.2500000 NA NA NA 0.75 NA
7 NA NA 0.1666667 NA NA NA NA 0.8333333
library(GGally)
# change column names to have spaces
colnames(dat2) <- make.names(c("Region", "Acceptable Miles", 
    "Total Miles", "Real GDP", "Nominal GDP", "Subsidies"))

# ggpairs plot / cluster assignment
ggpairs(dat2, columns = 1:6, aes(color = as.factor(pam3$clustering)), 
    upper = list(continuous = wrap("cor", size = 1.75)), columnLabels = gsub(".", 
        " ", colnames(dat2), fixed = T), labeller = label_wrap_gen(10)) + 
    theme_gray(base_size = 8.5) + theme(axis.text.x = element_text(angle = 90, 
    hjust = 1, size = 8))

Since gower and pam do not work with duplicate values, the dataset was grouped by State and Region then averaged for all values. Variables Year, letterGrade, and GDP percentchange were all removed, retaining 2 categorical variables and 5 numeric. The categorical variables were then converted into factored values. To know how many clusters were needed, gower distances were computed and then a silhouette width vector was created. The plot shows that the ideal amount of clusters for this data is 7. With the amount of clusters determined, pam was computed. The average silhouette width was found to be 0.41, concluding that the structure of the relationship between the variables is barely acceptable, weak and possibly artificial. The states found to have the highest gower dissimilarity were California an Rhode Island. New Hampshire and Vermont were found to have the lowest Gower dissimilarity. 92% of cluster 1 consisted of region 5. 83% of cluster 2 and 7 contained regions 8 and 3, respectively. Region 2 was the only region to be found in 5 clusters. A ggpairs plot was created to visualize the assignment of clusters among the variables. Region 1 can be seen having the lowest values for every numeric variable, including subsidies. Region 4 looks to have the greatest range with high variability for everything, narrowing only for total miles, which may mean that it contains some states with a large amount of roads. Total miles has the correlation of all of the variables with Acceptable miles (cor = 0.972). Subsidies has a relatively negative correlation with all of the variables overall, however in Acceptable Miles, it is neat to note that clusters 4, 5, and 6 all have a positive correlation with Subsidies, with 4 having the strongest positive value of cor = 0.655.




comments powered by Disqus