Bank of America, National Association Customer Complaint Analysis
Author
Hue Nguyen Denke
Published
October 2, 2025
1 Introduction
This project analyzes customer complaint narratives filed against Bank of America to examine the relationship between customers' emotions and the complaint dispute rate. I compare the emotional content of disputed vs. non-disputed complaints to identify emotional patterns that might predict complaint resolution difficulty.
Our dataset spans nearly 14 years (2011-12-01 to 2025-09-24) and contains 11,109,951 complaints.
These complaints were filed against 7,756 US companies, including banks and credit bureaus.
The top 5 companies with most complaints are:
Equifax, Inc.
Transunion Intermediate Holdings, Inc.
Experian Information Solutions, Inc.
Bank of America
Wells Fargo & Company
The following analysis focuses on Bank of America only.
5 Key Findings
5.1 High-level view of the customer complaint
The most frequent negative words suggest that the most common problems relate to fraud, debt, or denied requests.
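The word counting behind this observation can be sketched as follows. This is a minimal illustration in base R, assuming two invented complaint texts and a tiny made-up list of negative words (`negative_words`) in place of the real narratives and the Bing lexicon.

```r
# Toy complaint narratives (invented for illustration only)
narratives <- c(
  "I reported fraud on my card and the claim was denied",
  "The debt is not mine and the bank denied my dispute about fraud"
)

# Hypothetical stand-in for the negative half of a sentiment lexicon
negative_words <- c("fraud", "debt", "denied", "dispute")

# Tokenize: lowercase, then split on anything that is not a letter
tokens <- unlist(strsplit(tolower(narratives), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Keep only lexicon words and count them, most frequent first
negative_counts <- sort(table(tokens[tokens %in% negative_words]),
                        decreasing = TRUE)
print(negative_counts)
```

A frequency table of this shape (word, count) is what a word cloud takes as input: the larger the count, the larger the word is drawn.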
5.2 Net sentiment emotions related to each product
Net sentiment is the number of positive words minus the number of negative words in the complaints for each product (i.e., net sentiment = positive - negative).
The largest emotion gap is in checking or savings accounts, followed by credit card or prepaid card; mortgage and debt collection also show large gaps.
Since the "other financial services" product category is quite general, Bank of America should conduct further analysis on the credit card, prepaid card, and debt collection products to identify the root causes of negative complaints.
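To make the net sentiment formula concrete, here is a small worked sketch in base R. The products and sentiment labels in `words` are invented; in the real analysis each row would be a complaint word matched against the Bing lexicon.

```r
# Toy word-level data: one row per lexicon word found in a complaint.
# Products and sentiment labels are invented for illustration.
words <- data.frame(
  Product   = c("Checking account", "Checking account", "Checking account",
                "Mortgage", "Mortgage", "Credit card"),
  sentiment = c("negative", "negative", "positive",
                "negative", "positive", "negative")
)

# Cross-tabulate sentiment counts per product; fixing the factor levels
# guarantees both columns exist even if one sentiment never occurs
counts <- table(words$Product,
                factor(words$sentiment, levels = c("positive", "negative")))

# Net sentiment = positive word count - negative word count
net_sentiment <- counts[, "positive"] - counts[, "negative"]
print(sort(net_sentiment))
```

Because this is a raw difference of counts, products with many complaints tend to show larger gaps; dividing by the total word count per product would be one way to compare products of different volume.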
5.3 Comparative Analysis using nrc sentiment
Method: Compare the emotional content of disputed vs. non-disputed complaints
Goal: Identify emotional patterns that might predict complaint resolution difficulty
Result: Largest dispute ratio falls within trust and positive emotions
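The comparison method can be sketched as follows in base R. The data frame `matches` is invented for illustration: each row stands for one complaint word matched to an emotion, where the real analysis would use the NRC lexicon. Emotion counts are normalized within each complaint and then averaged by dispute status.

```r
# Toy data: one row per (complaint, matched emotion word); values invented
matches <- data.frame(
  Complaint.ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  disputed     = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No"),
  emotion      = c("trust", "trust", "anger", "fear",
                   "anger", "trust",
                   "fear", "fear", "trust")
)

# Count matched words per complaint and emotion
counts <- aggregate(n ~ Complaint.ID + disputed + emotion,
                    data = transform(matches, n = 1), FUN = sum)

# Normalize within each complaint so long complaints do not dominate
counts$proportion <- counts$n / ave(counts$n, counts$Complaint.ID, FUN = sum)

# Average proportion per dispute status and emotion
emotion_by_dispute <- aggregate(proportion ~ disputed + emotion,
                                data = counts, FUN = mean)
print(emotion_by_dispute)
```

Note that in this sketch an emotion that never appears in a complaint contributes no row, so the averages are taken only over complaints that contain that emotion.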
5.4 Perform statistical analysis to find correlation between emotion and dispute rate
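The modeling step named in this heading can be sketched as follows. This is an illustration on simulated data, not the report's results: the emotion counts, the assumed effect of `trust`, and the sample size are all made up. It shows a logistic regression of a binary dispute flag on per-complaint emotion counts, followed by the two kinds of significance test that can be read from such a model.

```r
set.seed(42)

# Simulated per-complaint emotion word counts (not real complaint data)
n <- 500
emotion_wide <- data.frame(
  anger = rpois(n, 3),
  trust = rpois(n, 4),
  joy   = rpois(n, 1)
)

# Simulate the dispute outcome with an assumed positive effect of trust
log_odds <- -1.5 + 0.3 * emotion_wide$trust
emotion_wide$binary_dispute <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: dispute (1/0) as a function of emotion counts
dispute_model <- glm(binary_dispute ~ anger + trust + joy,
                     data = emotion_wide, family = "binomial")

# Wald z-tests: each coefficient is adjusted for all other predictors
print(summary(dispute_model)$coefficients)

# Sequential Chi-squared tests: terms are added in formula order
print(anova(dispute_model, test = "Chisq"))
```

The two tables answer different questions. The coefficient test asks whether a predictor matters after all others are accounted for, while the sequential test asks whether it improves the model given only the terms listed before it, so a predictor can be significant in one and not in the other.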
---
title: Bank of America, National Association Customer Complaint Analysis
author: Hue Nguyen Denke
date: last-modified
format:
  html:
    toc: true
    toc-title: "Contents"
    toc-expand: 8
    toc-depth: 6
    toc-location: left
    code-fold: false
    code-tools:
      toggle: true
    code-summary: "Show the code"
    number-sections: true
    number-depth: 8
    page-layout: full
    embed-resources: true
    anchor-sections: true
    smooth-scroll: true
engine: knitr
knitr:
  opts_chunk:
    R.options:
      width: 120
editor_options:
  chunk_output_type: console
---

```{r, echo=FALSE, results='hide'}
knitr::opts_chunk$set(message = FALSE, warning = FALSE, echo = FALSE)
```

# Introduction

This project analyzes customer complaint narratives filed against Bank of America to examine the relationship between customers' emotions and the complaint dispute rate. I compare the emotional content of disputed vs. non-disputed complaints to identify emotional patterns that might predict complaint resolution difficulty.

Source: `https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data`

# Data Dictionary

```{r}
library(dplyr)
library(tidytext)
library(janeaustenr)
library(stringr)
library(ggplot2)
library(tidyr)
library(readr)
```

```{r,eval=F}
# 15 mins importing
# data <- read.csv("complaints.csv")
# once importing successfully, we save it to rds file
# saveRDS(data, "data.rds")
data <- readRDS("data.rds")
```

```{r,eval=F}
# dim(data)
```

Our dataset includes the following columns:

* `Date.received`: Date that the complaint was received
* `Product`: Product that the complaint is about (e.g., Debt Collection, Mortgage)
* `Sub.product`: Subcategory of the product (e.g., Credit Card, Debit Card)
* `Issue`: Issue category
* `Sub.issue`: Sub-issue category
* `Consumer.complaint.narrative`: Consumer's explanation of the issue
* `Company.public.response`: The company's public response to the complaint
* `Company`: Name of the company. Note: Our dataset includes all US companies, but for the scope of this project I keep only Bank of America
* `State`: The state of the mailing address provided by the consumer
* `ZIP.code`: The mailing ZIP code provided by the consumer
* `Tags`: Tags flagging complaints submitted by or on behalf of certain consumers (e.g., Older American, Servicemember)
* `Consumer.consent.provided.`: Whether the consumer consented to publishing the complaint narrative
* `Submitted.via`: Channel through which the complaint was submitted
* `Date.sent.to.company`: Date that the complaint was sent to the company
* `Company.response.to.consumer`: Company's resolution of the complaint
* `Timely.response`: Whether the company responded in a timely manner
* `Consumer.disputed.`: Whether the consumer disputed the resolution
* `Complaint.ID`: The unique identifier of each complaint

# Data Cleaning Methodology To Ensure Tidy Data

Converted all date columns from character to `Date` format

```{r,eval=F, echo=T}
data$Date.received <- as.Date(data$Date.received, format = "%Y-%m-%d")
data$Date.sent.to.company <- as.Date(data$Date.sent.to.company, format = "%Y-%m-%d")
```

Standardized empty cells to `NA`

```{r, eval = F, echo=T}
data$Consumer.complaint.narrative[which(data$Consumer.complaint.narrative == "")] <- NA
```

# Data Summary

Our dataset spans nearly 14 years (2011-12-01 to 2025-09-24) and contains 11,109,951 complaints.

These complaints were filed against 7,756 US companies, including banks and credit bureaus.

```{r,eval=F}
range(data$Date.received)
```

The top 5 companies with most complaints are:

i. Equifax, Inc.
ii. Transunion Intermediate Holdings, Inc.
iii. Experian Information Solutions, Inc.
iv. Bank of America
v. Wells Fargo & Company

The following analysis focuses on Bank of America only.

```{r,eval=F}
as.data.frame(sort(table(data$Company, useNA = "always"), decreasing = T)) -> df_bank
as.character(df_bank$Var1[1:5])
```

# Key Findings

```{r}
# df_ba <- data |> subset(Company == "BANK OF AMERICA, NATIONAL ASSOCIATION")
# dim(df_ba)
# saveRDS(df_ba, "df_ba.rds")
df_ba <- readRDS("df_ba.rds")
df_ba$Consumer.complaint.narrative[which(df_ba$Consumer.complaint.narrative == "")] <- NA
```

## High-level view of the customer complaint

```{r}
# a. High level overview
# Use word cloud and sentiment bing
# get_sentiments("bing")

# tokenize words: one row per word in each complaint narrative
df_ba <- df_ba %>%
  unnest_tokens(word, Consumer.complaint.narrative)

# get negative words from bing
bing_negative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

# count negative words to prep for word cloud
negative_word_count <- df_ba %>%
  inner_join(bing_negative, by = "word") %>%
  count(word, sort = TRUE)
# negative_word_count
```

```{r, results='hide'}
library(wordcloud2)
library(webshot)
library(htmlwidgets)
# negative_word_count <- readRDS("negative_word_count.rds")
# set.seed(4)
#
# word_cloud_fig <- wordcloud2(data = negative_word_count,
#                              size = 3,
#                              gridSize = 3,
#                              color = "random-dark",
#                              shape = "circle",
#                              backgroundColor = "white")
#
# saveWidget(word_cloud_fig,
#            "word_cloud_fig.html",
#            selfcontained = F)
#
# webshot::webshot("word_cloud_fig.html",
#                  "word_cloud_fig.png",
#                  vwidth = 2000,
#                  vheight = 2000,
#                  delay = 8)
```

The most frequent negative words suggest that the most common problems relate to fraud, debt, or denied requests.

## Net sentiment emotions related to each product

```{r}
# Create chart to analyze emotion based on each product
complaint_product_sentiment <- df_ba %>%
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(Product, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

# order products by net sentiment (most negative first)
complaint_product_sentiment$Product <- reorder(complaint_product_sentiment$Product,
                                               complaint_product_sentiment$sentiment)
```

```{r}
# note: despite the name, this keeps the first 20 levels (lowest net sentiment)
levels(complaint_product_sentiment$Product)[1:20] -> top10

complaint_product_sentiment_top10 <- complaint_product_sentiment |>
  subset(Product %in% top10)

complaint_product_sentiment_top10$Product <- as.character(complaint_product_sentiment_top10$Product)

# wrap one long product name so it fits on the axis
complaint_product_sentiment_top10$Product[8] <- "Credit reporting, credit repair services,\nor other personal consumer reports"

complaint_product_sentiment_top10$Product <- reorder(complaint_product_sentiment_top10$Product,
                                                     complaint_product_sentiment_top10$sentiment)

library(RColorBrewer)
n <- 10
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector = unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
set.seed(17)
col_vector_a <- col_vector[sample(1:length(col_vector), size = n)]

p1 <- ggplot(complaint_product_sentiment_top10,
             aes(x = Product, y = sentiment, fill = Product)) +
  geom_col() +
  labs(title = "Net Sentiment by Product",
       x = "Product",
       y = "Net Sentiment (Positive - Negative)") +
  theme_bw(base_size = 10) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, color = "black")) +
  theme(text = element_text(face = "bold", color = "black"))
# p1

ggsave(filename = "p1.png",
       plot = p1,
       height = 8,
       width = 12,
       units = "in",
       dpi = 300)
```

* Net sentiment is the number of positive words minus the number of negative words in the complaints for each product (i.e., net sentiment = positive - negative)
* The largest emotion gap is in checking or savings accounts, followed by credit card or prepaid card; mortgage and debt collection also show large gaps
* Since the "other financial services" product category is quite general, Bank of America should conduct further analysis on the credit card, prepaid card, and debt collection products to identify the root causes of negative complaints.

## Comparative Analysis using `nrc` sentiment

```{r}
# Method: Compare the emotional content of disputed vs. non-disputed complaints
# Goal: identify emotional patterns that might predict complaint resolution difficulty
nrc <- get_sentiments("nrc")

# join with nrc
complaints_tibble <- df_ba %>%
  anti_join(stop_words, by = "word") %>% # remove common stop words
  inner_join(nrc, relationship = "many-to-many")

# calculate emotion scores for each complaint
emotion_scores <- complaints_tibble %>%
  count(Complaint.ID, Consumer.disputed., sentiment) %>%
  group_by(Complaint.ID) %>%
  mutate(emotion_proportion = n / sum(n)) %>%
  ungroup()

# aggregate emotions by dispute status
emotion_by_dispute <- emotion_scores %>%
  group_by(Consumer.disputed., sentiment) %>%
  summarise(
    avg_score = mean(n, na.rm = TRUE),
    avg_proportion = mean(emotion_proportion, na.rm = TRUE),
    .groups = "drop"
  )
```

```{r}
# Create stacked bar chart visualization
emotion_by_dispute$Consumer.disputed. <- factor(emotion_by_dispute$Consumer.disputed.,
                                                levels = c("No", "Yes", "N/A"))

emotion_by_dispute |>
  dplyr::arrange(desc(sentiment), desc(avg_score)) -> emotion_by_dispute

# Consumer.disputed. == "Yes" means the consumer disputed the resolution
p2 <- ggplot(emotion_by_dispute,
             aes(x = sentiment,
                 y = avg_proportion,
                 fill = Consumer.disputed.)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_manual(values = c("No" = "#E74C3C", "Yes" = "#3498DB"),
                    labels = c("No" = "Not Disputed", "Yes" = "Disputed")) +
  labs(title = "Emotional Content in Disputed vs. Non-Disputed Complaints",
       subtitle = "Comparison of normalized emotion scores",
       x = "Emotion",
       y = "Average Proportion of Words",
       fill = "Complaint Status") +
  theme_bw(base_size = 12) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 15, face = "bold")) +
  theme(axis.text.y = element_text(angle = 0, hjust = 1, size = 15, face = "bold"))
# p2

ggsave(filename = "p2.png",
       plot = p2,
       height = 8,
       width = 10,
       units = "in",
       dpi = 300)
```

* Method: Compare the emotional content of disputed vs. non-disputed complaints
* Goal: Identify emotional patterns that might predict complaint resolution difficulty
* Result: Largest dispute ratio falls within trust and positive emotions

## Perform statistical analysis to find correlation between emotion and dispute rate

```{r}
# Statistical analysis - logistic regression
# Reshape data for logistic regression
complaints_tibble <- complaints_tibble %>%
  mutate(binary_dispute = case_when(
    Consumer.disputed. == "Yes" ~ 1,
    Consumer.disputed. == "No" ~ 0,
    .default = NA
  ))

# one row per complaint, one column per emotion word count
emotion_wide <- complaints_tibble %>%
  count(Complaint.ID, binary_dispute, sentiment) %>%
  pivot_wider(
    id_cols = c(Complaint.ID, binary_dispute),
    names_from = sentiment,
    values_from = n,
    values_fill = 0
  )
```

**Run model**

```{r, eval=F}
# Run logistic regression
# table(emotion_wide$binary_dispute, useNA = "always")
dispute_model <- glm(binary_dispute ~ anger + fear + joy + sadness + trust + surprise + anticipation + disgust,
                     data = emotion_wide,
                     family = "binomial")

# View model summary
summary(dispute_model)
```

Significant predictors are:

* `joy` (p = 0.000157): negative relationship
* `sadness` (p = 0.002610): positive relationship
* `trust` (p = 1.15e-08): positive relationship
* `surprise` (p = 0.010196): negative relationship
* `anticipation` (p = 0.004157): positive relationship

## Validate model with Chi-Squared Test

```{r}
# chisq_test <- anova(dispute_model, test = "Chisq")
# print(chisq_test)
```

Significant predictors are:

* Anger (p = 2.700e-11)
* Fear (p = 3.797e-05)
* Sadness (p = 1.841e-06)
* Trust (p = 1.888e-11)
* Surprise (p = 0.029413)
* Anticipation (p = 0.004549)

Joy is significant in the coefficient test but not in the sequential test, suggesting it may share explanatory power with variables added earlier.

Anger and fear are significant in the sequential test but not in the coefficient test.

# Final Suggestions

* Focus on sadness, trust, surprise and anticipation as the primary findings since they are significant in both tests
* Acknowledge joy as potentially important since it is significant when controlling for all variables
* Consider whether to include anger and fear based on the research question and theoretical framework

# R script

[Link](https://github.com/huedenke/project/blob/main/Clean%20R%20Code.R)