```r
# loading packages
library(tidyverse)
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
```
In today’s digital landscape, customer support conversations increasingly take place over chat and social media platforms. These short-form exchanges are often emotionally charged and can signal a customer’s satisfaction, frustration, or potential escalation. Understanding the emotional tone behind these messages is critical for improving service quality, anticipating customer needs, and enhancing the overall customer experience. Yet, analyzing this kind of shorthand-heavy language presents a unique challenge for traditional sentiment analysis models.
This project explores how VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analysis tool, can be used to classify the tone of real customer support messages (Hutto and Gilbert 2014). Unlike deep learning models, VADER is lightweight, interpretable, and tuned for informal language like that found in chat and social media channels. It captures nuances in language through lexical scoring that accounts for punctuation, capitalization, and emojis, making it especially well suited to these use cases (Barik and Misra 2024).
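To give a rough feel for how lexicon-and-rule scoring of this kind works, the toy Python sketch below applies two VADER-style heuristics (ALL-CAPS emphasis and exclamation-mark boosting) on top of an invented mini-lexicon. This is not the real VADER — the actual tool is the Python `vaderSentiment` package, and the lexicon values and boost constants here are made up for illustration:

```python
# Toy illustration of VADER-style lexicon scoring. NOT the real VADER:
# the lexicon entries and boost constants below are invented.

# Hypothetical mini-lexicon: word -> valence (VADER's lexicon uses [-4, 4]).
LEXICON = {"love": 3.2, "great": 3.1, "ok": 0.9, "slow": -1.5, "terrible": -3.4}

def toy_score(text: str) -> float:
    """Sum word valences, amplifying fully capitalized lexicon words
    and adding a boost per trailing exclamation mark."""
    score = 0.0
    for token in text.split():
        word = token.strip("!?.,")
        valence = LEXICON.get(word.lower(), 0.0)
        if word.isupper() and valence != 0:
            valence *= 1.5  # emphasis via ALL-CAPS
        score += valence
    # punctuation boost, pushing in the direction of the word-level score
    score += 0.3 * text.count("!") * (1 if score >= 0 else -1)
    return score

print(toy_score("great service"))      # positive
print(toy_score("GREAT service!!"))    # more positive: caps + '!' amplify
print(toy_score("terrible and slow"))  # negative
```

The real VADER also handles negations, intensifiers ("very", "kind of"), and emoji, and normalizes the summed valence to a compound score in [-1, 1].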
To build a full machine learning pipeline around VADER, we will use its sentiment scores (positive, neutral, negative) as labels and train an XGBoost classifier using TF-IDF features extracted from the message text. XGBoost is well-suited for this task because it performs efficiently with sparse, high-dimensional data and eliminates the need to hand-label messages or train a separate sentiment model from scratch.
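To sketch the feature side of this pipeline, TF-IDF weighting can be computed from scratch in a few lines of Python. In practice we would use scikit-learn's `TfidfVectorizer` (which applies a smoothed IDF) and feed the resulting sparse matrix to XGBoost; the support messages below are invented for illustration:

```python
import math
from collections import Counter

# Minimal from-scratch TF-IDF sketch. The example messages are invented.
docs = [
    "my order never arrived",
    "thanks the support team fixed my order fast",
    "worst support ever never again",
]
tokenized = [d.split() for d in docs]
n_docs = len(docs)

def idf(term):
    """Inverse document frequency: rarer terms get larger weights."""
    df = sum(term in doc for doc in tokenized)
    return math.log(n_docs / df)  # plain IDF; sklearn uses a smoothed variant

def tfidf(doc):
    """Term frequency times IDF for every term in one document."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf(t) for t, c in counts.items()}

vec = tfidf(tokenized[0])
# 'order' appears in two documents but 'arrived' in only one,
# so 'arrived' carries the larger weight in this message.
print(vec)
```

These sparse, mostly zero vectors are exactly the regime where gradient-boosted trees such as XGBoost remain efficient, which motivates the pairing described above.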
The dataset selected for this project is the “Customer Support on Twitter” dataset from Kaggle, which contains real-world support interactions between users and brands such as Apple, Amazon, and Comcast. The messages are short, informal, and emotionally expressive—closely mirroring real-world customer support scenarios—and make the dataset ideal for sentiment analysis and predictive modeling.
Natural Language Processing (NLP) has become a vital tool for understanding customer sentiment across digital platforms. A variety of approaches have been proposed in the literature, from lexicon-based models such as VADER to machine learning methods like XGBoost. This review highlights the studies that informed the methodological design of our project.
The foundation of our sentiment scoring approach is VADER, a rule-based model that excels at detecting sentiment in informal, short-form text such as tweets and chat messages (Hutto and Gilbert 2014). VADER’s robustness to capitalization, punctuation, and emoji usage makes it particularly well-suited for analyzing customer service conversations on social media.
Recent research continues to support and expand on VADER’s use. Barik and Misra (2024) evaluated an improved VADER lexicon for analyzing e-commerce reviews and emphasized its interpretability and processing speed. Chadha and Aryan (2023) also confirmed VADER’s reliability in sentiment classification tasks, noting its effectiveness in fast-paced business contexts. Youvan (2024) offered a comprehensive review of VADER’s core logic, highlighting its treatment of intensifiers, negations, and informal expressions.
To complement VADER’s labeling, we incorporate XGBoost, an efficient and scalable gradient boosting algorithm, as a supervised classifier. Lestari et al. (2025) compared XGBoost with AdaBoost for movie review classification and found XGBoost achieved higher accuracy and generalizability. Sefara and Rangata (2024) also found XGBoost to be the most effective model for classifying domain-specific tweets, outperforming Logistic Regression and SVM in both performance and efficiency. Lu and Schelle (2025) demonstrated how XGBoost could be used to extract interpretable feature importance from tweet sentiment, providing additional value for insights and decision-making.
Detail the models or algorithms used.
Justify your choices based on the problem and data.
The common non-parametric regression model is \(Y_i = m(X_i) + \varepsilon_i\), where the response \(Y_i\) is the value of an unknown regression function \(m\) at the point \(X_i\) plus a random error \(\varepsilon_i\). This definition suggests estimating \(m(x)\) by local averaging: average the \(Y_i\) whose \(X_i\) values lie near \(x\). In other words, we recover the curve through the data by letting the surrounding data points determine its height at each point. The estimation formula is given below:
\[ \hat{m}_n(x) = \sum_{i=1}^{n} W_{n,i}(x)\, Y_i \tag{1} \]Here the \(W_{n,i}(x)\) are weights that sum to one; each weight is non-negative and becomes small when \(X_i\) is far from \(x\).
Another equation:
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
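To make the local-averaging estimator concrete, the Python sketch below computes the weighted sum in Equation (1) using Gaussian kernel weights, which shrink as \(X_i\) moves away from \(x\). The data and bandwidth are invented for illustration:

```python
import math

# Local-averaging estimate of m(x): a weighted sum of the Y_i, where
# Gaussian kernel weights decay with the distance from X_i to x.
# Toy data, roughly Y = X + noise; bandwidth h is chosen arbitrarily.
X = [0.0, 0.5, 1.0, 1.5, 2.0]
Y = [0.1, 0.4, 1.1, 1.4, 2.1]

def m_hat(x, h=0.5):
    raw = [math.exp(-((xi - x) ** 2) / (2 * h ** 2)) for xi in X]
    total = sum(raw)
    weights = [r / total for r in raw]  # non-negative, sum to one
    return sum(w * y for w, y in zip(weights, Y))

print(m_hat(1.0))  # close to 1.0, pulled toward the nearby Y values
```

Points far from \(x\) receive nearly zero weight, so the estimate at each \(x\) is driven almost entirely by its neighbors — the local averaging described above.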
Describe your data sources and collection process.
Present initial findings and insights through visualizations.
Highlight unexpected patterns or anomalies.
A study was conducted to determine how…
```r
# Load Data
kable(head(murders))
```
| state | abb | region | population | total |
|---|---|---|---|---|
| Alabama | AL | South | 4779736 | 135 |
| Alaska | AK | West | 710231 | 19 |
| Arizona | AZ | West | 6392017 | 232 |
| Arkansas | AR | South | 2915918 | 93 |
| California | CA | West | 37253956 | 1257 |
| Colorado | CO | West | 5029196 | 65 |
```r
ggplot1 <- murders %>% ggplot(mapping = aes(x = population / 10^6, y = total))
ggplot1 +
  geom_point(aes(col = region), size = 4) +
  geom_text_repel(aes(label = abb)) +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(formula = "y ~ x", method = lm, se = FALSE) +
  xlab("Populations in millions (log10 scale)") +
  ylab("Total number of murders (log10 scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name = "Region") +
  theme_bw()
```
Explain your data preprocessing and cleaning steps.
Present your key findings in a clear and concise manner.
Use visuals to support your claims.
Tell a story about what the data reveals.
Summarize your key findings.
Discuss the implications of your results.