Tone Matters: Sentiment Classification of Support Tweets Using VADER and XGBoost

Samantha Chickeletti & Michael Alfrey (Advisor: Dr. Cohen)

2025-08-05

Introduction

  • Motivation and Context
  • Our Approach: VADER and XGBoost
  • Dataset Overview
  • Methodological Foundations

Motivation and Context: Why Tone Matters

In today’s digital landscape, customer support conversations increasingly take place over chat and social media platforms. These short-form exchanges are often emotionally charged and can signal a customer’s satisfaction, frustration, or potential escalation.

Why does tone matter?

Understanding the emotional tone behind these messages can:

  • Improve service quality
  • Anticipate customer needs
  • Enhance the overall customer experience

Yet, analyzing this kind of shorthand-heavy language presents a unique challenge for traditional sentiment analysis models. Our project explores whether we can automate tone detection in customer support tweets using a lightweight, interpretable machine learning pipeline.

Our Approach: VADER + XGBoost

We combined two powerful techniques:

  • VADER (Valence Aware Dictionary for sEntiment Reasoning): A lexicon-based sentiment analysis tool that can classify tone in real customer support messages (Hutto and Gilbert 2014). It prioritizes speed and interpretability, making it ideal for short, informal content like tweets. VADER’s scoring mechanism is particularly sensitive to social media features such as emojis, capitalization, and punctuation, which are often critical to conveying tone in these environments (Barik and Misra 2024).

  • XGBoost (eXtreme Gradient Boosting): To build a full machine learning pipeline around VADER, we use its sentiment scores as labels and train an XGBoost classifier on TF-IDF features extracted from the message text. XGBoost is well suited to this task because it performs efficiently on sparse, high-dimensional data, and this design eliminates the need to hand-label messages or train a separate sentiment model from scratch.

This strategy bridges rule-based interpretability with machine learning accuracy without requiring manual sentiment labels.

Dataset Overview

We used the Customer Support on Twitter dataset (Kaggle):

  • 2.8M tweets exchanged between brands (e.g., Apple, Amazon) and their customers
  • Informal, emotionally expressive language mirrors real-world chat scenarios

This dataset is ideal for sentiment analysis and predictive modeling.

Lexicon-Based Methods: Why VADER?

Designed for short, informal text like tweets, VADER is particularly effective because it incorporates key linguistic signals such as:

  • Capitalization → (e.g., “AWESOME”)

  • Punctuation → (e.g., “!!!”)

  • Emojis → (e.g., “:)”)

  • Negations → (e.g., “not good” → negative)

These elements help capture the nuanced sentiment found in customer service conversations that traditional lexicon models may often miss.
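
To make this concrete, here is a minimal sketch of how those signals move VADER’s compound score, assuming the Python vaderSentiment package; exact values depend on the lexicon version.

```python
# Demonstrating VADER's sensitivity to capitalization, punctuation,
# emoticons, and negation (pip install vaderSentiment).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Compound scores intensify with ALL-CAPS and "!!!", and flip under negation.
for text in ["good", "GOOD!!!", "not good", "great :)"]:
    score = analyzer.polarity_scores(text)["compound"]
    print(f"{text!r:>12} -> {score:+.3f}")
```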

Machine Learning Methods: Why XGBoost?

Limitations of Lexicons:

  • While VADER is powerful, it is limited to its predefined lexicon and rule set

To complement VADER’s labeling, we incorporate XGBoost, an efficient and scalable gradient boosting algorithm, as a supervised classifier.

Why XGBoost:

  • Addresses VADER’s limitations (e.g., static lexicon, rule-bound)

  • Performs well on high-dimensional sparse text (TF-IDF)

  • Allows for interpretability & scalability

Methods

  • VADER for initial sentiment labeling

  • Representing text as features with TF-IDF

  • XGBoost for supervised classification

  • Evaluation with Accuracy, Precision, Recall, and F1

VADER for initial sentiment labeling

  • Removed URLs, mentions, hashtags for cleaner input
  • Applied VADER to assign compound sentiment scores
    \[ \text{compound score} = \frac{\sum_{i=1}^{n} s_i}{\sqrt{\left(\sum_{i=1}^{n} s_i\right)^2 + \alpha}}, \quad \alpha = 15 \]
  • Labeled tweets using VADER thresholds:
    • Positive ≥ 0.05
    • Neutral between -0.05 and 0.05
    • Negative ≤ -0.05

This automated labeling process served as the backbone for our supervised classification model.
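
As a sketch of this step (assuming pandas and the vaderSentiment package; the column name and cleaning regexes are illustrative):

```python
# Minimal sketch of our cleaning + labeling step; thresholds follow the
# standard VADER convention listed above.
import re
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def clean(text: str) -> str:
    """Strip URLs, @mentions, and hashtags before scoring."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[@#]\w+", "", text)
    return text.strip()

def vader_label(text: str) -> str:
    """Map the compound score onto positive / neutral / negative."""
    c = analyzer.polarity_scores(clean(text))["compound"]
    if c >= 0.05:
        return "positive"
    if c <= -0.05:
        return "negative"
    return "neutral"

tweets = pd.DataFrame({"text": ["@support thanks, that fixed it!",
                                "Still no reply after two days..."]})
tweets["sentiment"] = tweets["text"].apply(vader_label)
print(tweets)
```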

VADER in Action

For example, a tweet reading:

“I’ve been delayed over an HOUR and STILL no response… this is ridiculous!!!”

Example VADER Scoring:

Feature          Detected Element        VADER Response                  Score Impact
---------------  ----------------------  ------------------------------  ------------
Capitalization   “HOUR”, “STILL”         Increases intensity             -0.10
Punctuation      “…” and “!!!”           Amplifies negative sentiment    -0.25
Lexicon Match    “ridiculous”            Strong negative valence         -0.25
Overall Tone     Complaint/frustration   Negative valence                -0.15


This tweet produces a Final Compound score of -0.75 and is labeled as negative.

Representing Text as Features with TF-IDF

To convert tweets into numerical features for modeling, we employ Term Frequency–Inverse Document Frequency (TF-IDF), a technique that quantifies how important each word is within the context of both the individual tweet and the overall corpus.

Term Frequency (TF) measures how often a word appears in a single tweet (i.e., a document) relative to the total number of words in that tweet:

\[ \text{TF}_{w_n} = \frac{g_{w_n}^{d_m}}{T_{d_m}} \]

Where:
\(w_n\) is the \(n^{\text{th}}\) word in document \(d_m\) (a tweet)
\(g_{w_n}^{d_m}\) is the number of times word \(w_n\) occurs in document \(d_m\)
\(T_{d_m}\) is the total number of words in document \(d_m\)

TF-IDF in Action

If the word delay appears twice in a 50-word tweet, its term frequency is:

\[ \text{TF}_{w_n} = \frac{2}{50} = 0.04 \]

Inverse Document Frequency (IDF) evaluates how unique or informative a word is across the full set of tweets. Common words receive lower IDF scores, while rare or distinctive words receive higher scores:

\[ \text{IDF}_{w_n} = \log\left(\frac{N}{N_{w_n}}\right) \]

Where:
\(N\) is the total number of documents (tweets) in the corpus
\(N_{w_n}\) is the number of documents that contain word \(w_n\)

Example:
If delay appears in 5 out of 500,000 tweets, its IDF will be much higher than that of hello, which may appear in 10,000 tweets.
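
Plugging these example counts into the formula (natural logarithm shown; library implementations differ in log base and smoothing):

\[ \text{IDF}_{\text{delay}} = \ln\left(\frac{500{,}000}{5}\right) \approx 11.51, \qquad \text{IDF}_{\text{hello}} = \ln\left(\frac{500{,}000}{10{,}000}\right) \approx 3.91 \]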

Final TF-IDF Weighting

Finally, TF-IDF combines these two metrics to weight each word by how frequently it appears in a tweet and how rare it is across the full dataset:

\[ \text{TF-IDF}_{w_n} = \text{TF}_{w_n} \times \text{IDF}_{w_n} \]

So why TF-IDF?

  • Highlights emotionally charged or unique words

  • Generates a sparse matrix ideal for boosting models

This process highlights terms that are both prominent in a tweet and distinctive across the dataset, making TF-IDF a powerful and interpretable technique for feature extraction in sentiment analysis pipelines (Barik and Misra 2024).
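
Tying the running example together, the final weights (using the TF and IDF values computed above) are:

\[ \text{TF-IDF}_{\text{delay}} = 0.04 \times 11.51 \approx 0.46, \qquad \text{TF-IDF}_{\text{hello}} = 0.04 \times 3.91 \approx 0.16 \]

So at the same term frequency, the rarer, more informative word carries roughly three times the weight.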

eXtreme Gradient Boosting (XGBoost)

XGBoost (eXtreme Gradient Boosting) is a scalable and regularized tree ensemble algorithm designed for both accuracy and efficiency. It builds an additive model by iteratively constructing decision trees that minimize a regularized objective function, which balances prediction accuracy with model simplicity.

The objective consists of two components:

  • A convex loss function that measures how well the model fits the data

  • A regularization term that penalizes overly complex trees

XGBoost Prediction Equation

Each predicted class label \(\hat{y}_i\) is computed as the sum of outputs from \(K\) trees:

\[ \hat{y}_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F} \]

Where:

  • \(x_i\): Input TF-IDF vector for tweet \(i\)

  • \(f_k(x_i)\): Prediction from \(k^{\text{th}}\) tree

  • \(\phi(x_i)\): Combined prediction across all trees

This additive formulation is the foundation of XGBoost.

To classify sentiment labels, the accumulated scores are passed through a softmax function to determine class probabilities.

Classifying a Tweet with XGBoost

Assume we use 3 boosting rounds (trees) per class.

For a new tweet \(x_i\):

  • Positive score: \(1.2 + 0.9 + 1.1 = 3.2\)
  • Neutral score: \(0.5 + 0.6 + 0.3 = 1.4\)
  • Negative score: \(0.8 + 0.7 + 0.6 = 2.1\)

Predicted label: Positive, because it has the highest total score.
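
Passing these hypothetical scores through the softmax gives the class probabilities:

\[ P(\text{Positive}) = \frac{e^{3.2}}{e^{3.2} + e^{1.4} + e^{2.1}} \approx \frac{24.5}{36.8} \approx 0.67 \]

with \(P(\text{Neutral}) \approx 0.11\) and \(P(\text{Negative}) \approx 0.22\), so Positive wins on both raw score and probability.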

Balancing Accuracy and Complexity

Objective Function:

\[ \mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \]

Regularization Term:

\[ \Omega(f_k) = \gamma T + \frac{1}{2} \lambda \| w \|^2 \]

Component Breakdown:

  • \(l(\hat{y}_i, y_i)\): Softmax loss for multiclass classification
  • \(f_k\): The \(k^{\text{th}}\) decision tree
  • \(T\): Number of leaves in the tree
  • \(w\): Vector of leaf weights
  • \(\gamma\), \(\lambda\): Regularization parameters

This structure balances model fit and complexity — helping prevent overfitting and improve generalization.
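
These symbols map directly onto hyperparameters in the xgboost library. A minimal sketch (the specific values are illustrative, not our tuned configuration):

```python
# gamma is the per-leaf complexity cost (the γT term): the minimum loss
# reduction required to add a split. reg_lambda is the L2 penalty on
# leaf weights (the (1/2)λ‖w‖² term).
from xgboost import XGBClassifier

clf = XGBClassifier(
    objective="multi:softprob",  # softmax loss over the three classes
    n_estimators=200,            # K: number of boosted trees
    max_depth=6,                 # caps tree complexity alongside gamma
    gamma=1.0,                   # γ: penalty per additional leaf
    reg_lambda=1.0,              # λ: L2 regularization on leaf weights
)
```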

Pipeline Overview

Evaluation Metrics

To understand how well our model performed, we used four core metrics:

  • Accuracy: Overall correct predictions.

  • Precision: The proportion of correct predictions among all tweets the model labeled as a given class.

  • Recall: The proportion of actual sentiment instances that were correctly identified.

  • F1 Score: The harmonic mean of precision and recall.

Class Imbalance:

These metrics were selected to account for class imbalance, which is common in sentiment datasets.

For instance, positive tweets dominate the volume, while negative tweets are more operationally important in customer service. Therefore, we paid close attention to class-specific precision and recall, especially for the negative class, which is crucial for identifying frustrated users.
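
All four metrics are available out of the box in scikit-learn; a minimal sketch, with toy stand-ins for our held-out labels and model predictions:

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy stand-ins for the held-out VADER labels and model predictions.
y_true = ["negative", "neutral", "positive", "negative", "positive"]
y_pred = ["negative", "neutral", "positive", "neutral", "positive"]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
# Per-class precision/recall/F1; the 'negative' row is the one we watch.
print(classification_report(y_true, y_pred))
```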

Analysis and Results

  • Data Exploration
  • Preprocessing & Sentiment Labeling
  • TF-IDF and Feature Vectorization
  • Sentiment Classifier
  • Overall Performance
  • Key Takeaways

Data Exploration

Dataset: Customer Support on Twitter (Kaggle)

  • 2.8M tweets exchanged between major companies and customers

  • Informal, emotionally expressive, real-world support conversations

Fields used:

  • ‘text’: Tweet content

  • ‘inbound’: Identifies if the tweet is from a customer (TRUE) or company (FALSE)

Note: Metadata like threading was available but excluded from modeling

Preprocessing & Sentiment Labeling

  • Removed: URLs, mentions, hashtags
  • Used VADER to assign compound sentiment scores
  • Sentiment Distribution (VADER Labels):
    • Positive: 51.7%
    • Neutral: 24.6%
    • Negative: 23.7%

Surprisingly, positive tweets made up the largest share, contradicting our initial expectation that most support messages would be complaints or expressions of frustration.

Tone Differences by Sentiment

  • Positive: Thank, help, happy
  • Neutral: DM, issue, Hi
  • Negative: sorry, problem, now

Figure: Word clouds by sentiment class

TF-IDF and Feature Vectorization

Preparing for Modeling

  • Cleaned again for vectorization by lowercasing all text and then removing:
    • stopwords,
    • special characters,
    • URLs,
    • mentions,
    • and hashtags
  • Applied TF-IDF (n-grams up to 2 words, 5,000 features), as sketched below
  • Resulting in a sparse matrix well-suited for XGBoost
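
A minimal sketch of this vectorization step, assuming scikit-learn (the sample tweets are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["thanks for the quick help",
        "still waiting on my order no response"]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    max_features=5000,     # keep the 5,000 strongest terms
    stop_words="english",  # drop common stopwords
    lowercase=True,        # normalize case before tokenizing
)
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per tweet
print(X.shape, "-", X.nnz, "nonzero entries")
```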

TF-IDF and Feature Vectorization

Top 20 Most Informative Terms

Insights

TF-IDF helps us identify which words are not just common, but actually important for telling tweets apart. It’s like finding the loudest voices in a crowded room.
Words like ‘dm’, ‘help’, ‘thanks’, ‘sorry’, and ‘account’ highlight the nature of support conversations—many of which are requests for assistance, apologies, or follow-ups.
These high-weighted features help the XGBoost model detect tone and intent without needing deep semantic understanding. For example, the presence of words like ‘sorry’ and ‘delay’ may signal negative sentiment, while ‘thanks’ or ‘hi’ may suggest a positive or neutral interaction.

Building Our Sentiment Classifier

  • Classifier: XGBoost (see the training sketch after this list)

  • Training data: 150,000 tweets (50k per class, stratified)

  • Test set: 562,355 tweets

  • Feature Input: TF-IDF matrix (5,000 n-grams)

  • Class Weighting: Applied to improve recall for Negative tweets

  • Objective: Softmax multiclass loss with regularization

  • Output: Class label (Positive, Neutral, Negative)
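
A condensed sketch of this training setup, assuming scikit-learn and xgboost, with a toy sparse matrix standing in for the real TF-IDF features and VADER labels:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
# Toy stand-ins: the real inputs are the 150k x 5,000 TF-IDF matrix and
# VADER labels encoded 0 = negative, 1 = neutral, 2 = positive.
X_train = sparse_random(300, 50, density=0.05, random_state=0, format="csr")
y_train = rng.integers(0, 3, size=300)

# Per-sample weights to lift recall on the negative class.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

clf = XGBClassifier(objective="multi:softprob", n_estimators=50, max_depth=6)
clf.fit(X_train, y_train, sample_weight=weights)
print(clf.predict(X_train[:5]))  # predicted class labels
```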

Overall Performance

Final Model Results

Metric     Value
---------  ------
Accuracy   77.10%
Precision  80.96%
Recall     77.10%
F1 Score   77.45%

Balanced performance, with emphasis on improving recall for negative sentiment

Class-Level Performance

Sentiment  Precision  Recall  F1 Score
---------  ---------  ------  --------
Negative   0.74       0.68    0.71
Neutral    0.62       0.95    0.75
Positive   0.93       0.73    0.82


  • High recall for Neutral (templated replies)

  • High precision for Positive

  • Improved recall for Negative from 64% → 68% using sample weighting

A key goal: identifying dissatisfaction more reliably.

Impact, Limitations, and Future Work

Business Impact

  • Flags high-risk conversations in real time
  • Tracks service quality
  • Enhances escalation workflows

Limitations:

  • Trained on Twitter, performance may vary on email or chat
  • No conversation context
  • Struggles with sarcasm
  • VADER lexicon is static

Future Work:

  • Add threading/context
  • Explore LLMs or deep learning
  • Expand to multilingual support

Conclusion

This project successfully demonstrated a scalable approach to sentiment classification in customer support conversations by combining VADER, TF-IDF, and XGBoost. VADER provided fast and interpretable sentiment labels tailored for informal social media language, which we used to train a high-performing supervised classifier.

Achieved:

  • 77% accuracy
  • 0.71 F1 Score for Negative tone
  • Fast, scalable tone detection

By automating tone detection in real-time support channels, this framework offers immediate business value. It can help teams prioritize escalations, identify service bottlenecks, and monitor agent interactions at scale. Our findings confirm that interpretable, rule-based sentiment scoring (via VADER) can be successfully integrated with machine learning to support responsive, tone-aware customer engagement.

Authorship Note: Some parts of this project were assisted by ChatGPT for writing support and citation formatting. All content was reviewed and edited by the authors to ensure accuracy and originality.

References

Barik, Kanhu, and Sanghamitra Misra. 2024. “Analysis of Customer Reviews with an Improved VADER Lexicon Classifier.” Journal of Big Data 11: 10. https://doi.org/10.1186/s40537-023-00861-x.
Hutto, C. J., and Eric Gilbert. 2014. “VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text.” In Proceedings of the International AAAI Conference on Web and Social Media, 8:216–25. 1. https://doi.org/10.1609/icwsm.v8i1.14550.