Tone Matters: Sentiment Classification of Support Tweets Using VADER and XGBoost
Slides: slides.html
Introduction
In today’s digital landscape, customer support conversations increasingly take place over chat and social media platforms. These short-form exchanges are often emotionally charged and can signal a customer’s satisfaction, frustration, or potential escalation. Understanding the emotional tone behind these messages is critical for improving service quality, anticipating customer needs, and enhancing the overall customer experience. Yet, analyzing this kind of shorthand-heavy language presents a unique challenge for traditional sentiment analysis models.
This project explores how VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon-based sentiment analysis tool, can classify tone in real customer support messages (Hutto and Gilbert 2014). Its design prioritizes speed and interpretability, making it ideal for short, informal content like tweets and chat messages. VADER's scoring mechanism is particularly sensitive to social media features such as emojis, capitalization, and punctuation, which are often critical to conveying tone in these environments (K. Barik and Misra 2024).
To build a full machine learning pipeline around VADER, we will use its sentiment scores as labels and train an XGBoost classifier using TF-IDF features extracted from the message text. XGBoost is well-suited for this task because it performs efficiently with sparse, high-dimensional data and eliminates the need to hand-label messages or train a separate sentiment model from scratch.
The dataset selected for this project is the “Customer Support on Twitter” dataset from Kaggle, which contains real-world support interactions between users and brands such as Apple, Amazon, and Comcast. The messages are short, informal, and emotionally expressive—closely mirroring real-world customer support scenarios—and make the dataset ideal for sentiment analysis and predictive modeling.
Natural Language Processing (NLP) has become a vital tool for understanding customer sentiment across digital platforms. A variety of approaches have been proposed in the literature, from lexicon-based models such as VADER to machine learning methods like XGBoost. The following literature review highlights the studies that informed and influenced the methodological design of our project.
Literature Review
Lexicon-Based Sentiment Analysis
Lexicon-based methods remain a powerful choice for analyzing short, informal messages. VADER is particularly effective because it incorporates key linguistic signals such as:
- Capitalization (e.g., "AWESOME" → increases intensity)
- Punctuation (e.g., "!" → amplifies sentiment)
- Slang, emojis, and emoticons (e.g., ":)" → amplifies sentiment)
- Negation (e.g., "not good" → polarity reversal)
These elements help capture the nuanced sentiment found in customer service conversations that traditional lexicon models often miss.
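As a quick illustration of these signals, the sketch below scores a few variants of the same phrase with the vaderSentiment package (assuming `pip install vaderSentiment`); the exact values depend on the lexicon version, but the relative ordering reflects the rules listed above.

```python
# A minimal sketch of VADER's sensitivity to the signals listed above.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
texts = [
    "the service was good",
    "the service was GOOD",      # capitalization boosts intensity
    "the service was good!!!",   # punctuation amplifies sentiment
    "the service was good :)",   # emoticon adds a positive signal
    "the service was not good",  # negation flips polarity
]
for text in texts:
    # polarity_scores returns neg/neu/pos proportions plus a compound score
    score = analyzer.polarity_scores(text)["compound"]
    print(f"{text!r}: {score:+.3f}")
```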
Recent research continues to support and expand on VADER’s use. Barik and Misra (K. Barik and Misra 2024) evaluated an improved VADER lexicon in analyzing e-commerce reviews and emphasized its interpretability and processing speed. Chadha and Aryan (Chadha and Aryan 2023) also confirmed VADER’s reliability in sentiment classification tasks, noting its effectiveness in fast-paced business contexts. Youvan (Youvan 2024) offered a comprehensive review of VADER’s core logic, highlighting its treatment of intensifiers, negations, and informal expressions. Together, these studies reinforced our decision to use VADER as the foundation for our sentiment labeling.
Machine Learning for Sentiment Classification
While VADER is powerful, it is limited to its predefined lexicon and rule set. To complement VADER's labeling, we incorporate XGBoost, an efficient and scalable gradient boosting algorithm, as a supervised classifier. Lestari et al. (Lestari et al. 2025) compared XGBoost with AdaBoost for movie review classification and found XGBoost achieved higher accuracy and generalizability. Sefara and Rangata (Sefara and Rangata 2024) also found XGBoost to be the most effective model for classifying domain-specific tweets, outperforming Logistic Regression and SVM in both performance and efficiency. Lu and Schelle (Lu and Schelle 2025) demonstrated how XGBoost could be used to extract interpretable feature importance from tweet sentiment, providing a compelling case for our approach. With these foundations in place, we now detail the methodology that guided our implementation.
Methods
Pipeline illustration
Preprocessing and Sentiment Labeling with VADER
Before applying VADER, our process began by cleaning the raw tweet text to ensure consistency. We removed URLs, user mentions, and hashtags. While VADER can handle informal text, this step was performed to improve text uniformity and prepare for downstream modeling. After cleaning, we applied VADER to generate a compound sentiment score for each tweet and label tweets as Positive, Neutral, or Negative based on standardized thresholds. The compound sentiment score is computed as:
\[ \text{compound score} = \frac{\sum_{i=1}^{n} s_i}{\sqrt{\left(\sum_{i=1}^{n} s_i\right)^2 + \alpha}} \]
Where \(s_i\) is the sentiment score for each word or token and \(\alpha\) is a normalization constant (typically set to 15).
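For example, if the token valences in a tweet sum to \(\sum_{i} s_i = 1.6\), the compound score is \(1.6/\sqrt{1.6^2 + 15} \approx 0.38\).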
This gives us a single score between -1 and +1 that reflects the overall sentiment of the message. The final sentiment labels are then assigned using the following thresholds:
- Positive if compound ≥ 0.05
- Neutral if -0.05 < compound < 0.05
- Negative if compound ≤ -0.05
This automated labeling process served as the backbone for our supervised classification model.
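A minimal sketch of this cleaning-and-labeling step is shown below, assuming the vaderSentiment package; the regular expressions and function names are ours for illustration, not part of the dataset or library.

```python
# A sketch of the preprocessing and VADER labeling described above.
import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def clean(text: str) -> str:
    # Remove URLs, @mentions, and #hashtags before scoring
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[@#]\w+", "", text)
    return text.strip()

def vader_label(text: str) -> str:
    # Apply the standard compound-score thresholds
    compound = analyzer.polarity_scores(clean(text))["compound"]
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"

print(vader_label("Thanks so much, you fixed it! :)"))  # -> Positive
```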
To illustrate how VADER works in action, here’s an example tweet:
“I’ve been delayed over an HOUR and STILL no response… this is ridiculous!!!”
Example VADER Scoring:
| Feature | Detected Element | VADER Response | Score Impact |
|---|---|---|---|
| Capitalization | "HOUR", "STILL" | Increases intensity | -0.10 |
| Punctuation | "…" and "!!!" | Amplifies negative sentiment | -0.25 |
| Lexicon Match | "ridiculous" | Strong negative valence | -0.25 |
| Overall Tone | Complaint/frustration | Strongly negative | -0.15 |
| Final Compound | | | -0.75 |
Notice how VADER captures multiple tone signals in this short message. The tweet produces a compound score of -0.75, which clearly crosses the negative threshold, so VADER labels it Negative.
By relying on VADER instead of manual annotation, we create a foundation for downstream supervised learning. This aligns with findings by Lu (2025), who demonstrated that VADER-labeled tweets combined with TF-IDF and XGBoost achieved performance comparable to manually labeled datasets (Lu and Schelle 2025). Next, we turn to feature extraction to transform our labeled text into a numerical form suitable for machine learning.
Term Frequency–Inverse Document Frequency (TF-IDF)
To convert tweets into numerical features for modeling, we employ Term Frequency–Inverse Document Frequency (TF-IDF), a technique that quantifies how important each word is within the context of both the individual tweet and the overall corpus.
Term Frequency (TF) measures how often a word appears in a single tweet (treated as a document, \(d_m\)) relative to the total number of words in that tweet:
\[ \text{TF}_{w_n} = \frac{g_{w_n}^{d_m}}{T_{d_m}} \]
Where:
• \(w_n\) is the \(n^{\text{th}}\) word in document \(d_m\) (a tweet)
• \(g_{w_n}^{d_m}\) is the number of times word \(w_n\) occurs in document \(d_m\)
• \(T_{d_m}\) is the total number of words in document \(d_m\)
Example:
If the word delay appears twice in a 50-word tweet, its term frequency is:
\[ \text{TF}_{w_n} = \frac{2}{50} = 0.04 \]
Inverse Document Frequency (IDF) evaluates how unique or informative a word is across the full set of tweets. Common words receive lower IDF scores, while rare or distinctive words receive higher scores:
\[ \text{IDF}_{w_n} = \log\left(\frac{M}{N_{w_n}}\right) \]
Where:
• \(M\) is the total number of tweets (documents) in the corpus
• \(N_{w_n}\) is the number of tweets that contain word \(w_n\)
Example:
If delay appears in 5 out of 500,000 tweets, its IDF will be much higher than that of hello, which may appear in 10,000 tweets.
Finally, TF-IDF combines these two metrics to weight each word by how frequently it appears in a tweet and how rare it is across the full dataset:
\[ \text{TF-IDF}_{w_n} = \text{TF}_{w_n} \times \text{IDF}_{w_n} \]
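Continuing the example above: delay has TF = 0.04, and with a 500,000-tweet corpus in which it appears in 5 tweets, IDF \(= \log(500{,}000/5) \approx 11.5\) (taking the natural logarithm), giving a TF-IDF weight of about \(0.04 \times 11.5 \approx 0.46\).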
This process highlights terms that are both prominent in a tweet and distinctive across the dataset, making TF-IDF a powerful and interpretable technique for feature extraction in sentiment analysis pipelines (K. Barik and Misra 2024). With our feature matrix ready, we proceeded to modeling.
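As a sketch of this step, the snippet below uses scikit-learn's TfidfVectorizer with the settings reported later in this paper (uni- and bigrams, a 5,000-feature cap); the two sample tweets stand in for the cleaned corpus.

```python
# A minimal TF-IDF vectorization sketch using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["my order is delayed again", "thank you for the quick help"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000,
                             stop_words="english", lowercase=True)
X = vectorizer.fit_transform(tweets)  # sparse (n_tweets, n_features) matrix
print(X.shape, vectorizer.get_feature_names_out()[:5])
```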
eXtreme Gradient Boosting (XGBoost)
To model sentiment classifications based on TF-IDF features, we employ XGBoost (eXtreme Gradient Boosting), a scalable and regularized tree ensemble algorithm designed for both accuracy and efficiency. XGBoost builds an additive model by iteratively constructing decision trees that minimize a regularized objective function, which balances prediction accuracy with model simplicity. The objective consists of two components: a convex loss function that measures how well the model fits the data, and a regularization term that penalizes overly complex trees.
Each predicted class label \(\hat{y}_i\) (positive, neutral, negative) is computed as the sum of outputs from \(K\) trees:
\[
\hat{y}_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F}
\]
Where:
• \(x_i\): The input TF-IDF vector for tweet \(i\)
• \(f_k(x_i)\): The prediction from the \(k^\text{th}\) tree for input \(x_i\)
• \(\sum_{k=1}^K f_k(x_i)\): The sum of predictions for each class
• \(\phi(x_i)\): The combined prediction from all trees
This formula is foundational to XGBoost. It expresses how the final prediction is built up iteratively from multiple decision trees, which is the basis of boosting. When classifying sentiment labels, the accumulated scores are passed through a softmax function to determine class probabilities.
Example:
Suppose we are using XGBoost to classify the sentiment of a tweet as positive, neutral, or negative, and the model has been trained with \(K\) = 3 boosting rounds (trees) per class.
For a new input tweet \(x_i,\) each of the 3 trees for each class outputs a score which is then summed for each class:
• Positive class score: 1.2 + 0.9 + 1.1 = 3.2
• Neutral class score: 0.5 + 0.6 + 0.3 = 1.4
• Negative class score: 0.8 + 0.7 + 0.6 = 2.1
Since the positive class has the highest total score, the model assigns the label positive.
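To make the final step concrete, the short snippet below converts these illustrative scores into class probabilities with softmax (the numbers come from the example above, not from a trained model).

```python
# Softmax over the example's accumulated class scores.
import numpy as np

scores = np.array([3.2, 1.4, 2.1])         # positive, neutral, negative
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.round(3))                       # -> [0.667 0.11 0.222] (approx)
```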
Once prediction scores are computed, XGBoost must also learn how to improve its predictions by finding optimal tree structures. This is done by minimizing the regularized objective function, which balances prediction accuracy and model complexity:
\[
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)
\] \[
\text{where }\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2
\]
• \(l(\hat{y}_i, y_i)\) is a differentiable convex loss function (softmax loss for multiclass classification), measuring how far the model's prediction \(\hat{y}_i\) is from the true label \(y_i\),
• \(f_k\) is the \(k^\text{th}\) decision tree in the ensemble,
• \(T\): the number of leaves on a tree,
• \(w\): the vector of leaf scores (weights),
• \(\gamma\) and \(\lambda\): regularization parameters that control tree complexity.
Therefore, by combining a strong predictive loss with a tree-specific complexity penalty, XGBoost is able to generalize well to new data, outperforming simpler models while remaining computationally efficient (Chen and Guestrin 2016). It also provides feature importance scores, offering insights into which terms most influence predictions—a valuable asset for customer service teams seeking actionable feedback. Now with the model trained, we evaluated its performance using several classification metrics.
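A minimal sketch of this setup is shown below, using xgboost's scikit-learn API; the hyperparameters and the random placeholder data are illustrative, not the values used in this study.

```python
# An illustrative XGBoost multiclass setup; X stands in for the TF-IDF
# matrix and y for encoded labels (0=Negative, 1=Neutral, 2=Positive).
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 20))               # placeholder features
y = rng.integers(0, 3, size=60)        # placeholder labels

model = XGBClassifier(objective="multi:softprob", n_estimators=50,
                      max_depth=4, learning_rate=0.1)
model.fit(X, y)
print(model.predict_proba(X[:2]))      # softmax class probabilities
print(model.feature_importances_[:5])  # per-feature importance scores
```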
Evaluation Metrics
To understand how well our model performed, we used four core metrics:
- Accuracy: The proportion of correct predictions.
- Precision: The proportion of correct predictions among all tweets the model labeled as a given class.
- Recall: The proportion of actual sentiment instances that were correctly identified.
- F1 Score: The harmonic mean of precision and recall.
These metrics were selected to account for class imbalance, which is common in sentiment datasets. For instance, positive tweets dominate our dataset, but negative tweets are more operationally important in customer service. Therefore, we paid close attention to class-specific precision and recall, especially for the negative class, to ensure that frustrated customer messages were identified without over-triggering on neutral ones (K. Barik and Misra 2024; Gandy and Smith 2025).
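A sketch of how these metrics can be computed with scikit-learn is shown below; the tiny label lists are stand-ins for the actual test labels and model predictions.

```python
# Per-class and overall metrics with scikit-learn (illustrative inputs).
from sklearn.metrics import classification_report

y_true = ["Negative", "Neutral", "Positive", "Negative", "Positive"]
y_pred = ["Negative", "Neutral", "Positive", "Neutral", "Positive"]
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
```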
Analysis and Results
Data Exploration and Visualization
Following the development of our preprocessing and modeling pipeline, we analyzed how our system performed and what the data revealed when applied to real-world support tweets. Our goal was to build a model that can classify the sentiment of a support tweet using its text. We focused on this problem because understanding tone in real time could help companies:
- Prioritize angry customers
- Track service tone trends over time
- Alert managers to potential PR issues
To conduct this analysis, we used the Customer Support on Twitter dataset from Kaggle, which contains over 2.8 million tweets exchanged between customers and major companies such as Apple, Amazon, and Comcast. Each record includes:
- tweet_id (unique identifier)
- author_id (sender identifier)
- created_at (timestamp)
- text (tweet content)
- inbound (TRUE if sent by a customer, FALSE if sent by a company)
- response_tweet_id and in_response_to_tweet_id (for tracking conversation threads)
Data Overview
Initial Data Exploration and Cleaning
A review of the dataset revealed that while core fields like text, created_at, and inbound had no missing values, over one million records were missing response_tweet_id, and over 700,000 were missing in_response_to_tweet_id. These fields were not relevant to the sentiment classification task, so they were excluded from modeling.
For preprocessing, the tweet text was cleaned by removing URLs, mentions, and hashtags. The tweets were then labeled using the VADER sentiment analyzer, which assigns a compound score ranging from -1 (most negative) to +1 (most positive).
Sentiment Distribution
After labeling, we used the VADER thresholds to categorize each tweet as Positive, Neutral, or Negative. The resulting sentiment distribution was:
- Positive: 51.7%
- Neutral: 24.6%
- Negative: 23.7%
Interestingly, we found a high percentage of positive tweets, contradicting our initial expectation that most support messages would express complaints or frustration. Upon further review, it appears that many customers tweet to thank support agents after resolution, and that predefined company responses ("Please send us a DM") contribute a more neutral or positive tone due to their polite structure, as shown in the figure below.
Tone Differences by Sentiment
To explore vocabulary trends across sentiment classes, we generated word clouds for each category.
- Positive tweets frequently used terms such as “thank”, “help”, “please”.
- Neutral tweets often included formal language or template replies like “DM” or “issue”.
- Negative tweets featured words such as “sorry,” “problem,” “now”.
These linguistic cues confirmed that users employ strong tonal indicators in emotionally charged messages, while brands rely on consistent phrasing in templated outbound communication.
Modeling Pipeline Performance
After labeling each tweet with a sentiment category using VADER, we prepared the dataset for supervised classification by extracting numerical features with TF-IDF. To tokenize and vectorize the text, we preprocessed it again, this time removing all special characters, lowercasing the text, and removing stop words in addition to what was already removed for VADER. We used TF-IDF with n-grams (up to two words) and a vocabulary capped at 5,000 features to transform the text into sparse numeric vectors. The resulting TF-IDF matrix was sparse and high-dimensional, well suited to XGBoost, our classifier of choice. Sentiment labels were then encoded, and we trained XGBoost on a stratified, balanced sample of 150,000 tweets (50,000 per class) to address class imbalance.
Top 20 Most Informative Terms
We split the dataset into 80% training and 20% testing subsets, maintaining class proportions with stratification. To improve recall on Negative tweets (a minority class), we applied sample weighting based on class distribution.
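The snippet below sketches the stratified split and class-based sample weighting described above, assuming scikit-learn helpers; `X`, `y`, and `model` are stand-ins carried over from the earlier sketches.

```python
# Stratified 80/20 split plus balanced sample weights (illustrative;
# X, y, and model are as defined in the earlier sketches).
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Weight each training example inversely to its class frequency
weights = compute_sample_weight(class_weight="balanced", y=y_train)
model.fit(X_train, y_train, sample_weight=weights)
```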
Final Results:
The final model was evaluated on a test set of 562,355 tweets. The table below summarizes the overall classification performance:
| Metric | Value |
|---|---|
| Accuracy | 77.10% |
| Precision | 80.96% |
| Recall | 77.10% |
| F1-Score | 77.45% |
These results reflect strong generalization to unseen data, with high overall precision and a balanced trade-off between recall and accuracy.
Per-Class Performance:
| Sentiment | Precision | Recall | F1-Score |
|---|---|---|---|
| Negative | 0.74 | 0.68 | 0.71 |
| Neutral | 0.62 | 0.95 | 0.75 |
| Positive | 0.93 | 0.73 | 0.82 |
At the class level, the model performed robustly across all three sentiment classes, with especially high recall for Neutral tweets, likely driven by their templated and predictable language patterns. Positive tweets achieved the highest precision, indicating strong classifier confidence when identifying praise, resolution acknowledgments, or gratitude.
Most notably, recall on the Negative class improved from 64% to 68% following the application of class weighting during model training. This was a key design objective, as correctly flagging negative sentiment is vital for identifying dissatisfied customers and prioritizing intervention. Although this shift came with a marginal decrease in overall accuracy, it resulted in better class balance, improving the model’s practical utility in a customer support context.
Confusion Matrix Analysis
The confusion matrix shows that the model most accurately classified positive sentiment, the majority class in the dataset. However, it also achieved reasonable precision and recall for the Negative class, confirming its effectiveness at detecting frustration and dissatisfaction, both critical signals in operational environments.
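For reference, a confusion matrix like the one analyzed here can be computed as sketched below; the short label lists are stand-ins for the held-out labels and model predictions.

```python
# Confusion matrix over the three sentiment classes (illustrative inputs).
from sklearn.metrics import confusion_matrix

y_test = [0, 1, 2, 2, 0]               # 0=Negative, 1=Neutral, 2=Positive
y_pred = [0, 1, 2, 1, 0]
print(confusion_matrix(y_test, y_pred, labels=[0, 1, 2]))
```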
Conclusion
This study demonstrates the effectiveness of combining rule-based sentiment scoring (VADER) with supervised machine learning techniques (TF-IDF vectorization and XGBoost) to build a robust, interpretable, and scalable sentiment classification model tailored to informal customer support conversations on social media. VADER enabled fast sentiment labeling without manual annotation, serving as the foundation for model supervision. By converting tweet text into high-dimensional, sparse feature vectors with TF-IDF and training an XGBoost classifier, we achieved high performance across evaluation metrics: 77.1% accuracy, 80.9% precision, and a 77.4% F1-score. Notably, the model improved recall on the Negative class after applying class weights, enhancing its ability to flag frustrated customers, which is critical in customer support, where failing to flag dissatisfaction can be costly.
Implications
By automating tone detection in real-time support channels, this framework offers immediate business value by enabling companies to:
- Automatically flag high-risk interactions for escalation or human intervention.
- Support quality monitoring, performance reviews, and workflow improvements based on interaction tone over time.
Limitations / Future Work
Despite its strengths, the system has a few limitations:
- Ambiguity and Sarcasm: Tweets lacking explicit emotional language or containing sarcasm can confuse both VADER and XGBoost.
- Context Awareness: The model analyzes each tweet in isolation, without conversation context or threading.
- Static Lexicon: VADER does not learn from context, restricting its adaptability to evolving internet slang or brand-specific interactions without manual updates.
To address these challenges, future enhancements could include:
- Handling sarcasm or ambiguous sentiment using deep learning models.
- Incorporating metadata (e.g., brand, time of day, response time).
- Expanding to multilingual support for broader application.
Together, these enhancements would further elevate this framework into an actionable solution for tone detection and sentiment-driven customer experience optimization.
Note: Some parts of this project were assisted by ChatGPT for writing support and citation formatting. All content was reviewed and edited by the authors to ensure accuracy and originality.