2025-08-05
In today’s digital landscape, customer support conversations increasingly take place over chat and social media platforms. These short-form exchanges are often emotionally charged and can signal a customer’s satisfaction, frustration, or potential escalation.
Why does tone matter?
Understanding the emotional tone behind these messages can:
Yet, analyzing this kind of shorthand-heavy language presents a unique challenge for traditional sentiment analysis models. Our project explores whether we can automate tone detection in customer support tweets using a lightweight, interpretable machine learning pipeline.
We combined two powerful techniques:
VADER (Valence Aware Dictionary and sEntiment Reasoner): A lexicon-based sentiment analysis tool that we use to classify tone in real customer support messages (Hutto and Gilbert 2014). It prioritizes speed and interpretability, making it well suited to short, informal content like tweets. VADER’s scoring mechanism is particularly sensitive to social media features such as emojis, capitalization, and punctuation, which are often critical to conveying tone in these environments (Barik and Misra 2024).
XGBoost (eXtreme Gradient Boosting): To build a full machine learning pipeline around VADER, we use its sentiment scores as labels and train an XGBoost classifier on TF-IDF features extracted from the message text. XGBoost is well suited for this task because it performs efficiently with sparse, high-dimensional data, and using VADER’s labels eliminates the need to hand-label messages or train a separate sentiment model from scratch.
This strategy bridges rule-based interpretability with machine learning accuracy without requiring manual sentiment labels.
We used the Customer Support on Twitter dataset (Kaggle):
This dataset is ideal for sentiment analysis and predictive modeling.
Designed for short, informal text like tweets, VADER is particularly effective because it incorporates key linguistic signals such as:
Capitalization -> (e.g., “AWESOME”)
Punctuation -> (e.g., “!!!”)
Emojis -> (e.g., “:)”)
Negations -> (e.g., “not good” → negative)
These elements help capture the nuanced sentiment in customer service conversations that traditional lexicon models often miss.
Limitations of Lexicons:
Why XGBoost:
Addresses VADER’s limitations (e.g., static lexicon, rule-bound)
Performs well on high-dimensional sparse text (TF-IDF)
Allows for interpretability & scalability
- VADER for initial sentiment labeling
- TF-IDF for representing text as features
- XGBoost for supervised classification
- Accuracy, precision, recall, and F1 for evaluation
This automated labeling process served as the backbone for our supervised classification model.
For example, a tweet reading:
“I’ve been delayed over an HOUR and STILL no response… this is ridiculous!!!”
Feature | Detected Element | VADER Response | Score Impact |
---|---|---|---|
Capitalization | “HOUR”, “STILL” | Increases intensity | -0.10 |
Punctuation | “…” and “!!!” | Amplifies negative sentiment | -0.25 |
Lexicon Match | “ridiculous” | Strong negative valence | -0.25 |
Overall Tone | Complaint/frustration | Negative valence | -0.15 |
This tweet produces a Final Compound score of -0.75 and is labeled as negative.
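The step from compound score to label can be sketched as a simple threshold rule. The source does not state the cutoffs used, so the ±0.05 threshold below is an assumption (it is the cutoff commonly suggested in VADER's own documentation), and `label_from_compound` is a hypothetical helper name.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a three-way sentiment label.

    The +/-0.05 threshold is an assumed convention, not taken from the project.
    """
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# The example tweet above has a compound score of -0.75.
print(label_from_compound(-0.75))  # Negative
```

Scores between the two cutoffs fall into the Neutral class, which is one plausible reason templated, low-emotion replies end up there.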
To convert tweets into numerical features for modeling, we employ Term Frequency–Inverse Document Frequency (TF-IDF), a technique that quantifies how important each word is within the context of both the individual tweet and the overall corpus.
Term Frequency (TF) measures how often a word appears in a single tweet (i.e., domain) relative to the total number of words in that tweet:
\[ \text{TF}_{w_n} = \frac{g_{w_n}^{d_m}}{T_{d_m}} \]
Where:
• \(w_n\) is the \(n^{\text{th}}\) word in domain \(d_m\) (a tweet)
• \(g_{w_n}^{d_m}\) is the number of times word \(w_n\) occurs in domain \(d_m\)
• \(T_{d_m}\) is the total number of words in domain \(d_m\)
If the word delay appears twice in a 50-word tweet, its term frequency is:
\[ \text{TF}_{w_n} = \frac{2}{50} = 0.04 \]
Inverse Document Frequency (IDF) evaluates how unique or informative a word is across the full set of tweets. Common words receive lower IDF scores, while rare or distinctive words receive higher scores:
\[ \text{IDF}_{w_n} = \log\left(\frac{D}{N_{w_n}}\right) \]
Where:
• \(D\) is the total number of documents (tweets) in the corpus
• \(N_{w_n}\) is the number of documents that contain word \(w_n\)
Example:
If delay appears in 5 out of 500,000 tweets, its IDF will be much higher than that of hello, which may appear in 10,000 tweets.
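To make the comparison concrete, the two IDF values can be computed directly as the log of total tweets over tweets containing the word. The natural logarithm is an assumption here, since the text does not specify a base; a different base rescales both values but keeps their ordering.

```python
import math

total_tweets = 500_000

# IDF = log(total number of tweets / number of tweets containing the word)
idf_delay = math.log(total_tweets / 5)       # rare word
idf_hello = math.log(total_tweets / 10_000)  # common word

print(round(idf_delay, 2), round(idf_hello, 2))  # 11.51 3.91
```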
Finally, TF-IDF combines these two metrics to weight each word by how frequently it appears in a tweet and how rare it is across the full dataset:
\[ \text{TF-IDF}_{w_n} = \text{TF}_{w_n} \times \text{IDF}_{w_n} \]
So why TF-IDF?
Emotionally charged or unique words
Generates a sparse matrix ideal for boosting models
This process highlights terms that are both prominent in a tweet and distinctive across the dataset, making TF-IDF a powerful and interpretable technique for feature extraction in sentiment analysis pipelines (Barik and Misra 2024).
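The definitions above can be sketched in a few lines of plain Python. The three tweets are made-up examples, and the code uses the unsmoothed formulas exactly as written in this section; library implementations such as scikit-learn's `TfidfVectorizer` apply IDF smoothing and vector normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, tokenized by whitespace for simplicity.
tweets = [
    "flight delay again no response",
    "thanks for the quick help",
    "still waiting on my refund thanks",
]
docs = [t.split() for t in tweets]

# Document frequency: number of tweets that contain each word at least once.
doc_freq = Counter(word for doc in docs for word in set(doc))


def tf_idf(doc):
    """Return {word: TF * IDF} for one tokenized tweet."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(len(docs) / doc_freq[word])
        for word, count in counts.items()
    }


weights = tf_idf(docs[0])
print(round(weights["delay"], 3))
```

In the second tweet, "thanks" receives a lower weight than "quick" because it also appears in the third tweet, which is the down-weighting of common words described above.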
XGBoost (eXtreme Gradient Boosting) is a scalable, regularized tree ensemble algorithm designed for both accuracy and efficiency. It builds an additive model by iteratively constructing decision trees that minimize a regularized objective function, which balances prediction accuracy with model simplicity.
The objective consists of two components:
A convex loss function that measures how well the model fits the data
A regularization term that penalizes overly complex trees
For each class, the raw prediction score \(\hat{y}_i\) is computed as the sum of outputs from \(K\) trees:
\[ \hat{y}_i = \phi(x_i) = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F} \]
Where:
\(x_i\): Input TF-IDF vector for tweet \(i\)
\(f_k(x_i)\): Prediction from \(k^{\text{th}}\) tree
\(\phi(x_i)\): Combined prediction
This formula is foundational to XGBoost.
When classifying sentiment, the accumulated scores are passed through a softmax function to determine class probabilities.
Assume we use 3 boosting rounds (trees) per class.
For a new tweet \(x_i\):
Predicted label: Positive, because it has the highest total score.
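The additive scoring and softmax step can be illustrated with a small sketch. The per-tree outputs below are hypothetical numbers chosen only so that Positive ends up with the highest total; they are not values from the trained model.

```python
import math

# Hypothetical per-tree outputs f_k(x_i) for one tweet: three trees per class.
tree_outputs = {
    "Negative": [-0.4, -0.2, -0.1],
    "Neutral": [0.1, 0.0, 0.1],
    "Positive": [0.5, 0.3, 0.4],
}

# Raw class score: sum of the outputs of that class's trees.
scores = {label: sum(outs) for label, outs in tree_outputs.items()}

# Softmax turns raw scores into probabilities (max is subtracted for stability).
max_score = max(scores.values())
exp_scores = {label: math.exp(s - max_score) for label, s in scores.items()}
total = sum(exp_scores.values())
probabilities = {label: e / total for label, e in exp_scores.items()}

predicted = max(probabilities, key=probabilities.get)
print(predicted)  # Positive
```

Because softmax preserves ordering, the class with the highest summed score is always the class with the highest probability.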
Objective Function:
\[ \mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \]
Regularization Term:
\[ \Omega(f_k) = \gamma T + \frac{1}{2} \lambda \| w \|^2 \]
Component Breakdown:
This structure balances model fit and complexity — helping prevent overfitting and improve generalization.
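As a worked instance of the regularization term, the sketch below evaluates \(\Omega(f_k)\) for a single tree, reading \(T\) as the number of leaves and \(w\) as the vector of leaf weights, which is the standard XGBoost definition. The leaf weights and the values of \(\gamma\) and \(\lambda\) are illustrative assumptions, and `tree_penalty` is a hypothetical helper name.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    num_leaves = len(leaf_weights)                 # T
    squared_norm = sum(w * w for w in leaf_weights)  # ||w||^2
    return gamma * num_leaves + 0.5 * lam * squared_norm


# Illustrative tree with three leaves.
penalty = tree_penalty([0.5, -0.3, 0.2], gamma=1.0, lam=1.0)
print(round(penalty, 2))  # 3.19
```

Raising `gamma` makes every additional leaf more expensive, while raising `lam` shrinks large leaf weights; both push the model toward simpler trees.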
To understand how well our model performed, we used four core metrics:
Accuracy: Overall correct predictions.
Precision: The proportion of correct predictions among all tweets the model labeled as a given class.
Recall: The proportion of actual sentiment instances that were correctly identified.
F1 Score: The harmonic mean of precision and recall.
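The four metrics can be computed from true and predicted labels as in the sketch below. The six labels are toy data for illustration only, and `per_class_metrics` is a hypothetical helper, not the evaluation code used in the project.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, not project data.
y_true = ["Negative", "Negative", "Neutral", "Positive", "Positive", "Negative"]
y_pred = ["Negative", "Neutral", "Neutral", "Positive", "Negative", "Negative"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
neg_precision, neg_recall, neg_f1 = per_class_metrics(y_true, y_pred, "Negative")
print(round(accuracy, 2), round(neg_precision, 2), round(neg_recall, 2))
```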
Class Imbalance:
These metrics were selected to account for class imbalance, which is common in sentiment data sets.
For instance, positive tweets dominate the volume, while negative tweets are more operationally important in customer service. We therefore paid close attention to class-specific precision and recall, especially for the negative class, which is crucial for identifying frustrated users.
Dataset: Customer Support on Twitter (Kaggle)
2.8M tweets exchanged between major companies and customers
Informal, emotionally expressive, real-world support conversations
Fields used:
‘text’: Tweet content
‘inbound’: Identifies if the tweet is from a customer (TRUE) or company (FALSE)
Note: Metadata like threading was available but excluded from modeling
Surprisingly, we found a high percentage of positive tweets, contradicting our initial expectation that most support messages would express complaints or frustration.
Figure: Word clouds by sentiment class
TF-IDF helps us identify which words are not just common, but actually important for telling tweets apart. It’s like finding the loudest voices in a crowded room.
Words like ‘dm’, ‘help’, ‘thanks’, ‘sorry’, and ‘account’ highlight the nature of support conversations—many of which are requests for assistance, apologies, or follow-ups.
These high-weighted features help the XGBoost model detect tone and intent without needing deep semantic understanding. For example, the presence of words like ‘sorry’ and ‘delay’ may signal negative sentiment, while ‘thanks’ or ‘hi’ may suggest a positive or neutral interaction.
Classifier: XGBoost
Training data: 150,000 tweets (50k per class, stratified)
Test set: 562,355 tweets
Feature Input: TF-IDF matrix (5,000 n-grams)
Class Weighting: Applied to improve recall for Negative tweets
Objective: Softmax multiclass loss with regularization
Output: Class label (Positive, Neutral, Negative)
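One way the setup above might map onto library settings is sketched below as plain configuration dictionaries. Only the 5,000-feature limit, the three classes, and the softmax objective come from the list above; the n-gram range is an assumption, and the `gamma` and `lambda` values shown are XGBoost's defaults rather than the values used in the project.

```python
# Settings intended for scikit-learn's TfidfVectorizer.
tfidf_settings = {
    "max_features": 5000,    # from the setup above: 5,000 n-gram features
    "ngram_range": (1, 2),   # assumption: unigrams and bigrams
}

# Parameters intended for XGBoost's native training API.
xgb_params = {
    "objective": "multi:softprob",  # softmax objective returning class probabilities
    "num_class": 3,                 # Positive, Neutral, Negative
    "gamma": 0.0,                   # library default; project value not stated
    "lambda": 1.0,                  # library default; project value not stated
}

# Integer encoding of the output classes (the ordering is an assumption).
class_ids = {"Negative": 0, "Neutral": 1, "Positive": 2}
```

Class weighting for the Negative class would be supplied separately as per-sample weights at training time.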
Metric | Value |
---|---|
Accuracy | 77.1% |
Precision | 80.96% |
Recall | 77.1% |
F1 Score | 77.45% |
Balanced performance, with emphasis on improving recall for negative sentiment
Sentiment | Precision | Recall | F1 Score |
---|---|---|---|
Negative | 0.74 | 0.68 | 0.71 |
Neutral | 0.62 | 0.95 | 0.75 |
Positive | 0.93 | 0.73 | 0.82 |
High recall for Neutral (templated replies)
High precision for Positive
Improved recall for Negative from 64% → 68%
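As a quick consistency check, the F1 scores in the per-class table can be recomputed from the reported precision and recall using the harmonic-mean definition. The numbers below are copied from the table; only the variable names are introduced here.

```python
# (precision, recall, reported F1) per class, copied from the table above.
reported = {
    "Negative": (0.74, 0.68, 0.71),
    "Neutral": (0.62, 0.95, 0.75),
    "Positive": (0.93, 0.73, 0.82),
}

recomputed = {}
for label, (precision, recall, f1_reported) in reported.items():
    # F1 is the harmonic mean of precision and recall.
    recomputed[label] = 2 * precision * recall / (precision + recall)
    print(f"{label}: recomputed F1 = {recomputed[label]:.2f}, reported = {f1_reported:.2f}")
```

All three recomputed values agree with the reported F1 scores to two decimal places.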
A key goal: identifying dissatisfaction more reliably.
Recall on Negative tweets improved from 64% → 68% using sample weighting
Business Impact
Limitations:
Future Work:
This project successfully demonstrated a scalable approach to sentiment classification in customer support conversations by combining VADER, TF-IDF, and XGBoost. VADER provided fast and interpretable sentiment labels tailored for informal social media language, which we used to train a high-performing supervised classifier.
Achieved:
By automating tone detection in real-time support channels, this framework offers immediate business value. It can help teams prioritize escalations, identify service bottlenecks, and monitor agent interactions at scale. Our findings confirm that interpretable, rule-based sentiment scoring (via VADER) can be successfully integrated with machine learning to support responsive, tone-aware customer engagement.