QMSSGR5074 - Project 3¶
Description¶
Part 1 – Data Ingestion & Preprocessing¶
Data Loading
- Acquire the Stanford Sentiment Treebank dataset.
- Split into training, validation and test sets with stratified sampling to preserve class balance.
- Clearly document your splitting strategy and resulting dataset sizes.
Text Cleaning & Tokenization
- Implement a reusable preprocessing pipeline that handles at least:
- HTML removal, lowercasing, punctuation stripping
- Vocabulary pruning (e.g., rare words threshold)
- Tokenization (character- or word-level)
- Expose this as a function/class so it can be saved and re-loaded for inference.
Feature Extraction
- Traditional: Build a TF-IDF vectorizer (or n-gram count) pipeline.
- Neural: Prepare sequences for embedding—pad/truncate to a fixed length.
- Save each preprocessor (vectorizer/tokenizer) to disk.
Part 2 – Exploratory Data Analysis (EDA)¶
Class Distribution
- Visualize the number of positive vs. negative reviews.
- Compute descriptive statistics on review lengths (mean, median, IQR).
Text Characteristics
- Plot the 20 most frequent tokens per sentiment class.
- Generate word clouds (or bar charts) highlighting key terms for each class.
Correlation Analysis
- Analyze whether review length correlates with sentiment.
- Present findings numerically and with at least one visualization.
Part 3 – Baseline Traditional Models¶
Logistic Regression & SVM
- Train at least two linear models on your TF-IDF features (e.g., logistic regression, linear SVM).
- Use cross-validation (≥ 5 folds) on the training set to tune at least one hyperparameter.
Random Forest & Gradient Boosting
- Train two tree-based models (e.g., Random Forest, XGBoost) on the same features.
- Report feature-importance for each and discuss any notable tokens.
Evaluation Metrics
- Compute accuracy, precision, recall, F1-score, and ROC-AUC on the held-out test set.
- Present all results in a single comparison table.
Part 4 – Neural Network Models¶
Simple Feed-Forward
- Build an embedding layer + a dense MLP classifier.
- Run the model twice: once with the embedding layer frozen and once with it trainable.
Convolutional Text Classifier
- Implement a 1D-CNN architecture (Conv + Pooling) for sequence data.
- Justify your choice of kernel sizes and number of filters.
Recurrent Model (Optional)
- (Stretch) Add an RNN or Bi-LSTM layer and compare performance/time vs. CNN.
Part 5 – Transfer Learning & Advanced Architectures¶
Pre-trained Embeddings
- Retrain one network using pre-trained GloVe (or FastText) embeddings.
- Compare results against your from-scratch embedding runs.
Transformer Fine-Tuning
- Fine-tune a BERT-family model on the training data.
- Clearly outline your training hyperparameters (learning rate, batch size, epochs).
Part 6 – Hyperparameter Optimization¶
Search Strategy
- Use a library (e.g., Keras Tuner, Optuna) to optimize at least two hyperparameters of one deep model.
- Describe your search space and stopping criteria.
Results Analysis
- Report the best hyperparameter configuration found.
- Plot validation-loss (or metric) vs. trials to illustrate tuning behavior.
Part 7 – Final Comparison & Error Analysis¶
Consolidated Results
- Tabulate test-set performance for all models (traditional, neural, transfer-learned).
- Highlight top‐performing model overall and top in each category.
Statistical Significance
- Perform a significance test (e.g., McNemar’s test) between your best two models.
Error Analysis
- Identify at least 20 examples your best model misclassified.
- For a sample of 5, provide the raw text, predicted vs. true label, and a short discussion of each error—what linguistic artifact might have confused the model?
Part 8 – Optional Challenge Extensions¶
- Implement data augmentation for text (back-translation, synonym swapping) and measure its impact.
- Integrate a sentiment lexicon feature (e.g., VADER scores) into your models and assess whether it improves predictions.
- Deploy your best model as a simple REST API using Flask or FastAPI and demo it on a handful of user‐submitted reviews.
Part 1 – Data Ingestion & Preprocessing¶
- Data Loading
- Acquire the Stanford Sentiment Treebank dataset.
- Split into training, validation, and test sets with stratified sampling to preserve class balance.
- Clearly document your splitting strategy and resulting dataset sizes.
# Load data (example)
import pandas as pd
# IMPORT DATA
!git clone https://github.com/YJiangcm/SST-2-sentiment-analysis.git
# Assuming the dataset is CSV for illustration
df = pd.read_csv('sst_data.csv')
df.head()
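The stratified split requested above can be sketched without extra dependencies as follows. This is a minimal illustration on toy labels, not the required implementation: the function name `stratified_split` and the 80/10/10 fractions are assumptions, and in practice scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train_idx, val_idx, test_idx) with class ratios preserved."""
    rng = random.Random(seed)
    # Group example indices by class so each class is split separately
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val_idx += indices[:n_val]
        test_idx += indices[n_val:n_val + n_test]
        train_idx += indices[n_val + n_test:]
    return sorted(train_idx), sorted(val_idx), sorted(test_idx)

# Toy labels: 60 positive (1) and 40 negative (0) reviews
labels = [1] * 60 + [0] * 40
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 80 10 10
```

Reporting the three sizes and the per-class counts in each split is one simple way to document the splitting strategy.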
- Text Cleaning & Tokenization
- Implement a reusable preprocessing pipeline that handles at least:
- HTML removal, lowercasing, punctuation stripping
- Vocabulary pruning (e.g., rare words threshold)
- Tokenization (character- or word-level)
- Expose this as a function/class so it can be saved and re-loaded for inference.
import re
def clean_text(text):
    text = re.sub(r'<[^>]*>', '', text) # Remove HTML tags
    text = re.sub(r'\W+', ' ', text.lower()) # Lowercase, then replace runs of non-word characters (punctuation, symbols) with a space
    return text
# Example usage
df['cleaned_review'] = df['review'].apply(clean_text)
df.head()
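To cover vocabulary pruning and the save/re-load requirement, the cleaning step can be wrapped in a small class. The sketch below is one possible shape, assuming word-level tokenization; the class name `TextPreprocessor`, the `min_count` threshold, and the `<pad>`/`<unk>` tokens are illustrative choices rather than part of the assignment.

```python
import json
import re
from collections import Counter

class TextPreprocessor:
    """Cleans text, tokenizes at word level, and prunes rare words."""

    def __init__(self, min_count=2):
        self.min_count = min_count
        self.vocab = {"<pad>": 0, "<unk>": 1}

    @staticmethod
    def tokenize(text):
        text = re.sub(r"<[^>]*>", " ", text)               # drop HTML tags
        text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # lowercase, strip punctuation
        return text.split()

    def fit(self, texts):
        counts = Counter(tok for t in texts for tok in self.tokenize(t))
        # Keep only words seen at least min_count times; the rest map to <unk>
        for tok, n in sorted(counts.items()):
            if n >= self.min_count:
                self.vocab[tok] = len(self.vocab)
        return self

    def transform(self, texts):
        unk = self.vocab["<unk>"]
        return [[self.vocab.get(tok, unk) for tok in self.tokenize(t)] for t in texts]

    def to_json(self):
        return json.dumps({"min_count": self.min_count, "vocab": self.vocab})

    @classmethod
    def from_json(cls, payload):
        state = json.loads(payload)
        obj = cls(min_count=state["min_count"])
        obj.vocab = state["vocab"]
        return obj

prep = TextPreprocessor(min_count=2).fit(
    ["A <b>great</b> film!", "Great cast, dull plot.", "A dull film."]
)
restored = TextPreprocessor.from_json(prep.to_json())  # same round trip as a file on disk
print(restored.transform(["A great unknown film"]))  # [[2, 5, 1, 4]]
```

Writing the `to_json()` string to a file and reading it back at inference time keeps the training and serving vocabularies identical.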
- Feature Extraction
- Traditional: Build a TF-IDF vectorizer (or n-gram count) pipeline.
- Neural: Prepare sequences for embedding—pad/truncate to a fixed length.
- Save each preprocessor (vectorizer/tokenizer) to disk.
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=5000)
# Fit on cleaned text
X = vectorizer.fit_transform(df['cleaned_review'])
print(X.shape) # Output feature shape
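For the neural models, token-id sequences must share one length. A minimal sketch of the pad/truncate step is shown below; the helper name `pad_or_truncate` and the choice to pad and cut at the end are assumptions. Keras offers `pad_sequences` for the same purpose, which by default pads and truncates at the start instead.

```python
def pad_or_truncate(sequences, max_len, pad_id=0):
    """Right-pad short sequences with pad_id and cut long ones to max_len."""
    fixed = []
    for seq in sequences:
        clipped = list(seq[:max_len])
        fixed.append(clipped + [pad_id] * (max_len - len(clipped)))
    return fixed

batch = pad_or_truncate([[5, 8], [3, 9, 4, 7, 2, 6]], max_len=4)
print(batch)  # [[5, 8, 0, 0], [3, 9, 4, 7]]
```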
Part 2 – Exploratory Data Analysis (EDA)¶
- Class Distribution
- Visualize the number of positive vs. negative reviews.
- Compute descriptive statistics on review lengths (mean, median, IQR).
import matplotlib.pyplot as plt
# Visualize class distribution
df['sentiment'].value_counts().plot(kind='bar')
plt.title("Class Distribution")
plt.ylabel("Number of Reviews")
plt.show()
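The length statistics and the length/sentiment correlation asked for in this part can be computed with the standard library alone. The sketch below uses four made-up reviews; the helper names are hypothetical, and the correlation is a plain Pearson coefficient, which for a 0/1 label is the point-biserial correlation.

```python
import statistics

def length_summary(lengths):
    """Mean, median, and interquartile range of review lengths."""
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    return {
        "mean": statistics.fmean(lengths),
        "median": statistics.median(lengths),
        "iqr": q3 - q1,
    }

def pearson(xs, ys):
    """Pearson correlation; with 0/1 labels this is the point-biserial r."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Toy data (invented for illustration): review text and 0/1 sentiment
reviews = ["dull film", "not worth the time", "a warm and funny little gem",
           "one of the best films of the year"]
sentiment = [0, 0, 1, 1]
lengths = [len(r.split()) for r in reviews]  # [2, 4, 6, 8]

summary = length_summary(lengths)
r = pearson(lengths, sentiment)
print(summary, round(r, 3))
```

A box plot of length per class is a natural companion visualization for the numeric result.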
Part 3 – Baseline Traditional Models¶
- Logistic Regression & SVM
- Train at least two linear models on your TF-IDF features.
- Use cross-validation (≥ 5 folds) on the training set to tune at least one hyperparameter.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
# Logistic Regression (max_iter raised so the solver converges on sparse TF-IDF features)
logreg = LogisticRegression(max_iter=1000)
logreg_scores = cross_val_score(logreg, X, df['sentiment'], cv=5)
# Linear SVM (SVC() defaults to an RBF kernel, which is not a linear model)
svm = LinearSVC()
svm_scores = cross_val_score(svm, X, df['sentiment'], cv=5)
# Print per-fold accuracy for both models
print("Logistic Regression Scores:", logreg_scores)
print("SVM Scores:", svm_scores)
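Cross-validation accuracy alone does not cover the metrics this part asks for on the held-out test set. As a rough sketch of what those numbers mean, the helpers below compute accuracy, precision, recall, F1, and ROC-AUC from scratch on invented predictions; the function names are hypothetical, and scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score` are the usual tools for real runs.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented test labels and model scores, thresholded at 0.5
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, round(auc, 3))
```

Collecting one such dictionary per model makes the single comparison table straightforward to build.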
Part 4 – Neural Network Models¶
- Simple Feed-Forward
- Build an embedding layer + a dense MLP classifier.
- Run the model twice: once with the embedding layer frozen and once with it trainable.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, Input
model = Sequential([
    Input(shape=(500,)),  # token-id sequences padded/truncated to 500
    Embedding(input_dim=5000, output_dim=128),  # vocabulary of 5000 tokens, 128-d vectors
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
Part 5 – Transfer Learning & Advanced Architectures¶
- Pre-trained Embeddings
- Retrain one network using pre-trained GloVe (or FastText) embeddings.
- Compare results against your from-scratch embedding runs.
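# Assuming GloVe embeddings are loaded here
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, Flatten, Input
from tensorflow.keras.models import Sequential
embedding_matrix = ... # Load pre-trained GloVe matrix, shape (vocab_size, embedding_dim)
# The frozen embedding must be the first layer of a new model, not appended after the output layer
embedding_layer = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1],
                            embeddings_initializer=Constant(embedding_matrix), trainable=False)
glove_model = Sequential([Input(shape=(500,)), embedding_layer, Flatten(),
                          Dense(64, activation='relu'), Dense(1, activation='sigmoid')])
glove_model.summary()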
Part 6 – Hyperparameter Optimization¶
- Search Strategy
- Use a library (e.g., Keras Tuner, Optuna) to optimize at least two hyperparameters of one deep model.
- Describe your search space and stopping criteria.
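from keras_tuner import RandomSearch
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
def build_model(hp):
    model = Sequential([
        Input(shape=(5000,)),  # one input per TF-IDF feature (max_features=5000)
        Dense(hp.Int('units', min_value=32, max_value=128, step=32), activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    # Second tuned hyperparameter: the learning rate
    lr = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    model.compile(optimizer=Adam(learning_rate=lr), loss='binary_crossentropy', metrics=['accuracy'])
    return model
tuner = RandomSearch(build_model, objective='val_accuracy', max_trials=10)
# Tune against the validation split, not the test set.
# X_train, y_train, X_val, y_val are assumed to be dense arrays from the Part 1 split.
tuner.search(X_train, y_train, epochs=10, validation_data=(X_val, y_val))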
Part 7 – Final Comparison & Error Analysis¶
- Consolidated Results
- Tabulate all models' performances on the test set (accuracy, F1, etc.)
- Identify the best-performing model and its hyperparameters.
# Example consolidated results (placeholder values; replace with your measured test-set metrics)
results = {
'Model': ['Logistic Regression', 'SVM', 'Simple NN'],
'Accuracy': [0.85, 0.83, 0.88],
'F1 Score': [0.86, 0.84, 0.89]
}
pd.DataFrame(results)
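Part 7 also calls for a significance test between the best two models. One way to approach it is an exact McNemar test on the examples where the two models disagree, sketched below on invented predictions. The function name `mcnemar_exact` is hypothetical; the `statsmodels` package ships a `mcnemar` function in `statsmodels.stats.contingency_tables` that can be used instead.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant pairs.

    Returns (b, c, p_value), where b counts examples only model A got
    right and c counts examples only model B got right.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(a == t and m != t for t, a, m in rows)
    c = sum(a != t and m == t for t, a, m in rows)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under H0 each discordant pair favors either model with probability 0.5
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Invented predictions for 20 positive test reviews
y_true = [1] * 20
pred_a = [1] * 8 + [0] * 1 + [1] * 11
pred_b = [0] * 8 + [1] * 1 + [1] * 11

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)  # 8 1 0.0390625
```

Only the disagreements carry information here, so two models with similar accuracy can still differ significantly if their errors fall on different examples.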
Part 8 – Optional Challenge Extensions¶
- Data Augmentation
- Implement data augmentation for text (back-translation, synonym swapping) and measure its impact.
# Example for back-translation using a library
from googletrans import Translator
translator = Translator()
df['augmented_review'] = df['review'].apply(lambda x: translator.translate(x, src='en', dest='fr').text)
df['augmented_review'] = df['augmented_review'].apply(lambda x: translator.translate(x, src='fr', dest='en').text)
df.head()
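Synonym swapping, the other augmentation named in this part, can be sketched without any network calls. The example below is a toy version under stated assumptions: the `SYNONYMS` table is hand-written for illustration (a real run would draw on a thesaurus such as WordNet), and the function name and `prob` parameter are hypothetical.

```python
import random

# Hypothetical hand-written synonym table, for illustration only
SYNONYMS = {
    "good": ["great", "fine"],
    "bad": ["poor", "awful"],
    "movie": ["film"],
}

def synonym_swap(text, prob=0.5, seed=0):
    """Replace each word that has a known synonym with probability prob."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

original = "a good movie with a bad ending"
augmented = synonym_swap(original, prob=1.0)
print(augmented)
```

To measure the impact, augment only the training split and compare test metrics with and without the extra rows; augmenting validation or test data would distort the comparison.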
Reflecting¶
Answer the following inference questions:
Part 1 – Data Ingestion & Preprocessing¶
Data Loading
- How do you ensure that your dataset is properly split into training, validation, and test sets, and why is class balance important during data splitting?
Text Cleaning & Tokenization
- What is the role of tokenization in text preprocessing, and how does it impact the model's performance?
Part 2 – Exploratory Data Analysis (EDA)¶
Class Distribution
- How does the class distribution (positive vs negative reviews) impact the model’s performance, and what strategies can be used if the dataset is imbalanced?
Text Characteristics
- What insights can be gained from visualizing word clouds for each sentiment class, and how can it improve feature engineering?
Part 3 – Baseline Traditional Models¶
Logistic Regression & SVM
- Why do you use cross-validation when training models like logistic regression or SVM, and how does it help prevent overfitting?
Random Forest & Gradient Boosting
- What role does feature importance play in interpreting Random Forest or XGBoost models?
Part 4 – Neural Network Models¶
Simple Feed-Forward
- Why is embedding freezing used when training neural networks on pre-trained embeddings, and how does it affect model performance?
Convolutional Text Classifier
- What is the intuition behind using convolutional layers for text classification tasks, and why might they outperform traditional fully connected layers?
Part 5 – Transfer Learning & Advanced Architectures¶
Pre-trained Embeddings
- How do pre-trained word embeddings like GloVe or FastText improve model performance compared to training embeddings from scratch?
Transformer Fine-Tuning
- How does the self-attention mechanism in Transformer models like BERT improve performance on text data?
Part 6 – Hyperparameter Optimization¶
Search Strategy
- How does hyperparameter optimization help improve the model’s performance, and what challenges arise when selecting an optimal search space?
Results Analysis
- What do the validation loss and accuracy tell you about the model’s generalization ability?
Part 7 – Final Comparison & Error Analysis¶
Consolidated Results
- How do you compare models with different architectures (e.g., logistic regression vs. BERT) to select the best model for deployment?
Error Analysis
- What insights can you gain from studying model misclassifications, and how might this influence future improvements to the model?
Part 8 – Optional Challenge Extensions¶
Data Augmentation
- How does back-translation or synonym swapping as text augmentation improve model generalization?
Sentiment Lexicon
- How might integrating sentiment lexicons like VADER improve the sentiment classification model, and what are the challenges of using lexicon-based approaches alongside machine learning models?