QMSSGR5074 - Project 3¶

Your Group ID: [Fill Here]¶

Your UNIs: [Fill Here]¶

Your Full Names: [Fill Here]¶

Public GitHub Repo: [Fill Here]¶

Description¶

Part 1 – Data Ingestion & Preprocessing¶

Data Loading
- Acquire the Stanford Sentiment Treebank dataset.
- Split into training, validation and test sets with stratified sampling to preserve class balance.
- Clearly document your splitting strategy and resulting dataset sizes.
Text Cleaning & Tokenization
- Implement a reusable preprocessing pipeline that handles at least:
  - HTML removal, lowercasing, punctuation stripping
  - Vocabulary pruning (e.g., rare words threshold)
  - Tokenization (character- or word-level)
- Expose this as a function/class so it can be saved and re-loaded for inference.
Feature Extraction
- Traditional: Build a TF-IDF vectorizer (or n-gram count) pipeline.
- Neural: Prepare sequences for embedding—pad/truncate to a fixed length.
- Save each preprocessor (vectorizer/tokenizer) to disk.

Part 2 – Exploratory Data Analysis (EDA)¶

Class Distribution
- Visualize the number of positive vs. negative reviews.
- Compute descriptive statistics on review lengths (mean, median, IQR).
Text Characteristics
- Plot the 20 most frequent tokens per sentiment class.
- Generate word clouds (or bar charts) highlighting key terms for each class.
Correlation Analysis
- Analyze whether review length correlates with sentiment.
- Present findings numerically and with at least one visualization.

Part 3 – Baseline Traditional Models¶

Logistic Regression & SVM
- Train at least two linear models on your TF-IDF features (e.g., logistic regression, linear SVM).
- Use cross-validation (≥ 5 folds) on the training set to tune at least one hyperparameter.
Random Forest & Gradient Boosting
- Train two tree-based models (e.g., Random Forest, XGBoost) on the same features.
- Report feature-importance for each and discuss any notable tokens.
Evaluation Metrics
- Compute accuracy, precision, recall, F1-score, and ROC-AUC on the held-out test set.
- Present all results in a single comparison table.

Part 4 – Neural Network Models¶

Simple Feed-Forward
- Build an embedding layer + a dense MLP classifier.
- Train with the embeddings frozen in one run and trainable in another, then compare the results.
Convolutional Text Classifier
- Implement a 1D-CNN architecture (Conv + Pooling) for sequence data.
- Justify your choice of kernel sizes and number of filters.
Recurrent Model (Optional)
- (Stretch) Add an RNN or Bi-LSTM layer and compare performance/time vs. CNN.

Part 5 – Transfer Learning & Advanced Architectures¶

Pre-trained Embeddings
- Retrain one network using pre-trained GloVe (or FastText) embeddings.
- Compare results against your from-scratch embedding runs.
Transformer Fine-Tuning
- Fine-tune a BERT-family model on the training data.
- Clearly outline your training hyperparameters (learning rate, batch size, epochs).

Part 6 – Hyperparameter Optimization¶

Search Strategy
- Use a library (e.g., Keras Tuner, Optuna) to optimize at least two hyperparameters of one deep model.
- Describe your search space and stopping criteria.
Results Analysis
- Report the best hyperparameter configuration found.
- Plot validation-loss (or metric) vs. trials to illustrate tuning behavior.

Part 7 – Final Comparison & Error Analysis¶

Consolidated Results
- Tabulate test-set performance for all models (traditional, neural, transfer-learned).
- Highlight the top-performing model overall and the best model in each category.
Statistical Significance
- Perform a significance test (e.g., McNemar’s test) between your best two models.
Error Analysis
- Identify at least 20 examples your best model misclassified.
- For a sample of 5, provide the raw text, predicted vs. true label, and a short discussion of each error—what linguistic artifact might have confused the model?

Part 8 – Optional Challenge Extensions¶

Implement data augmentation for text (back-translation, synonym swapping) and measure its impact.
Integrate a sentiment lexicon feature (e.g., VADER scores) into your models and assess whether it improves predictions.
Deploy your best model as a simple REST API using Flask or FastAPI and demo it on a handful of user‐submitted reviews.

Part 1 – Data Ingestion & Preprocessing¶

Data Loading
- Acquire the Stanford Sentiment Treebank dataset.
- Split into training, validation, and test sets with stratified sampling to preserve class balance.
- Clearly document your splitting strategy and resulting dataset sizes.

In [ ]:

# Load data (example)
import pandas as pd


# IMPORT DATA
!git clone https://github.com/YJiangcm/SST-2-sentiment-analysis.git

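# Placeholder path for illustration: point this at the actual data file(s)
# inside the cloned repository (pass sep='\t' if they are tab-separated)
df = pd.read_csv('sst_data.csv')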
df.head()
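
The stratified split requested above could be sketched as follows. This is a minimal illustration on a synthetic DataFrame, assuming scikit-learn is installed; the column names `review` and `sentiment`, the 80/10/10 ratios, the seed, and the helper name `stratified_split` are assumptions chosen for the example, not requirements of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the SST data: 60 positive and 40 negative rows.
toy_df = pd.DataFrame({
    "review": [f"sample sentence {i}" for i in range(100)],
    "sentiment": [1] * 60 + [0] * 40,
})

def stratified_split(frame, label_col="sentiment", val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train, val, test) frames that keep the label proportions of `frame`."""
    # Use absolute row counts so both fractions refer to the full dataset.
    n_val = round(val_frac * len(frame))
    n_test = round(test_frac * len(frame))
    # First carve off the test set, stratifying on the label.
    train_val, test = train_test_split(
        frame, test_size=n_test, stratify=frame[label_col], random_state=seed
    )
    # Then split the remainder into training and validation sets.
    train, val = train_test_split(
        train_val, test_size=n_val, stratify=train_val[label_col], random_state=seed
    )
    return train, val, test

train_df, val_df, test_df = stratified_split(toy_df)
for name, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    # Report the size and the share of positive labels in each split.
    print(name, len(part), round(part["sentiment"].mean(), 2))
```

Printing the size and positive-class share of each split is one simple way to document the splitting strategy and confirm that the class balance was preserved.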

Text Cleaning & Tokenization
- Implement a reusable preprocessing pipeline that handles at least:
  - HTML removal, lowercasing, punctuation stripping
  - Vocabulary pruning (e.g., rare words threshold)
  - Tokenization (character- or word-level)
- Expose this as a function/class so it can be saved and re-loaded for inference.

In [ ]:

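import re

def clean_text(text):
    """Strip HTML tags, lowercase, and replace runs of non-alphanumeric characters with a space."""
    text = re.sub(r'<[^>]*>', ' ', text)  # Replace HTML tags with a space so adjacent words stay separate
    text = re.sub(r'\W+', ' ', text.lower())  # Lowercase and drop punctuation
    return text.strip()  # Trim the spaces left at either end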

# Example usage
df['cleaned_review'] = df['review'].apply(clean_text)
df.head()
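
One possible shape for the reusable, saveable pipeline described above is a small class that cleans, tokenizes, prunes rare words, and stores its state as JSON. This is a sketch under assumptions: the class name `TextPreprocessor`, the `min_freq` threshold, the `<pad>`/`<unk>` ids, and the file name are all hypothetical choices for illustration.

```python
import json
import os
import re
import tempfile
from collections import Counter

class TextPreprocessor:
    """Clean, tokenize, and index text with a frequency-pruned vocabulary."""

    PAD, UNK = "<pad>", "<unk>"

    def __init__(self, min_freq=2):
        self.min_freq = min_freq
        self.word_index = {self.PAD: 0, self.UNK: 1}

    @staticmethod
    def clean(text):
        text = re.sub(r"<[^>]*>", " ", text)             # drop HTML tags
        text = re.sub(r"[^\w\s]|_", " ", text.lower())   # lowercase, strip punctuation
        return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

    def tokenize(self, text):
        return self.clean(text).split()                  # word-level tokens

    def fit(self, texts):
        counts = Counter(tok for t in texts for tok in self.tokenize(t))
        # Keep only tokens seen at least min_freq times; sorting makes ids deterministic.
        for tok, n in sorted(counts.items()):
            if n >= self.min_freq:
                self.word_index[tok] = len(self.word_index)
        return self

    def transform(self, texts):
        unk = self.word_index[self.UNK]
        return [[self.word_index.get(tok, unk) for tok in self.tokenize(t)] for t in texts]

    def save(self, path):
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"min_freq": self.min_freq, "word_index": self.word_index}, fh)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as fh:
            state = json.load(fh)
        obj = cls(min_freq=state["min_freq"])
        obj.word_index = state["word_index"]
        return obj

# Fit on a few toy sentences, save, reload, and check that both objects agree.
prep = TextPreprocessor(min_freq=2).fit(
    ["A <b>great</b> movie!", "great fun", "not great, not fun"]
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessor.json")
    prep.save(path)
    reloaded = TextPreprocessor.load(path)

ids = prep.transform(["Great film"])
print(ids, reloaded.transform(["Great film"]))
```

Storing only the vocabulary and settings (rather than pickling the object) keeps the saved file readable and independent of where the class is defined, which may simplify reuse at inference time.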

Feature Extraction
- Traditional: Build a TF-IDF vectorizer (or n-gram count) pipeline.
- Neural: Prepare sequences for embedding—pad/truncate to a fixed length.
- Save each preprocessor (vectorizer/tokenizer) to disk.

In [ ]:

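import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Fit on cleaned text (in the final pipeline, fit on the training split only
# and call transform() on the validation/test splits to avoid leakage)
X = vectorizer.fit_transform(df['cleaned_review'])
print(X.shape)  # Output feature shape

# Save the fitted vectorizer so it can be re-loaded for inference (example file name)
joblib.dump(vectorizer, 'tfidf_vectorizer.joblib')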
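
For the neural branch, padding or truncating token-id sequences to a fixed length could look like the sketch below. It is framework-free on purpose; the function name `pad_or_truncate`, the pad id `0`, and the choice to pad and cut at the end of the sequence are assumptions made for the example.

```python
def pad_or_truncate(sequences, max_len, pad_id=0):
    """Right-pad with pad_id or cut to max_len so every sequence has the same length."""
    fixed = []
    for seq in sequences:
        seq = list(seq)[:max_len]                            # truncate long sequences
        fixed.append(seq + [pad_id] * (max_len - len(seq)))  # pad short ones
    return fixed

# Two toy sequences of token ids: one shorter and one longer than max_len.
batch = pad_or_truncate([[5, 6], [7, 8, 9, 10, 11]], max_len=4)
print(batch)
```

Keras's `pad_sequences` utility pads and truncates at the start of the sequence by default, so whichever convention is chosen, the same one should be applied at training and inference time.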

Part 2 – Exploratory Data Analysis (EDA)¶

Class Distribution
- Visualize the number of positive vs. negative reviews.
- Compute descriptive statistics on review lengths (mean, median, IQR).

In [ ]:

import matplotlib.pyplot as plt

# Visualize class distribution
df['sentiment'].value_counts().plot(kind='bar')
plt.title("Class Distribution")
plt.ylabel("Number of Reviews")
plt.show()
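
The remaining EDA items (length statistics, the length/sentiment correlation, and the most frequent tokens per class) can be sketched with the standard library alone. The toy `samples` list below is made up for illustration; with a binary label, the Pearson coefficient computed here is the point-biserial correlation.

```python
import statistics
from collections import Counter

# Toy (text, label) pairs standing in for the cleaned reviews; 1 = positive, 0 = negative.
samples = [
    ("a moving and funny film", 1),
    ("great cast great script", 1),
    ("dull", 0),
    ("a dull and tedious mess of a film", 0),
]

lengths = [len(text.split()) for text, _ in samples]  # review length in tokens
labels = [label for _, label in samples]

# Descriptive statistics: mean, median, and interquartile range.
q1, _q2, q3 = statistics.quantiles(lengths, n=4)
length_stats = {
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
    "iqr": q3 - q1,
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(lengths, labels)

# Most frequent tokens per sentiment class (up to 20 each).
top_tokens = {}
for label in sorted(set(labels)):
    counts = Counter(tok for text, y in samples if y == label for tok in text.split())
    top_tokens[label] = counts.most_common(20)

print(length_stats, r)
print(top_tokens)
```

On the real data, the same per-class counts can feed a bar chart or word cloud, and a box plot of length by class is one way to pair the correlation figure with a visualization.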

Part 3 – Baseline Traditional Models¶

Logistic Regression & SVM
- Train at least two linear models on your TF-IDF features.
- Use cross-validation (≥ 5 folds) on the training set to tune at least one hyperparameter.

In [ ]:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Logistic Regression (a higher max_iter helps the solver converge on TF-IDF features)
logreg = LogisticRegression(max_iter=1000)
logreg_scores = cross_val_score(logreg, X, df['sentiment'], cv=5)

# Linear SVM (SVC() defaults to an RBF kernel, which is not a linear model)
svm = LinearSVC()
svm_scores = cross_val_score(svm, X, df['sentiment'], cv=5)

# Print per-fold accuracy for both models
print("Logistic Regression Scores:", logreg_scores)
print("SVM Scores:", svm_scores)
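
Tuning a hyperparameter with 5-fold cross-validation could be sketched as below. Placing the vectorizer inside a `Pipeline` means each fold fits TF-IDF only on its own training portion. The toy sentences, the grid of `C` values, and the step names `tfidf` and `clf` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy training data: 10 positive and 10 negative sentences.
texts = [f"great wonderful film take {i}" for i in range(10)] + \
        [f"dull boring film take {i}" for i in range(10)]
y = [1] * 10 + [0] * 10

# Vectorizer and classifier are tuned and validated together.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Search over the inverse regularization strength C with stratified 5-fold CV.
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(texts, y)

print(search.best_params_, search.best_score_)
```

The fitted `search.best_estimator_` can then be scored once on the held-out test set, which keeps the test data out of the tuning loop.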

Part 4 – Neural Network Models¶

Simple Feed-Forward
- Build an embedding layer + a dense MLP classifier.
- Train with the embeddings frozen in one run and trainable in another, then compare the results.

In [ ]:

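import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, Input

model = Sequential([
    # Sequences of 500 token ids (an explicit Input replaces the `input_length`
    # argument, which newer Keras versions deprecate)
    Input(shape=(500,)),
    # Set trainable=False for the frozen-embedding run
    Embedding(input_dim=5000, output_dim=128, trainable=True),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()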
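
Part 4 also asks for a 1D-CNN with justified kernel sizes and filter counts. To make the Conv + Pooling idea concrete, here is a framework-free NumPy sketch of the forward pass only (no training). It is not the Keras implementation; the function name and the toy shapes are assumptions for illustration.

```python
import numpy as np

def conv1d_global_max(embedded, kernels, bias):
    """Apply 1D convolution ("valid" padding, stride 1), ReLU, and global max pooling.

    embedded: (seq_len, emb_dim) token embeddings for one sentence
    kernels:  (n_filters, kernel_size, emb_dim)
    bias:     (n_filters,)
    Returns a vector of length n_filters.
    """
    n_filters, k, _ = kernels.shape
    n_windows = embedded.shape[0] - k + 1
    feature_maps = np.empty((n_windows, n_filters))
    for start in range(n_windows):
        window = embedded[start:start + k]  # k consecutive token vectors
        # Each filter takes a dot product with the whole window.
        feature_maps[start] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    activated = np.maximum(feature_maps, 0.0)  # ReLU
    return activated.max(axis=0)               # strongest response per filter

rng = np.random.default_rng(0)
embedded = rng.normal(size=(10, 8))    # 10 tokens, 8-dimensional embeddings
kernels = rng.normal(size=(4, 3, 8))   # 4 filters spanning 3 tokens each
pooled = conv1d_global_max(embedded, kernels, np.zeros(4))
print(pooled.shape)
```

Seen this way, the kernel size is the length of the n-gram each filter inspects, and the number of filters is the number of distinct n-gram patterns the layer can detect, which is a reasonable starting point for justifying both choices.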

Part 5 – Transfer Learning & Advanced Architectures¶

Pre-trained Embeddings
- Retrain one network using pre-trained GloVe (or FastText) embeddings.
- Compare results against your from-scratch embedding runs.

In [ ]:

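# Assuming GloVe embeddings are loaded here
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten

embedding_matrix = ...  # Load pre-trained GloVe matrix; its shape must be (input_dim, output_dim)
# Build a separate model: appending an Embedding to the earlier model would place it after the output layer
glove_model = Sequential([
    Embedding(5000, 128, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), trainable=False),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
glove_model.summary()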
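
Building the pre-trained matrix itself could be sketched as follows, assuming the usual GloVe text format of one word followed by its space-separated vector per line. The helper name `build_embedding_matrix`, the three-dimensional fake vectors, and the toy `word_index` are made up for the example; a real run would read a downloaded file such as `glove.6B.100d.txt` instead of the in-memory string.

```python
import io
import numpy as np

def build_embedding_matrix(glove_lines, word_index, dim):
    """Rows follow the ids in word_index; words missing from GloVe keep a zero vector."""
    matrix = np.zeros((max(word_index.values()) + 1, dim), dtype="float32")
    found = 0
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        idx = word_index.get(word)
        # Only copy vectors for words in our vocabulary that have the expected size.
        if idx is not None and len(values) == dim:
            matrix[idx] = np.asarray(values, dtype="float32")
            found += 1
    return matrix, found

# In-memory stand-in for a GloVe file (fake 3-dimensional vectors).
fake_glove = io.StringIO(
    "great 0.1 0.2 0.3\n"
    "film 0.4 0.5 0.6\n"
    "boring -0.1 -0.2 -0.3\n"
)
word_index = {"<pad>": 0, "<unk>": 1, "great": 2, "film": 3, "tedious": 4}

embedding_matrix, found = build_embedding_matrix(fake_glove, word_index, dim=3)
print(embedding_matrix.shape, found)
```

Reporting how many vocabulary words were found gives a quick coverage figure, which may help explain any gap between the pre-trained and from-scratch runs.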

Part 6 – Hyperparameter Optimization¶

Search Strategy
- Use a library (e.g., Keras Tuner, Optuna) to optimize at least two hyperparameters of one deep model.
- Describe your search space and stopping criteria.

In [ ]:

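import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from keras_tuner import RandomSearch

def build_model(hp):
    model = Sequential([
        Input(shape=(5000,)),  # One input per TF-IDF feature (max_features=5000)
        Dense(hp.Int('units', min_value=32, max_value=128, step=32), activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    # Second tuned hyperparameter: the learning rate
    lr = hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model

tuner = RandomSearch(build_model, objective='val_accuracy', max_trials=10)
# X_train/X_val and y_train/y_val are assumed to be dense arrays from your own train/validation split;
# tune on the validation set and keep the test set for the final evaluation only
tuner.search(X_train, y_train, epochs=10, validation_data=(X_val, y_val))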

Part 7 – Final Comparison & Error Analysis¶

Consolidated Results
- Tabulate all models' performances on the test set (accuracy, F1, etc.)
- Identify the best-performing model and its hyperparameters.

In [ ]:

# Example consolidated results (placeholder values; replace with the metrics you measure on the test set)
results = {
    'Model': ['Logistic Regression', 'SVM', 'Simple NN'],
    'Accuracy': [0.85, 0.83, 0.88],
    'F1 Score': [0.86, 0.84, 0.89]
}
pd.DataFrame(results)
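
Part 7 also asks for a significance test between the two best models. An exact McNemar test can be sketched with the standard library: it looks only at the examples where exactly one model is correct and asks whether that split is plausible under a fair coin. The function name and the toy predictions below are assumptions for illustration.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value) where b counts examples only model A gets right
    and c counts examples only model B gets right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy example: model A is right on all 10 items, model B misses the first 6.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 6 + [1] * 4

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

Because both models are scored on the same test examples, a paired test of this kind is more appropriate than comparing two accuracy figures as if they came from independent samples.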

Part 8 – Optional Challenge Extensions¶

Data Augmentation
- Implement data augmentation for text (back-translation, synonym swapping) and measure its impact.

In [ ]:

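# Example for back-translation using googletrans, an unofficial client that calls a web service for every row
# (depending on the installed version, translate() may be asynchronous and need to be awaited)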
from googletrans import Translator

translator = Translator()
df['augmented_review'] = df['review'].apply(lambda x: translator.translate(x, src='en', dest='fr').text)
df['augmented_review'] = df['augmented_review'].apply(lambda x: translator.translate(x, src='fr', dest='en').text)
df.head()
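
Synonym swapping, the other augmentation named above, can be sketched offline. The synonym table here is a tiny hand-written, hypothetical example; a real run might draw candidates from a lexical resource such as WordNet. The function name, swap probability, and seed are likewise assumptions.

```python
import random

# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "good": ["fine", "decent"],
    "movie": ["film"],
    "boring": ["dull", "tedious"],
}

def synonym_swap(text, synonyms=SYNONYMS, prob=0.5, seed=None):
    """Replace each token that has known synonyms with probability `prob`."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    out = []
    for tok in text.split():
        options = synonyms.get(tok.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))  # swap in a random synonym
        else:
            out.append(tok)                  # keep the original token
    return " ".join(out)

augmented = synonym_swap("a boring movie", prob=1.0, seed=0)
print(augmented)
```

To measure the impact, augmented sentences should be added to the training split only, with the validation and test sets left untouched so the comparison against the non-augmented run stays fair.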

Reflecting¶

Answer the following inference questions:

Part 1 – Data Ingestion & Preprocessing¶

Data Loading
- How do you ensure that your dataset is properly split into training, validation, and test sets, and why is class balance important during data splitting?
Text Cleaning & Tokenization
- What is the role of tokenization in text preprocessing, and how does it impact the model's performance?

Part 2 – Exploratory Data Analysis (EDA)¶

Class Distribution
- How does the class distribution (positive vs negative reviews) impact the model’s performance, and what strategies can be used if the dataset is imbalanced?
Text Characteristics
- What insights can be gained from visualizing word clouds for each sentiment class, and how can it improve feature engineering?

Part 3 – Baseline Traditional Models¶

Logistic Regression & SVM
- Why do you use cross-validation when training models like logistic regression or SVM, and how does it help prevent overfitting?
Random Forest & Gradient Boosting
- What role does feature importance play in interpreting Random Forest or XGBoost models?

Part 4 – Neural Network Models¶

Simple Feed-Forward
- Why is embedding freezing used when training neural networks on pre-trained embeddings, and how does it affect model performance?
Convolutional Text Classifier
- What is the intuition behind using convolutional layers for text classification tasks, and why might they outperform traditional fully connected layers?

Part 5 – Transfer Learning & Advanced Architectures¶

Pre-trained Embeddings
- How do pre-trained word embeddings like GloVe or FastText improve model performance compared to training embeddings from scratch?
Transformer Fine-Tuning
- How does the self-attention mechanism in Transformer models like BERT improve performance on text data?

Part 6 – Hyperparameter Optimization¶

Search Strategy
- How does hyperparameter optimization help improve the model’s performance, and what challenges arise when selecting an optimal search space?
Results Analysis
- What do the validation loss and accuracy tell you about the model’s generalization ability?

Part 7 – Final Comparison & Error Analysis¶

Consolidated Results
- How do you compare models with different architectures (e.g., logistic regression vs. BERT) to select the best model for deployment?
Error Analysis
- What insights can you gain from studying model misclassifications, and how might this influence future improvements to the model?

Part 8 – Optional Challenge Extensions¶

Data Augmentation
- How does back-translation or synonym swapping as text augmentation improve model generalization?
Sentiment Lexicon
- How might integrating sentiment lexicons like VADER improve the sentiment classification model, and what are the challenges of using lexicon-based approaches alongside machine learning models?