QMSSGR5074 - Project 3¶

Your Group ID: [Fill Here]¶

Your UNIs: [Fill Here]¶

Your Full Names: [Fill Here]¶

Public GitHub Repo: [Fill Here]¶

Description¶

Part 1 – Data Ingestion & Preprocessing¶

  1. Data Loading

    • Acquire the Stanford Sentiment Treebank dataset.
    • Split into training, validation and test sets with stratified sampling to preserve class balance.
    • Clearly document your splitting strategy and resulting dataset sizes.
  2. Text Cleaning & Tokenization

    • Implement a reusable preprocessing pipeline that handles at least:
      • HTML removal, lowercasing, punctuation stripping
      • Vocabulary pruning (e.g., rare words threshold)
      • Tokenization (character- or word-level)
    • Expose this as a function/class so it can be saved and re-loaded for inference.
  3. Feature Extraction

    • Traditional: Build a TF-IDF vectorizer (or n-gram count) pipeline.
    • Neural: Prepare sequences for embedding—pad/truncate to a fixed length.
    • Save each preprocessor (vectorizer/tokenizer) to disk.

Part 2 – Exploratory Data Analysis (EDA)¶

  1. Class Distribution

    • Visualize the number of positive vs. negative reviews.
    • Compute descriptive statistics on review lengths (mean, median, IQR).
  2. Text Characteristics

    • Plot the 20 most frequent tokens per sentiment class.
    • Generate word clouds (or bar charts) highlighting key terms for each class.
  3. Correlation Analysis

    • Analyze whether review length correlates with sentiment.
    • Present findings numerically and with at least one visualization.

Part 3 – Baseline Traditional Models¶

  1. Logistic Regression & SVM

    • Train at least two linear models on your TF-IDF features (e.g., logistic regression, linear SVM).
    • Use cross-validation (≥ 5 folds) on the training set to tune at least one hyperparameter.
  2. Random Forest & Gradient Boosting

    • Train two tree-based models (e.g., Random Forest, XGBoost) on the same features.
    • Report feature importances for each model and discuss any notable tokens.
  3. Evaluation Metrics

    • Compute accuracy, precision, recall, F1-score, and ROC-AUC on the held-out test set.
    • Present all results in a single comparison table.

Part 4 – Neural Network Models¶

  1. Simple Feed-Forward

    • Build an embedding layer + a dense MLP classifier.
    • Train with frozen and with trainable (unfrozen) embeddings in separate runs and compare the results.
  2. Convolutional Text Classifier

    • Implement a 1D-CNN architecture (Conv + Pooling) for sequence data.
    • Justify your choice of kernel sizes and number of filters.
  3. Recurrent Model (Optional)

    • (Stretch) Add an RNN or Bi-LSTM layer and compare performance/time vs. CNN.

Part 5 – Transfer Learning & Advanced Architectures¶

  1. Pre-trained Embeddings

    • Retrain one network using pre-trained GloVe (or FastText) embeddings.
    • Compare results against your from-scratch embedding runs.
  2. Transformer Fine-Tuning

    • Fine-tune a BERT-family model on the training data.
    • Clearly outline your training hyperparameters (learning rate, batch size, epochs).

Part 6 – Hyperparameter Optimization¶

  1. Search Strategy

    • Use a library (e.g., Keras Tuner, Optuna) to optimize at least two hyperparameters of one deep model.
    • Describe your search space and stopping criteria.
  2. Results Analysis

    • Report the best hyperparameter configuration found.
    • Plot validation-loss (or metric) vs. trials to illustrate tuning behavior.

Part 7 – Final Comparison & Error Analysis¶

  1. Consolidated Results

    • Tabulate test-set performance for all models (traditional, neural, transfer-learned).
    • Highlight the top-performing model overall and the top model in each category.
  2. Statistical Significance

    • Perform a significance test (e.g., McNemar’s test) between your best two models.
  3. Error Analysis

    • Identify at least 20 examples your best model misclassified.
    • For a sample of 5, provide the raw text, predicted vs. true label, and a short discussion of each error—what linguistic artifact might have confused the model?

Part 8 – Optional Challenge Extensions¶

  • Implement data augmentation for text (back-translation, synonym swapping) and measure its impact.
  • Integrate a sentiment lexicon feature (e.g., VADER scores) into your models and assess whether it improves predictions.
  • Deploy your best model as a simple REST API using Flask or FastAPI and demo it on a handful of user-submitted reviews.

Part 1 – Data Ingestion & Preprocessing¶

  1. Data Loading
    • Acquire the Stanford Sentiment Treebank dataset.
    • Split into training, validation, and test sets with stratified sampling to preserve class balance.
    • Clearly document your splitting strategy and resulting dataset sizes.
In [ ]:
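# Load data (example)
import pandas as pd

# Clone the repository used here as the data source
!git clone https://github.com/YJiangcm/SST-2-sentiment-analysis.git

# Placeholder path for illustration: point this at the actual data file
# (pass sep='\t' to read_csv if the file is tab-separated)
df = pd.read_csv('sst_data.csv')
df.head()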
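The stratified split requested above could be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed and that labels live in a column named `sentiment` (as in later cells). The 80/10/10 ratio, the seed, the toy frame, and the helper name `stratified_split` are illustrative assumptions, not part of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, label_col="sentiment", val_size=0.1, test_size=0.1, seed=42):
    """Split df into train/val/test while preserving the label proportions."""
    # Work with absolute row counts to avoid floating-point rounding surprises
    n_test = round(test_size * len(df))
    n_val = round(val_size * len(df))
    # First carve off the test set, stratifying on the label
    train_val, test = train_test_split(
        df, test_size=n_test, stratify=df[label_col], random_state=seed
    )
    # Then carve the validation set out of the remainder
    train, val = train_test_split(
        train_val, test_size=n_val, stratify=train_val[label_col], random_state=seed
    )
    return train, val, test


# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "review": [f"sample text {i}" for i in range(100)],
    "sentiment": [1] * 60 + [0] * 40,
})
train_df, val_df, test_df = stratified_split(toy)
for name, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    # Report the size and positive-class share of each split
    print(name, len(part), round(part["sentiment"].mean(), 2))
```

Printing the size and positive-class share of each split, as the loop does, is one simple way to document the splitting strategy and confirm that class balance was preserved.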
  2. Text Cleaning & Tokenization
    • Implement a reusable preprocessing pipeline that handles at least:
      • HTML removal, lowercasing, punctuation stripping
      • Vocabulary pruning (e.g., rare words threshold)
      • Tokenization (character- or word-level)
    • Expose this as a function/class so it can be saved and re-loaded for inference.
In [ ]:
import re

def clean_text(text):
    """Strip HTML tags, lowercase, and replace punctuation with spaces."""
    text = re.sub(r'<[^>]*>', ' ', text)  # Replace HTML tags with a space so adjacent words stay separate
    text = re.sub(r'\W+', ' ', text.lower())  # Lowercase; collapse non-word characters into single spaces
    return text.strip()

# Example usage (assumes the text column is named 'review')
df['cleaned_review'] = df['review'].apply(clean_text)
df.head()
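The cleaning function above covers HTML removal, lowercasing, and punctuation stripping. The remaining requirements (vocabulary pruning, tokenization, and a saveable object) might be sketched as below, using only the standard library. The class name `TextPreprocessor`, the `min_freq` threshold, the reserved ids, and the JSON file format are all assumptions made for illustration.

```python
import json
import os
import re
import tempfile
from collections import Counter


class TextPreprocessor:
    """Clean, tokenize, and index text with a frequency-pruned vocabulary."""

    PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary tokens

    def __init__(self, min_freq=2):
        self.min_freq = min_freq
        self.vocab = {}

    @staticmethod
    def clean(text):
        text = re.sub(r"<[^>]*>", " ", text)            # drop HTML tags
        text = re.sub(r"[^\w\s]|_", " ", text.lower())  # lowercase, strip punctuation
        return re.sub(r"\s+", " ", text).strip()        # normalize whitespace

    def tokenize(self, text):
        return self.clean(text).split()  # word-level tokens

    def fit(self, texts):
        counts = Counter(tok for t in texts for tok in self.tokenize(t))
        # Vocabulary pruning: keep only words seen at least min_freq times
        kept = sorted(w for w, c in counts.items() if c >= self.min_freq)
        self.vocab = {w: i + 2 for i, w in enumerate(kept)}  # ids 0 and 1 are reserved
        return self

    def transform(self, texts):
        return [[self.vocab.get(tok, self.UNK) for tok in self.tokenize(t)] for t in texts]

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"min_freq": self.min_freq, "vocab": self.vocab}, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        obj = cls(min_freq=state["min_freq"])
        obj.vocab = state["vocab"]
        return obj


# Fit on a toy corpus (hypothetical sentences), then round-trip through disk
prep = TextPreprocessor(min_freq=2).fit(
    ["A <b>great</b> film!", "Great cast, dull plot.", "Dull."]
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessor.json")
    prep.save(path)
    restored = TextPreprocessor.load(path)

print(restored.vocab)
print(restored.transform(["A great, dull surprise"]))
```

Storing only the threshold and the vocabulary as JSON keeps the saved file readable and independent of the class definition's import path. Pickling the whole object would be an equally valid choice.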
  3. Feature Extraction
    • Traditional: Build a TF-IDF vectorizer (or n-gram count) pipeline.
    • Neural: Prepare sequences for embedding—pad/truncate to a fixed length.
    • Save each preprocessor (vectorizer/tokenizer) to disk.
In [ ]:
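from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer, keeping the 5,000 most frequent terms
vectorizer = TfidfVectorizer(max_features=5000)

# Fit on cleaned text (in the full pipeline, fit on the training split only
# and call vectorizer.transform on the validation and test splits)
X = vectorizer.fit_transform(df['cleaned_review'])
print(X.shape)  # (number of reviews, number of features)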
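For the neural branch, padding and truncating to a fixed length can be illustrated without any framework. The sketch below assumes sequences are already lists of integer token ids and that id 0 is reserved for padding. The function name `pad_or_truncate` is hypothetical.

```python
def pad_or_truncate(sequences, max_len, pad_id=0):
    """Return every sequence cut or right-padded to exactly max_len ids."""
    fixed = []
    for seq in sequences:
        seq = list(seq)[:max_len]                           # truncate long sequences
        fixed.append(seq + [pad_id] * (max_len - len(seq)))  # pad short ones on the right
    return fixed


encoded = [[5, 8, 2], [7], [4, 9, 1, 3, 6, 2]]
print(pad_or_truncate(encoded, max_len=4))
# → [[5, 8, 2, 0], [7, 0, 0, 0], [4, 9, 1, 3]]
```

Note that Keras's `pad_sequences` pads and truncates at the start of each sequence by default, whereas this sketch works at the end, so the two are not interchangeable without setting `padding` and `truncating`. The fitted `vectorizer` from the cell above can be persisted in the same spirit, for example with `joblib.dump(vectorizer, "tfidf_vectorizer.joblib")` and restored with `joblib.load` (the file name is an assumption).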

Part 2 – Exploratory Data Analysis (EDA)¶

  1. Class Distribution
    • Visualize the number of positive vs. negative reviews.
    • Compute descriptive statistics on review lengths (mean, median, IQR).
In [ ]:
import matplotlib.pyplot as plt

# Visualize class distribution
df['sentiment'].value_counts().plot(kind='bar')
plt.title("Class Distribution")
plt.ylabel("Number of Reviews")
plt.show()
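The remaining EDA items (length statistics, frequent tokens per class, and the length-sentiment correlation) might be computed along these lines. This is a standard-library sketch on toy data. The helper names and the example reviews are hypothetical, and tokens are split on whitespace for simplicity.

```python
import statistics
from collections import Counter


def length_stats(texts):
    """Mean, median, and interquartile range of whitespace-token counts."""
    lengths = [len(t.split()) for t in texts]
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "iqr": q3 - q1,
    }


def pearson(xs, ys):
    """Pearson correlation; with 0/1 labels this is the point-biserial coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def top_tokens(texts, k=20):
    """The k most frequent whitespace tokens across texts."""
    return Counter(tok for t in texts for tok in t.split()).most_common(k)


# Toy reviews and labels (hypothetical values, 1 = positive)
reviews = ["good", "really good film", "bad", "a bad and very dull film", "fine cast"]
labels = [1, 1, 0, 0, 1]

print(length_stats(reviews))
print(pearson([len(r.split()) for r in reviews], labels))
print(top_tokens([r for r, y in zip(reviews, labels) if y == 1], k=3))
```

A coefficient near zero would suggest that review length carries little information about sentiment. A box plot of lengths per class is one natural companion visualization.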

Part 3 – Baseline Traditional Models¶

  1. Logistic Regression & SVM
    • Train at least two linear models on your TF-IDF features.
    • Use cross-validation (≥ 5 folds) on the training set to tune at least one hyperparameter.
In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Logistic Regression (max_iter raised so the solver has room to converge)
logreg = LogisticRegression(max_iter=1000)
logreg_scores = cross_val_score(logreg, X, df['sentiment'], cv=5)

# Linear SVM (SVC() defaults to an RBF kernel, which is not a linear model)
svm = LinearSVC()
svm_scores = cross_val_score(svm, X, df['sentiment'], cv=5)

# Print per-fold accuracy for both models
print("Logistic Regression Scores:", logreg_scores)
print("SVM Scores:", svm_scores)
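Cross-validated tuning of one hyperparameter, followed by a single evaluation on held-out data, could look like the sketch below. It assumes scikit-learn is available. The toy corpus, the grid of `C` values, and the choice of F1 as the selection metric are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real splits (hypothetical data)
pos_words = ["great", "wonderful", "moving", "charming", "brilliant"]
neg_words = ["dull", "boring", "tedious", "clumsy", "awful"]
train_texts = [f"a {w} film" for w in pos_words * 2 + neg_words * 2]
train_labels = [1] * 10 + [0] * 10
test_texts = ["a great and moving story", "a dull and boring mess",
              "wonderful film", "awful film"]
test_labels = [1, 0, 1, 0]

# Keeping the vectorizer inside the pipeline refits TF-IDF within each fold,
# so no information from a validation fold leaks into the features
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",
)
search.fit(train_texts, train_labels)

# Evaluate the refit best model once on the held-out test set
pred = search.predict(test_texts)
proba = search.predict_proba(test_texts)[:, 1]
metrics = {
    "accuracy": accuracy_score(test_labels, pred),
    "precision": precision_score(test_labels, pred, zero_division=0),
    "recall": recall_score(test_labels, pred, zero_division=0),
    "f1": f1_score(test_labels, pred, zero_division=0),
    "roc_auc": roc_auc_score(test_labels, proba),
}
print(search.best_params_)
print(metrics)
```

Collecting one such `metrics` dictionary per model and passing the list to `pd.DataFrame` yields the single comparison table the task asks for. For `LinearSVC`, which has no `predict_proba`, `decision_function` scores can be passed to `roc_auc_score` instead.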

Part 4 – Neural Network Models¶

  1. Simple Feed-Forward
    • Build an embedding layer + a dense MLP classifier.
    • Train with frozen and with trainable (unfrozen) embeddings in separate runs and compare the results.
In [ ]:
import tensorflow as tf
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten

model = Sequential([
    Input(shape=(500,)),  # Sequences padded/truncated to 500 token ids
    Embedding(input_dim=5000, output_dim=128),  # Pass trainable=False for the frozen run
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
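For the convolutional classifier listed in the description, one possible starting point is sketched below, assuming TensorFlow/Keras is installed. The function name `build_cnn`, the filter count, kernel size, and dropout rate are illustrative defaults and would need to be justified or tuned for the actual data.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                     GlobalMaxPooling1D)
from tensorflow.keras.models import Sequential


def build_cnn(vocab_size=5000, max_len=500, embed_dim=128, filters=64, kernel_size=5):
    """Embedding -> 1D convolution -> global max pooling -> sigmoid output."""
    model = Sequential([
        Input(shape=(max_len,)),
        Embedding(input_dim=vocab_size, output_dim=embed_dim),
        # Each filter responds to a pattern spanning kernel_size consecutive tokens
        Conv1D(filters=filters, kernel_size=kernel_size, activation="relu"),
        # Keep the strongest response of each filter, wherever it occurs
        GlobalMaxPooling1D(),
        Dropout(0.5),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


cnn = build_cnn()
cnn.summary()

# Sanity check on a dummy batch of two all-padding sequences
dummy = np.zeros((2, 500), dtype="int32")
print(cnn.predict(dummy, verbose=0).shape)
```

A kernel size of 5 lets each filter see roughly a short phrase. Comparing a few sizes (for example 3, 4, and 5) is a common way to support the justification the task requests.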

Part 5 – Transfer Learning & Advanced Architectures¶

  1. Pre-trained Embeddings
    • Retrain one network using pre-trained GloVe (or FastText) embeddings.
    • Compare results against your from-scratch embedding runs.
In [ ]:
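# Assuming GloVe embeddings are loaded here
from tensorflow.keras import Input, Sequential
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, Flatten

embedding_matrix = ...  # Load pre-trained GloVe matrix, shape (vocab_size, embedding_dim)
vocab_size, embedding_dim = embedding_matrix.shape
# Build a separate model: model.add() would append the layer after the output layer
glove_model = Sequential([
    Input(shape=(500,)),
    Embedding(vocab_size, embedding_dim, embeddings_initializer=Constant(embedding_matrix), trainable=False),
    Flatten(), Dense(64, activation='relu'), Dense(1, activation='sigmoid')
])
glove_model.summary()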

Part 6 – Hyperparameter Optimization¶

  1. Search Strategy
    • Use a library (e.g., Keras Tuner, Optuna) to optimize at least two hyperparameters of one deep model.
    • Describe your search space and stopping criteria.
In [ ]:
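from keras_tuner import RandomSearch
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_model(hp):
    model = Sequential([
        Input(shape=(X.shape[1],)),  # One input per TF-IDF feature
        Dense(hp.Int('units', min_value=32, max_value=128, step=32), activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    # Second tuned hyperparameter: the learning rate
    lr = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    model.compile(optimizer=Adam(learning_rate=lr), loss='binary_crossentropy', metrics=['accuracy'])
    return model

tuner = RandomSearch(build_model, objective='val_accuracy', max_trials=10)
# Tune against a validation split, never the test set. Keras needs dense arrays, and
# validation_split takes the last 20% of rows, so shuffle the data beforehand.
tuner.search(X.toarray(), df['sentiment'].values, epochs=10, validation_split=0.2)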

Part 7 – Final Comparison & Error Analysis¶

  1. Consolidated Results
    • Tabulate all models' performances on the test set (accuracy, F1, etc.)
    • Identify the best-performing model and its hyperparameters.
In [ ]:
# Example consolidated results (placeholder values; replace with your measured metrics)
results = {
    'Model': ['Logistic Regression', 'SVM', 'Simple NN'],
    'Accuracy': [0.85, 0.83, 0.88],
    'F1 Score': [0.86, 0.84, 0.89]
}
pd.DataFrame(results)
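For the significance test and error analysis listed in the description of this part, a dependency-free sketch is given below. It implements the exact (binomial) form of McNemar's test on the examples where the two models disagree. The function names and the toy predictions are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant pairs of two classifiers.

    Returns (b, c, p_value), where b counts examples only model A got right
    and c counts examples only model B got right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    # Under H0 the discordant pairs follow Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def misclassified_indices(y_true, pred):
    """Positions where the prediction differs from the true label."""
    return [i for i, (t, p) in enumerate(zip(y_true, pred)) if t != p]


# Toy labels and predictions (hypothetical values)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
pred_b = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]

print(mcnemar_exact(y_true, pred_a, pred_b))
print(misclassified_indices(y_true, pred_a))
```

The indices returned by `misclassified_indices` can be used to pull the raw text, predicted label, and true label of each error from the test frame. With many discordant pairs, the chi-square approximation of McNemar's test is a common alternative to the exact form.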

Part 8 – Optional Challenge Extensions¶

  1. Data Augmentation
    • Implement data augmentation for text (back-translation, synonym swapping) and measure its impact.
In [ ]:
# Example for back-translation using a library
from googletrans import Translator

translator = Translator()
df['augmented_review'] = df['review'].apply(lambda x: translator.translate(x, src='en', dest='fr').text)
df['augmented_review'] = df['augmented_review'].apply(lambda x: translator.translate(x, src='fr', dest='en').text)
df.head()
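Synonym swapping, the other augmentation named above, can be illustrated offline. The sketch below assumes a small hand-written synonym table (entirely hypothetical). A real run would draw on a larger lexical resource and would need to check that replacements preserve sentiment.

```python
import random

# Hypothetical synonym table for illustration only
SYNONYMS = {
    "good": ["fine", "solid"],
    "bad": ["poor", "weak"],
    "film": ["movie"],
}


def synonym_swap(text, p=0.5, rng=None):
    """Replace each word that has a listed synonym with probability p."""
    rng = rng or random.Random(0)  # fixed default seed for reproducibility
    out = []
    for tok in text.split():
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return " ".join(out)


print(synonym_swap("a good film with a bad ending", p=1.0))
```

To measure the impact, augmented copies would be added to the training split only, and the model retrained and compared against the unaugmented baseline on the same untouched test set.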

Reflecting¶

Answer the following inference questions:

Part 1 – Data Ingestion & Preprocessing¶

  1. Data Loading

    • How do you ensure that your dataset is properly split into training, validation, and test sets, and why is class balance important during data splitting?
  2. Text Cleaning & Tokenization

    • What is the role of tokenization in text preprocessing, and how does it impact the model's performance?

Part 2 – Exploratory Data Analysis (EDA)¶

  1. Class Distribution

    • How does the class distribution (positive vs negative reviews) impact the model’s performance, and what strategies can be used if the dataset is imbalanced?
  2. Text Characteristics

    • What insights can be gained from visualizing word clouds for each sentiment class, and how can it improve feature engineering?

Part 3 – Baseline Traditional Models¶

  1. Logistic Regression & SVM

    • Why do you use cross-validation when training models like logistic regression or SVM, and how does it help prevent overfitting?
  2. Random Forest & Gradient Boosting

    • What role does feature importance play in interpreting Random Forest or XGBoost models?

Part 4 – Neural Network Models¶

  1. Simple Feed-Forward

    • Why are embeddings frozen when training neural networks with pre-trained embeddings, and how does freezing affect model performance?
  2. Convolutional Text Classifier

    • What is the intuition behind using convolutional layers for text classification tasks, and why might they outperform traditional fully connected layers?

Part 5 – Transfer Learning & Advanced Architectures¶

  1. Pre-trained Embeddings

    • How do pre-trained word embeddings like GloVe or FastText improve model performance compared to training embeddings from scratch?
  2. Transformer Fine-Tuning

    • How does the self-attention mechanism in Transformer models like BERT improve performance on text data?

Part 6 – Hyperparameter Optimization¶

  1. Search Strategy

    • How does hyperparameter optimization help improve the model’s performance, and what challenges arise when selecting an optimal search space?
  2. Results Analysis

    • What do the validation loss and accuracy tell you about the model’s generalization ability?

Part 7 – Final Comparison & Error Analysis¶

  1. Consolidated Results

    • How do you compare models with different architectures (e.g., logistic regression vs. BERT) to select the best model for deployment?
  2. Error Analysis

    • What insights can you gain from studying model misclassifications, and how might this influence future improvements to the model?

Part 8 – Optional Challenge Extensions¶

  1. Data Augmentation

    • How does back-translation or synonym swapping as text augmentation improve model generalization?
  2. Sentiment Lexicon

    • How might integrating sentiment lexicons like VADER improve the sentiment classification model, and what are the challenges of using lexicon-based approaches alongside machine learning models?