## Overview
This repository contains several NLP projects and experiments, including:
- Spam classification using large language models (LLMs)
- Sentiment analysis on tweets
- Topic modeling on the 20 Newsgroups dataset
- A Streamlit chatbot interface for local LLM inference
## Contents
### Notebooks & Scripts
- `LLM_Usage_part_1.ipynb`: Classifies emails as spam or ham using a Hugging Face LLM (zero-shot classification) on the `spam.csv` dataset. Evaluates accuracy, F1, recall, and the confusion matrix (a minimal sketch follows this list).
- `LLM_Usage_part_2.py`: Streamlit app for a local chatbot using the Flan-T5-small model. Allows interactive Q&A with an LLM running locally.
- `Sentiment_analysis_part_1.ipynb`: Performs sentiment analysis on tweets using the VADER sentiment analyzer. Includes text cleaning and labeling (sketched below).
- `Sentiment_analysis_part_2.ipynb`: Performs sentiment analysis on tweets using a Hugging Face transformer pipeline. Includes text cleaning and batch inference.
- `Topic_modeling.ipynb`: Topic modeling on the 20 Newsgroups dataset using LDA and NMF. Visualizes topics with bar charts and word clouds (sketched below).
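A minimal sketch of the zero-shot approach in `LLM_Usage_part_1.ipynb`. The model name (`facebook/bart-large-mnli`) and the `v1`/`v2` column layout of `spam.csv` are assumptions for illustration, not details confirmed by the notebook:

```python
# Hedged sketch: zero-shot spam/ham classification with a Hugging Face pipeline.
# Assumes the common Kaggle spam.csv layout (v1 = label, v2 = text); the
# notebook's actual model and column names may differ.
import pandas as pd
from transformers import pipeline

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(df["text"].iloc[0], candidate_labels=["spam", "ham"])

# The pipeline returns candidate labels sorted by score; the first is the prediction.
print(result["labels"][0], result["scores"][0])
```

Predictions over the full column can then be scored with `sklearn.metrics` (accuracy, F1, recall, confusion matrix), as the notebook does.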
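For `Sentiment_analysis_part_1.ipynb`, a minimal VADER sketch; the `text` column name and the ±0.05 compound-score thresholds are illustrative assumptions:

```python
# Hedged sketch: VADER sentiment labeling for tweets.
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
tweets = pd.read_csv("tweets-data.csv")

def label_sentiment(text: str) -> str:
    # The compound score is VADER's normalized summary score in [-1, 1];
    # the 0.05 cutoffs below are the conventional defaults, not the repo's.
    score = analyzer.polarity_scores(text)["compound"]
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"

tweets["sentiment"] = tweets["text"].astype(str).apply(label_sentiment)
print(tweets["sentiment"].value_counts())
```

`Sentiment_analysis_part_2.ipynb` swaps the analyzer for a transformer, e.g. `pipeline("sentiment-analysis")`, applied to the cleaned text in batches.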
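And a compact LDA sketch in the spirit of `Topic_modeling.ipynb`; it fetches 20 Newsgroups via scikit-learn rather than the repo's pickled `newsgroups` file, and the vectorizer settings and topic count are assumptions:

```python
# Hedged sketch: LDA topic modeling on 20 Newsgroups with scikit-learn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data

# Bag-of-words counts; drop very common and very rare terms.
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(X)

# Print the top words per topic (the raw material for bar charts / word clouds).
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-8:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")
```

NMF follows the same pattern with `TfidfVectorizer` and `sklearn.decomposition.NMF`.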
### Datasets
- `spam.csv`: SMS messages labeled as `spam` or `ham` (not spam). Used for spam classification.
- `tweets-data.csv`: Tweets with metadata (date, likes, hashtags, etc.). Used for sentiment analysis.
- `newsgroups`: Binary (pickled) file containing the 20 Newsgroups dataset or similar. Used for topic modeling.
## Setup & Requirements
Install the required Python packages:
```bash
pip install transformers pandas tqdm nltk vaderSentiment scikit-learn matplotlib wordcloud streamlit torch
```
For notebooks using NLTK, you may need to download resources:
```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
```
## Usage
- Open the Jupyter notebooks (`.ipynb`) for step-by-step code and explanations.
- Run `LLM_Usage_part_2.py` with Streamlit: `streamlit run LLM_Usage_part_2.py`
- Place the datasets (`spam.csv`, `tweets-data.csv`, `newsgroups`) in the project root.
## Notes
- The `newsgroups` file is binary and should be loaded with Python's `pickle` module (see the sketch below).
- Some scripts may require GPU or MPS support for faster inference.
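A hedged loading sketch; the exact object stored in `newsgroups` (a list of documents, a scikit-learn `Bunch`, etc.) isn't documented here, so inspect it after unpickling:

```python
import pickle

# Load the binary `newsgroups` file shipped with the repo.
with open("newsgroups", "rb") as f:
    newsgroups = pickle.load(f)

# Contents are undocumented; check the type before using it.
print(type(newsgroups))
```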
Feel free to adapt or extend these notebooks for your own NLP experiments!