{"id":81,"date":"2025-07-03T12:00:24","date_gmt":"2025-07-03T10:00:24","guid":{"rendered":"https:\/\/lerecrutementdusiecle.fr\/?page_id=81"},"modified":"2025-07-03T12:05:49","modified_gmt":"2025-07-03T10:05:49","slug":"natural-language-processing-spam-detection-sentiment-analysis-and-topic-modeling","status":"publish","type":"page","link":"https:\/\/lerecrutementdusiecle.fr\/index.php\/natural-language-processing-spam-detection-sentiment-analysis-and-topic-modeling\/","title":{"rendered":"Natural Language Processing: Spam Detection, Sentiment Analysis, and Topic Modeling"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Overview<\/h2>\n\n\n\n<p>This repository contains several NLP projects and experiments, including:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Spam classification using large language models (LLMs)<\/li>\n\n\n\n<li>Sentiment analysis on tweets<\/li>\n\n\n\n<li>Topic modeling on the 20 Newsgroups dataset<\/li>\n\n\n\n<li>A Streamlit chatbot interface for local LLM inference<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Contents<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Notebooks &amp; Scripts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>LLM_Usage_part_1.ipynb<\/strong>: Classifies SMS messages as spam or ham using a Hugging Face LLM (zero-shot classification) on the&nbsp;<code>spam.csv<\/code>&nbsp;dataset. Reports accuracy, F1, recall, and a confusion matrix.<\/li>\n\n\n\n<li><strong>LLM_Usage_part_2.py<\/strong>: Streamlit app for a local chatbot using the Flan-T5-small model. Allows interactive Q&amp;A with an LLM running locally.<\/li>\n\n\n\n<li><strong>Sentiment_analysis_part_1.ipynb<\/strong>: Performs sentiment analysis on tweets using the VADER sentiment analyzer. Includes text cleaning and labeling.<\/li>\n\n\n\n<li><strong>Sentiment_analysis_part_2.ipynb<\/strong>: Performs sentiment analysis on tweets using a Hugging Face transformer pipeline. 
Includes text cleaning and batch inference.<\/li>\n\n\n\n<li><strong>Topic_modeling.ipynb<\/strong>: Runs topic modeling on the 20 Newsgroups dataset using LDA and NMF. Visualizes topics with bar charts and word clouds.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Datasets<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>spam.csv<\/strong>: SMS messages labeled as&nbsp;<code>spam<\/code>&nbsp;or&nbsp;<code>ham<\/code>&nbsp;(not spam). Used for spam classification.<\/li>\n\n\n\n<li><strong>tweets-data.csv<\/strong>: Tweets with metadata (date, likes, hashtags, etc.). Used for sentiment analysis.<\/li>\n\n\n\n<li><strong>newsgroups<\/strong>: Binary (pickled) file containing the 20 Newsgroups dataset or similar. Used for topic modeling.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Setup &amp; Requirements<\/h2>\n\n\n\n<p>Install the required Python packages:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install transformers pandas tqdm nltk vaderSentiment scikit-learn matplotlib wordcloud streamlit torch\n<\/code><\/pre>\n\n\n\n<p>For notebooks using NLTK, you may need to download resources:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import nltk\nnltk.download('stopwords')\nnltk.download('punkt')\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Usage<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Open the Jupyter notebooks (<code>.ipynb<\/code>) for step-by-step code and explanations.<\/li>\n\n\n\n<li>Run&nbsp;<code>LLM_Usage_part_2.py<\/code>&nbsp;with Streamlit: <code>streamlit run LLM_Usage_part_2.py<\/code><\/li>\n\n\n\n<li>Place the datasets (<code>spam.csv<\/code>,&nbsp;<code>tweets-data.csv<\/code>,&nbsp;<code>newsgroups<\/code>) in the project root.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Notes<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The&nbsp;<code>newsgroups<\/code>&nbsp;file is binary and should be loaded with Python&rsquo;s&nbsp;<code>pickle<\/code>&nbsp;module.<\/li>\n\n\n\n<li>Some scripts may require GPU or 
MPS support for faster inference.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Feel free to adapt or extend these notebooks for your own NLP experiments!<\/p>\n\n\n\n<p>The repo is accessible <a href=\"https:\/\/github.com\/yohehehel\/NLP-assignement-3\">HERE<\/a>\u00a0<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Overview This repository contains several NLP projects and experiments, including: Contents Notebooks &amp; Scripts Datasets Setup &amp; Requirements Install the required Python packages: For notebooks using NLTK, you may need to download resources: Usage Notes Feel free to adapt or extend these notebooks for your own NLP experiments! The repo is accessible HERE\u00a0<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_themeisle_gutenberg_block_has_review":false,"footnotes":""},"class_list":["post-81","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/lerecrutementdusiecle.fr\/index.php\/wp-json\/wp\/v2\/pages\/81","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lerecrutementdusiecle.fr\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/lerecrutementdusiecle.fr\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/lerecrutementdusiecle.fr\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lerecrutementdusiecle.fr\/index.php\/wp-json\/wp\/v2\/comments?post=81"}],"version-history":[{"count":2,"href":"https:\/\/lerecrutementdusiecle.fr\/index.php\/wp-json\/wp\/v2\/pages\/81\/revisions"}],"predecessor-version":[{"id":87,"href":"https:\/\/lerecrutementdusiecle.fr\/index.php\/wp-json\/wp\/v2\/pages\/81\/revisions\/87"}],"wp:attachment":[{"href":"https:\/\/lerecrutementdusiecle.fr\/index.php\/wp-json\/wp\/v2\/media?parent=81"}],"curies":[{"name":"wp","
href":"https:\/\/api.w.org\/{rel}","templated":true}]}}