Data Science Portfolio

Generative AI, Machine Learning, Data Science, Data Analysis, Deep Learning

Retrieval Augmentated Generation (RAG) Projects

Problem Statement: The GitHub repository RAG_Projects by Aarushi17Gupta contains several sub-projects related to Retrieval-Augmented Generation (RAG). These sub-projects include:

Investment_Bank_LLM
Medical_LLM
Merger_Retriever_LOTR
PetCare_AI
Pet_Chatbot_LLM
haystack_chainlit

News Recommender System

Problem Statement: iPrint being a cutting-edge company, is trying to solve the issue of revenue leakage by personalising user tastes and introducing new content to its users at the start of the day on the home page of the application. iPrint is planning to assess these recommendations by tracking whether the user clicks on those items or not.

Data pre-processing, EDA, Recommendation techniques such as User-based collaborative filtering, Item-based collaborative filtering, Content-based filtering, ALS (Alternation Least Squares), Hybrid recommendation system were performed Model evaluation - RMSE, MAE, precision@k and global precision@k to assess the overall performance of the recommendation system.

Plant Seedlings Classification

Problem Statement: The Aarhus University Signal Processing group, in collaboration with University of Southern Denmark, has recently released a dataset containing images of approximately 960 unique plants belonging to 12 species at several growth stages.

We are provided with a training set and a test set of images of plant seedlings at various stages of grown. Each image has a filename that is its unique id. The dataset comprises 12 plant species. The goal is to create a classifier capable of determining a plant’s species from a photo.

DMSP Particle Precipitate Flux Prediction (Mesoscale)

Problem Statement: We advance the modeling capability of electron particle precipitation from the magnetosphere to the ionosphere through a new database and use of deep learning (DL) tools to gain utility from those data. We create and apply a new framework for space weather model evaluation that culminates previous guidance from across the solar‐terrestrial research community.

The target feature is ELE_TOTAL_ENERGY_FLUX which is a continuous variable. The task is to predict this variable based on the other 154 features step-by-step by going through each day’s task. The scoring metric is RMSE or R2 score.

Patient Survival Prediction

Problem Statement: Getting a rapid understanding of the context of a patient’s overall health has been particularly important during the COVID-19 pandemic as healthcare workers around the world struggle with hospitals overloaded by patients in critical condition. Intensive Care Units (ICUs) often lack verified medical histories for incoming patients. A patient in distress or a patient who is brought in confused or unresponsive may not be able to provide information about chronic conditions such as heart disease, injuries, or diabetes. Medical records may take days to transfer, especially for a patient from another medical provider or system. Knowledge about chronic conditions can inform clinical decisions about patient care and ultimately improve patient’s survival outcomes.

The target feature is ‘hospital_death’ which is a binary variable. The task is to classify this variable based on the other 84 features step-by-step by going through each day’s task. The scoring metric is Accuracy/Area under ROC curve.

Gesture Recognition

Problem statement: As a data scientist at a home electronics company which manufactures state of the art smart televisions. We want to develop a cool feature in the smart-TV that can recognise five different gestures performed by the user which will help users control the TV without using a remote. The training data consists of a few hundred videos categorized into one of the five classes. Each video (typically 2-3 seconds long) is divided into a sequence of 30 frames (images). These videos have been recorded by various people performing one of the five gestures in front of a webcam - similar to what the smart TV will use. Our task is to train different models on the ‘train’ folder to predict the action performed in each sequence or video.

Melanoma Detection

Problem statement: To build a CNN based model which can accurately detect melanoma. Melanoma is a type of cancer that can be deadly if not detected early. It accounts for 75% of skin cancer deaths. A solution which can evaluate images and alert the dermatologists about the presence of melanoma has the potential to reduce a lot of manual effort needed in diagnosis.

The dataset consists of 2357 images of malignant and benign oncological diseases, which were formed from the International Skin Imaging Collaboration (ISIC). All images were sorted according to the classification taken with ISIC, and all subsets were divided into the same number of images, with the exception of melanomas and moles, whose images are slightly dominant.

Telecom Churn Case Study

In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.

For many incumbent operators, retaining high profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyze customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn.

The goal is to build a machine learning model that is able to predict churning customers based on the features provided for their usage.

House Price Prediction using Regularization (Ridge and Lasso Regression)

A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them on at a higher price. For the same purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file below.

The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.

A US bike-sharing provider BoomBikes has recently suffered considerable dips in their revenues due to the ongoing Corona pandemic. It is required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer’s expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

Analysis of Mortgage-Backed Securities Prepayment Ability

A Machine Learning Project that can predict whether the customer will default the loan or not. The primary dataset is Feddie Mac’s Single Family Loan-Level Dataset.This dataset contains approximately 500,000 mortgages originated between January 1, 1999 and September 30th, 2020. The dataset mainly consists of fixed-rate single family mortgages loans with a maturity of thirty years.

Exploratory Data Analysis on LendingClub Case Study

The data given contains information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. through Exploratory Data Analysis (EDA) . Thus, we have to understand how consumer attributes and loan attributes influence the tendency of default.

Recommendation System Deployment Through Streamlit

A General purpose Recommender System that can recommend any items such as Movies, Receipes, Books and MOOCs. Used cosine similarity, TF-IDF Vectorization and K Nearest Neighbours to recommend top items from a list of large items. At the end, deployed the machine learning model using Streamlit Sharing to create a Recommender System application.

Twitter Sentiment Analysis

Used sentiment analysis to analyze the tweets and segregate them into various segments like positive, negative and neutral tweets with a model accuracy of 82%.

Yelp Reviews Sentiment Analysis

Used EDA Augmentation and various algorithms such as Random Forest classifier, Logistic Regression, XGBoost, Naive Bayes and Support Vector Machine (SVD) for comparing accuracies and other evaluation metrics.

Document Similarity Measures

Measured similarity between text documents by treating the text the words in the documents as vectors and using various similarity measures such as FuzzyWuzzy, Cosine Similarity and Jaccard Similarity.