
Projects

Predicting Household Energy Consumption
- Analyzed and cleaned noisy US household survey data and stored it in a SQLite database.
- Discovered nonlinearity in the data by checking the assumptions of linear regression with a residual plot.
- Developed an XGBoost regressor with an R² of 0.916 and served the model via a Flask REST API on AWS Elastic Beanstalk.
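The residual-based linearity check can be sketched as below; the data is synthetic (the actual survey data is not reproduced here), and scikit-learn's GradientBoostingRegressor stands in for XGBoost.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for the household survey data: the target depends
# nonlinearly on the features, as the residual check will reveal.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))
y = 2.0 * X[:, 0] ** 2 + 5.0 * np.sin(X[:, 1]) + rng.normal(0, 1, 1000)

lin = LinearRegression().fit(X, y)
fitted = lin.predict(X)
residuals = y - fitted
# A curved pattern in residuals vs. fitted values signals nonlinearity;
# here the curvature is measured numerically as a quadratic trend.
curve = np.polyfit(fitted, residuals, 2)[0]
print(f"quadratic trend in residuals: {curve:.5f}")  # nonzero -> nonlinear

# Nonlinearity motivates a tree-based ensemble (XGBoost in the project;
# GradientBoostingRegressor is used here as a stand-in).
gbr = GradientBoostingRegressor(random_state=0).fit(X, y)
print(f"linear R^2: {r2_score(y, fitted):.3f}, "
      f"boosted R^2: {r2_score(y, gbr.predict(X)):.3f}")
```

The boosted model's higher R² on the same data mirrors the project's finding that a nonlinear model was needed.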

Graph Reasoning Prefix for a GPT-2 Based Chatbot
- Prefix-tuned the GPT-2 model with a graph reasoning prefix to build a task-oriented chatbot for Microsoft applications.
- Leveraged Microsoft's SOLOIST framework, an end-to-end dialogue system, to prefix-tune the GPT-2 model.
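The core idea of prefix tuning can be sketched with NumPy: the frozen model's attention keys are extended with a small set of trainable prefix vectors, and only those prefix parameters are updated during training. All shapes and names below are illustrative, not the actual SOLOIST/GPT-2 implementation.

```python
import numpy as np

d, seq_len, prefix_len = 8, 5, 3
rng = np.random.default_rng(0)
keys = rng.normal(size=(seq_len, d))          # frozen GPT-2 keys for the input
prefix_kv = rng.normal(size=(prefix_len, d))  # trainable prefix (the only new params)

# Attention now attends over prefix positions plus input positions:
extended_keys = np.concatenate([prefix_kv, keys], axis=0)
query = rng.normal(size=(d,))
scores = extended_keys @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights.shape)  # attention over 3 prefix + 5 input positions
```

During training, gradients flow only into `prefix_kv`, which is what makes prefix tuning far cheaper than full fine-tuning.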

EfficientNet With Attention Mechanism
- Designed a lightweight, scalable model to detect COVID-19 pneumonia from X-ray images by incorporating an imaging-oriented attention mechanism into the EfficientNetB0 model.
- Trained the model in a Siamese neural network framework with triplet loss to separate closely related pneumonia types, outperforming baseline models with an F1 score of 94.94%.
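The triplet loss used above can be sketched in NumPy; the embeddings here are toy stand-ins for the EfficientNetB0 feature vectors, and the margin value is an illustrative hyperparameter.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pulls same-class embeddings together and pushes different
    pneumonia types at least `margin` further apart."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

a = np.array([0.0, 0.0])   # anchor embedding
p = np.array([0.1, 0.0])   # same class, already close
n = np.array([2.0, 0.0])   # different class, already far
print(triplet_loss(a, p, n))  # 0.0: negative is margin-further than positive
```

When the negative is not yet far enough from the anchor, the loss is positive and training pushes the embeddings apart.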

Predicting Stock Movements
- Analyzed and preprocessed the stock prices and tweets of 88 companies and filtered the data based on relative stock movement percentages.
- Implemented and fine-tuned the BERT base model by reinitializing its top 3 layers and applied grouped layer-wise learning rate decay to improve the baseline accuracy by 7%.
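Grouped layer-wise learning rate decay can be sketched as follows: later encoder layers get the base learning rate, and earlier layers get geometrically smaller ones, with layers sharing a rate within each group. The base rate, group size, and decay factor below are illustrative choices, not the project's actual values.

```python
def layerwise_lrs(base_lr=2e-5, n_layers=12, group_size=3, decay=0.9):
    """Per-layer learning rates for a 12-layer BERT-base encoder.
    Layers are grouped (0-2, 3-5, ...) and each group shares one
    decayed rate; the deepest group keeps the full base_lr."""
    n_groups = n_layers // group_size
    lrs = {}
    for layer in range(n_layers):
        group = layer // group_size
        lrs[f"encoder.layer.{layer}"] = base_lr * decay ** (n_groups - 1 - group)
    return lrs

lrs = layerwise_lrs()
print(lrs["encoder.layer.0"], lrs["encoder.layer.11"])
```

In practice these rates would be passed to the optimizer as parameter groups, one group per set of layers.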

Customer Segmentation using K-Means Clustering
- Performed dimensionality reduction with Principal Component Analysis (PCA) and used the elbow method to find the number of clusters for the K-means clustering algorithm.
- Segmented customers in an unsupervised setting, grouping them by their demographic factors.
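The PCA-plus-elbow pipeline can be sketched with scikit-learn; the data below is a synthetic stand-in for the customer demographic table, with three planted groups.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic customer table: 4 demographic-style features, 3 true groups
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 5]])
X = np.vstack([c + rng.normal(0, 0.5, size=(100, 4)) for c in centers])

# Reduce to 2 principal components before clustering
X2 = PCA(n_components=2).fit_transform(X)

# Elbow method: within-cluster SSE (inertia) for k = 1..6
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X2).inertia_
            for k in range(1, 7)]
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
# The "elbow" is where the drop in inertia flattens; with 3 planted
# groups it appears at k = 3.
print([round(v, 1) for v in inertias])
```

The chosen k is then used for a final K-means fit whose labels become the customer segments.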

Analyzing Broken Links on Stack Overflow
- Analyzed the complete Stack Overflow dataset (100 GB+) using PySpark and Google BigQuery.
- Conducted link availability tests on 14.6 million links using multithreading on AWS EC2 instances.
- Estimated the time taken to identify and fix broken links, and analyzed the characteristics of the post owners and the impact of broken links on end users.
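The multithreaded link-availability test can be sketched with a thread pool. A real run would issue an HTTP request per URL (e.g. via urllib); the stub below marks URLs containing "dead" as broken so the sketch runs without network access, and the URLs themselves are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def check_link(url):
    """Stand-in for an actual HTTP status check on one URL."""
    return url, "dead" not in url

urls = [f"https://example.com/{tag}" for tag in ("ok1", "dead1", "ok2", "dead2")]

# Each URL is checked in a worker thread; threads suit this workload
# because it is I/O-bound (waiting on remote servers), not CPU-bound.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(check_link, urls))

broken = [u for u, alive in results.items() if not alive]
print(f"{len(broken)} of {len(urls)} links broken")
```

At the project's scale (14.6 million links), the same pattern would be sharded across EC2 instances, each running its own pool.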

BERT Question Answering Model on the SQuAD Dataset
- Fine-tuned the BERT base model via transfer learning on the SQuAD question answering dataset, achieving an F1 score of 0.86 using Google Colab's GPU.
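The F1 metric reported above is the SQuAD-style token-overlap F1 between a predicted answer span and the reference answer; a minimal sketch (without SQuAD's full text normalization, such as article and punctuation stripping):

```python
from collections import Counter

def squad_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("Denver Broncos", "the Denver Broncos"))  # 0.8
```

The dataset-level score is the average of this per-question F1 (taking the max over reference answers when several are given).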