Cracking the Code: GSoC 2024 Project with Red Hen Lab

Published On Mon May 06 2024

By Tarun Jain

Exciting news! I am thrilled to share that I am part of the Red Hen Lab organization for Google Summer of Code 2024. This post serves as an introduction, and I will share weekly updates starting with Coding Phase 1.

Overview

Large Language Models (LLMs) play a crucial role in the advancement of Artificial Intelligence. Trained on Red Hen Lab's vast news archive, an LLM can provide insights on news from around the globe. One key requirement for such a model is multilingual support, so that it can handle news and queries in more than one language.

Project Details

Red Hen Lab possesses an extensive news archive that has undergone processing through speech and natural language pipelines in past Google Summer of Code projects. The objective is to train a Large Language Model using this rich television news data, enabling it to answer global news-related queries. Moreover, the aim is to make this model accessible to a wider open-source community.

Technical Approach

My first task is to understand the news transcript data and structure the dataset in the Self-Instruct format used by Alpaca-style datasets. I will then train a tokenizer on this data so that the language model can better capture the vocabulary and context of the news transcripts.
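
To make the target format concrete, here is a minimal sketch of what one Alpaca-style record could look like. The three keys (instruction, input, output) follow the published Alpaca dataset layout. The helper names, the prompt rendering, and the example text are hypothetical illustrations, not the project's actual schema or data.

```python
import json


def make_record(instruction, transcript_snippet, answer):
    """Build one Alpaca-style record with instruction / input / output keys."""
    return {
        "instruction": instruction.strip(),
        "input": transcript_snippet.strip(),
        "output": answer.strip(),
    }


def to_prompt(record):
    """Render a record as a simple Alpaca-style training prompt (assumed layout)."""
    if record["input"]:
        return (
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            f"### Response:\n{record['output']}"
        )
    # Records without extra context skip the Input section.
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )


# Hypothetical example text, not taken from the Red Hen archive.
record = make_record(
    "Summarize the news segment in one sentence.",
    "  The city council voted on Tuesday to expand the bus network.  ",
    "The city council approved an expansion of the bus network.",
)

# One JSON object per line (JSONL); ensure_ascii=False keeps non-English text readable.
jsonl_line = json.dumps(record, ensure_ascii=False)
print(jsonl_line)
print(to_prompt(record))
```

Keeping the transcript excerpt in the input field and the question in the instruction field means the same excerpt can be reused with several different instructions.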

If necessary, I will fall back on the base model's tokenizer instead of training a new one. The initial strategy is to train the model with parameter-efficient fine-tuning and to evaluate its performance with the lm-evaluation-harness library on benchmark datasets such as HellaSwag, MMLU, and TruthfulQA.
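
To illustrate why parameter-efficient fine-tuning is attractive, the sketch below counts trainable parameters for one weight matrix under LoRA, one common parameter-efficient method. This post does not say which method the project will use, so LoRA is an assumption here, and the layer size and rank are illustrative values only. LoRA freezes the original weight and learns two small matrices of shapes (d_out x r) and (r x d_in), so the trainable count is r * (d_in + d_out).

```python
def full_params(d_out, d_in):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters in the two low-rank LoRA factors: (d_out x r) and (r x d_in)."""
    return rank * (d_in + d_out)


# Illustrative numbers (assumptions): a 4096 x 4096 projection and rank 8.
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"fraction trained: {lora / full:.4%}")
```

With these numbers the low-rank factors hold 65,536 parameters against 16,777,216 for the full matrix, which is roughly 0.39% of the original.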

Future Plans

Following the benchmarking experiments, the goal is to build a multilingual chatbot with LangChain that can answer questions on global news topics. Stay tuned for upcoming articles where I will share my progress throughout Google Summer of Code 2024.

Connect with Me

LinkedIn: https://www.linkedin.com/in/jaintarun75/

GitHub: https://github.com/lucifertrj/

Twitter: https://twitter.com/TRJ_0751