Introduction #
Ever since I started college, I have had my eyes set on Google Summer of Code (GSoC). Towards the end of 2022, I began searching for organizations focused on machine learning, a field I'm really passionate about. That's when I discovered Machine Learning for Science (ML4SCI); their past machine learning projects fascinated me.

After weeks of hard work and dedication, I finally received the news that I had been selected for this prestigious program. It was the culmination of my efforts, from tackling a challenging task during the selection process to crafting a compelling proposal for the given problem statement.
Project Name #
Symbolic empirical representation of squared amplitudes in high-energy physics.
Project Abstract #
In particle physics, a cross section is a measure of the likelihood that particles will interact or scatter with one another when they collide. It is a fundamental quantity used to describe the probability of a particular interaction occurring between particles. Determining a cross section requires computing the squared amplitude and averaging and summing over the internal degrees of freedom of the particles involved. This project aims to apply symbolic deep learning techniques to predict squared amplitudes, and ultimately cross sections, in high-energy physics.
My Work #
Here’s what I’ve been working on during the GSoC program. The dataset consists of amplitudes and their corresponding Feynman diagrams, both represented as mathematical expressions, and the task is to predict another mathematical expression: the squared amplitude. The data comes in two flavors: QED (Quantum Electrodynamics) and QCD (Quantum Chromodynamics). To make this task easier, I created a PyTorch library called SYMBA_Pytorch. It handles training different types of machine learning models, specifically transformers (encoder-decoder architecture), and uses them to predict the squared amplitude. If you want to visualize how all of this works, I’ve also put together a workflow diagram to make it easier to understand.
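Before a transformer can consume a symbolic amplitude, the expression string has to be split into tokens. As a rough illustration of the idea (this is not the actual SYMBA_Pytorch tokenizer, just a sketch using a regular expression), identifiers, numbers, and operators can each become one token:

```python
import re

# Rough sketch of expression tokenization (NOT the SYMBA_Pytorch
# implementation): split a symbolic amplitude string into
# identifier, number, and operator/punctuation tokens.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|\d+|\*\*|[+\-*/^(),]")

def tokenize(expr):
    """Return the list of tokens; whitespace is skipped."""
    return TOKEN_RE.findall(expr)

# A made-up QED-like expression for illustration.
print(tokenize("-1/3*e**2*(p_1*p_2 + m_e**2)"))
# ['-', '1', '/', '3', '*', 'e', '**', '2', '*', '(',
#  'p_1', '*', 'p_2', '+', 'm_e', '**', '2', ')']
```

The resulting token sequences for the amplitude and the squared amplitude form the source and target of a sequence-to-sequence translation problem, which is what the encoder-decoder transformers are trained on.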

I’ve documented all my work in this GitHub repository, and in addition, I’ve included a code structure diagram to help you easily grasp how the code is organized.
SYMBA_Pytorch
|__ datasets
| |__ Data.py # Code for dataset cleaning, tokenization, etc.
| |__ __init__.py
| |__ registry.py # All datasets must be registered here
| |__ utils.py # Helper modules
|__ engine
| |__ __init__.py
| |__ config.py # All the required configuration for training of models.
| |__ plotter.py # Used for plotting the loss and accuracy
| |__ predictor.py # Used for prediction
| |__ trainer.py # Used for training of models.
| |__ utils.py # Helper modules
|__ models
| |__ BART.py # Code for BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
| |__ LED.py # Code for Longformer Encoder Decoder
| |__ __init__.py
| |__ registery.py # All models must be registered here
| |__ seq2seq_transformer.py # Code for Sequence to sequence Transformer
|__ runs
| |__ bart-base_trainer.sh # Script to run bart-base model using terminal
| |__ seq2seq_trainer.sh # Script to run sequence to sequence transformer using terminal
|__ symba_trainer.py # Used inside bash script for training.
|__ symba_tuner.py # Used for hyperparameter optimization using Optuna
|__ symba_example.ipynb # Example Notebook.
Future Work #
As we move forward, our focus will be on refining and extending the project. Here’s a glimpse of our future work:
- Explore data augmentation techniques to improve model robustness.
- Experiment with advanced architectures such as Longformer for better performance on long expressions.
- Extend the model’s applicability to diverse high-energy physics datasets, enhancing its versatility and utility in the field.
Acknowledgement #
This project is supported by the Google Summer of Code program and Machine Learning for Science (ML4SCI). First, I want to thank my mentors, Abdulhakim Alnuqaydan and Harrison Prosper, who guided me through the whole project. Second, I want to thank Dr. Sergei Gleyzer and Eric Reinhardt, who provided extremely helpful suggestions during the project reviews. Stay tuned for updates on our progress and further discoveries in the realm of high-energy physics and machine learning.