Introduction
The volume of research articles on platforms like arXiv can be overwhelming for scholars trying to stay updated with the latest findings. This tutorial will guide you through the process of summarizing long-form arXiv articles into key points and identifying similar papers. These actions can help researchers quickly grasp the essence of a paper and contextualize it within the broader academic discourse, ensuring a comprehensive understanding and avoiding redundant research efforts.
This article has two parts:
- Generating the embeddings and building the Annoy index
- Querying index to get related papers and generating summaries
Part 1: Building the Annoy Index
Prerequisites
Before you begin, make sure you have Python 3.9 and pip installed on your system.
Steps
Python Packages Installation
Install the following Python packages using pip:
pip install numpy pandas sentence-transformers annoy flask openai
Alternatively, you can create a requirements.txt
file and install the packages using the command:
pip install -r requirements.txt
with the following contents:
numpy
pandas
sentence-transformers
annoy
flask
openai
Kaggle arXiv dataset
To proceed, create a Kaggle account and download the arXiv dataset with limited metadata from this dataset. After downloading, unzip the file to find a JSON file.
Preprocess the Data
Load your dataset and preprocess it into the desired format. Here, we're reading a JSON file containing ArXiv metadata and concatenating titles and abstracts with a '[SEP]' separator:
Generate Embeddings using SBERT
Initialize the SBERT model and generate embeddings for your preprocessed data. We're using the allenai-specter model, specially trained for scientific papers. For approximately ~2 million articles of arXiv up to December 2022, it took:
- 8+ hours on RTX3080
- 6 hours on RTX4090
- 1.5 hours on A100 (cloud)
Adjust the batch_size
based on your GPU memory:
GPU Time to generate embeddings:
RTX 3080 (16GB) - 8 hours
RTX 4090 (16GB) - 5 hours
A100 (80GB) (on cloud) - 1 hour
Index Embeddings with Annoy
Once you have the embeddings, the next step is to index them for fast similarity search. We're using the Annoy library because of its efficiency:
Alternatively, if you do not have a GPU and are okay with the arXiv snapshot up to December 2022, you can use public S3 URLs to download the necessary datasets:
dataset description
S3 URL
annoy_index.ann
Annoy Index of 2M arXiv articles using the file arxiv-metadata-oai-snapshot.json
S3 URL: link
arxiv-metadata-oai-snapshot.json
Dataset of 2M Arxiv articles downloaded from Kaggle
S3 URL: link
embeddings.npy
Embedding numpy file. Contains serialized embeddings of all 2M articles
S3 URL: link
Part 2: Summarize and Search for Similar Articles on Arxiv
Description
This tutorial will guide you through the process of summarizing a long-form arXiv article into key points, generating an idea based on it, and identifying similar papers. We will make use of:
- Sentence Transformers for embeddings
- Annoy for indexing
- The OpenAI API for generating the summary
Prerequisites
Before proceeding, ensure you have the following:
- Python 3+
- Flask for creating an endpoint
- Knowledge of JSON, Annoy, and Sentence Transformers
Steps
Step 1: Setup and Install Dependencies
First, install the required packages:
pip install flask
Step 2: Load and Preprocess Arxiv Metadata
To summarize and find similar articles, we need the dataset's metadata. The preprocess function does this by:
- Loading the JSON data
- Extracting titles and abstracts
- Combining them into sentences
Step 3: Generate Annoy Index
Annoy (Approximate Nearest Neighbors Oh Yeah) is used to search for similar vectors in large datasets. Here, we load an Annoy index given a filename.
Step 4: Search Function
The search function takes a query, computes its embedding using Sentence Transformers, and then finds the closest matches in our Annoy index.
Step 5: Display Results
Once we've found the closest matches, we need to format and display them.
Step 6: Using OpenAI for Summarization
We use OpenAI's API to generate a summary of the Arxiv article. The article, its title, abstract, and page content are sent to OpenAI.
Step 7: Flask Endpoint
We create an endpoint in Flask that processes the Arxiv URL, summarizes the article, searches for similar articles, and returns a formatted HTML response.
Step 8: Running the Flask Server
Finally, run your Flask application. Then, open a browser and navigate to:
http://127.0.0.1:5000/search?q=ARXIV_URL
replacing ARXIV_URL
with your desired Arxiv article URL.
Conclusion
You've now created a tool that summarizes Arxiv articles and finds similar articles based on their content. This tool can be extended with other features or integrated into larger applications to aid researchers and academics.
Explore more AI tutorials for all levels of expertise, and test your skills at AI Hackathons at lablab.ai community!
Leave a comment
All comments are moderated before being published.
This site is protected by hCaptcha and the hCaptcha Privacy Policy and Terms of Service apply.