
plan for parsing math latex expressions from arxiv

Published 2020-05-26T21:37:00.004Z by Physics Derivation Graph

The arxiv content is available through AWS S3: https://arxiv.org/help/bulk_data_s3
As an alternative to S3, arxiv points to a subset that's available without going through AWS: https://www.cs.cornell.edu/projects/kddcup/datasets.html

A large collection of LaTeX expressions is valuable because we could use it to predict what a user wants to enter, reducing the amount of manual entry required. Also, if a derivation contains expressions similar to those in the arXiv content, we could investigate whether the derivation is related to the corresponding arXiv paper.

Steps for working with arxiv data

Download papers (in .tex format) for a given domain.

For each tex file, separate the content into three parts: the text, the math, and the latex commands.
Task: identify all latex commands.
Task: identify latex commands that alter the math latex content (e.g., \newcommand)
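
A minimal sketch of this step, assuming regular expressions are good enough for a first pass. The function names (split_tex, find_commands, find_macros) and the sample document are hypothetical, and the patterns cover only $...$, \[...\], and the equation/align environments; real arXiv sources would need a proper tokenizer.

```python
import re

# Assumption: math appears only as $...$, \[...\], or equation/align environments.
MATH_PATTERN = re.compile(
    r"\\begin\{(equation\*?|align\*?)\}(?P<env>.*?)\\end\{\1\}"
    r"|\\\[(?P<display>.*?)\\\]"
    r"|(?<!\\)\$(?P<inline>[^$]+?)(?<!\\)\$",
    re.DOTALL,
)
COMMAND_PATTERN = re.compile(r"\\[A-Za-z]+")
NEWCOMMAND_PATTERN = re.compile(
    r"\\newcommand\*?\s*\{?(\\[A-Za-z]+)\}?\s*(?:\[(\d)\])?\s*\{"
)


def split_tex(source):
    """Return (text_without_math, list_of_math_strings)."""
    math_parts = []

    def keep(match):
        groups = match.group("env", "display", "inline")
        body = next(g for g in groups if g is not None)
        math_parts.append(body.strip())
        return " "  # leave a gap where the math used to be

    text = MATH_PATTERN.sub(keep, source)
    return text, math_parts


def find_commands(fragment):
    """Return the sorted set of alphabetic latex commands in a fragment."""
    return sorted(set(COMMAND_PATTERN.findall(fragment)))


def read_braced(source, start):
    """Return the text inside the balanced brace group opening at source[start]."""
    depth = 0
    for i in range(start, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                return source[start + 1:i]
    raise ValueError("unbalanced braces")


def find_macros(source):
    """Map each \\newcommand name to its replacement body."""
    macros = {}
    for match in NEWCOMMAND_PATTERN.finditer(source):
        macros[match.group(1)] = read_braced(source, match.end() - 1)
    return macros


# Hypothetical sample document, not taken from arXiv.
SAMPLE = r"""
\newcommand{\vect}[1]{\mathbf{#1}}
The force is $\vect{F} = m \vect{a}$ and energy is
\begin{equation}
E = m c^2
\end{equation}
"""

text, math_parts = split_tex(SAMPLE)
macros = find_macros(SAMPLE)
print(math_parts)
print(find_commands(math_parts[0]))
print(macros)
```

Collecting the \newcommand definitions separately matters because a user-defined macro such as \vect has to be expanded before the math content can be interpreted.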

Before attempting to parse the math latex content, remove all presentation-related artifacts
Task: identify all non-math commands used in math latex.
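
As an illustration of removing presentation-related artifacts, here is a rough sketch. The list of layout-only commands is a hypothetical starting point, not a complete inventory, and the function name strip_presentation is made up for this example. A known limitation: \left. and \right. leave a stray period behind.

```python
import re

# Hypothetical, incomplete list of commands that change layout but not meaning.
PRESENTATION_COMMANDS = [
    "left", "right", "big", "Big", "bigg", "Bigg",
    "displaystyle", "textstyle", "nonumber", "quad", "qquad", "limits",
]
# The lookahead keeps longer commands such as \leftarrow intact.
WORD_PATTERN = re.compile(
    r"\\(?:%s)(?![A-Za-z])" % "|".join(PRESENTATION_COMMANDS)
)
# Spacing commands: \, \; \: \!
SPACING_PATTERN = re.compile(r"(?<!\\)\\[,;:!]")
# Equation labels carry no mathematical content.
LABEL_PATTERN = re.compile(r"\\label\{[^{}]*\}")


def strip_presentation(expr):
    """Remove layout-only commands and normalize whitespace."""
    expr = LABEL_PATTERN.sub("", expr)
    expr = SPACING_PATTERN.sub(" ", expr)
    expr = WORD_PATTERN.sub(" ", expr)
    return " ".join(expr.split())


print(strip_presentation(r"\left( \frac{a}{b} \right)^2 \, \label{eq:1} \quad x"))
print(strip_presentation(r"a \leftarrow b"))
```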

Sources to help with parsing math latex:
Parsing a LaTeX expression should return candidate SymPy expressions, each with a probability. In the case of unambiguous matching, only one expression should match (p=1). In the case of ambiguous matching, two or more SymPy expressions should be returned, each with some probability, and the probabilities should sum to one (e.g., p_1 + p_2 = 1 for two candidates).

That is, in some sense, the same process a human goes through to decode the intended meaning of any given math expression in a scientific paper. We are looking to encode that process as a Python program.
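
One possible shape for that interface is sketched below. Everything here is an assumption made for illustration: the Candidate class, the interpret function, the single ambiguity rule (a letter followed by a parenthesized group could be a product or a function application), and the weights, which in practice would have to be estimated from a corpus. To keep the sketch free of dependencies, candidates are stored as strings of SymPy source text rather than as SymPy objects.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    sympy_source: str   # text intended to be evaluated by SymPy later
    probability: float


# Hypothetical rule: "a(b+c)" is either a*(b+c) or the function a applied to b+c.
AMBIGUOUS = re.compile(r"^([A-Za-z])\(([^()]+)\)$")


def interpret(latex):
    """Return candidate interpretations whose probabilities sum to 1."""
    match = AMBIGUOUS.match(latex.replace(" ", ""))
    if match:
        name, inner = match.groups()
        # Made-up weights; these are placeholders, not measured values.
        weighted = [
            (f"{name}*({inner})", 1.0),               # product
            (f"Function('{name}')({inner})", 3.0),    # function application
        ]
    else:
        # Assumption: anything else is already plain arithmetic, e.g. "x + y".
        weighted = [(latex, 1.0)]
    total = sum(weight for _, weight in weighted)
    return [Candidate(source, weight / total) for source, weight in weighted]


for candidate in interpret("a(b+c)"):
    print(candidate)
print(interpret("x + y"))
```

Normalizing the weights inside interpret guarantees the p_1 + p_2 = 1 property regardless of how the individual weights are chosen.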