plan for parsing math latex expressions from arxiv

Published 2020-05-26T21:37:00.004Z by Physics Derivation Graph

The arxiv content is available through AWS S3: https://arxiv.org/help/bulk_data_s3
As an alternative to S3, arxiv points to a subset that's available without going through AWS: https://www.cs.cornell.edu/projects/kddcup/datasets.html

The value of having a large number of expressions in Latex is that we could use the expressions to predict what a user wants to enter, decreasing the amount of manual entry required. Also, if a derivation contains similar expressions to what exists in the arxiv content, we could investigate whether the derivation is related to the arxiv paper.

Steps for working with arxiv data

Download papers (in .tex format) for a given domain.

For each tex file, separate the content into three parts: the text, the math, and the latex commands.
Task: identify all latex commands.
Task: identify latex commands that alter the math latex content (e.g., \newcommand)
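
A minimal sketch of the separation step and the two tasks, using regular expressions. The names `split_tex`, `commands_in`, and `defined_commands` are hypothetical, and the sketch assumes math is delimited by `$...$`, `$$...$$`, `\[...\]`, or an equation/align/eqnarray environment. It does not handle escaped dollar signs, comments, or nested braces, so it is only a starting point.

```python
import re

# Assumption: these delimiters cover most of the math in a paper.
MATH_PATTERN = re.compile(
    r"\$\$(?P<display>.+?)\$\$"
    r"|\$(?P<inline>.+?)\$"
    r"|\\\[(?P<bracket>.+?)\\\]"
    r"|\\begin\{(?P<env>equation|align|eqnarray)\*?\}(?P<body>.+?)\\end\{(?P=env)\*?\}",
    re.DOTALL,
)
# A latex command is approximated as a backslash followed by letters.
COMMAND_PATTERN = re.compile(r"\\[A-Za-z]+")
# Captures only the name being defined by \newcommand, not its body.
NEWCOMMAND_PATTERN = re.compile(r"\\newcommand\*?\{?(\\[A-Za-z]+)\}?(?:\[\d\])?\{")


def split_tex(source):
    """Return (text_without_math, list_of_math_strings)."""
    math = []
    for m in MATH_PATTERN.finditer(source):
        groups = (m.group("display"), m.group("inline"),
                  m.group("bracket"), m.group("body"))
        math.append(next(g for g in groups if g is not None).strip())
    text = MATH_PATTERN.sub(" ", source)
    return text, math


def commands_in(strings):
    """Set of latex commands that appear in the given strings."""
    return {cmd for s in strings for cmd in COMMAND_PATTERN.findall(s)}


def defined_commands(source):
    """Names introduced by \\newcommand; these can alter the math content."""
    return set(NEWCOMMAND_PATTERN.findall(source))


sample = r"""
\newcommand{\vect}[1]{\mathbf{#1}}
The force is $\vect{F} = m \vect{a}$ and
\begin{equation}
E = m c^2
\end{equation}
"""
text, math = split_tex(sample)
print(math)
print(commands_in(math) & defined_commands(sample))
```

The intersection at the end lists the commands used in math that the paper itself defines, which are the ones that would need to be expanded before parsing.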

Before attempting to parse the math latex content, remove all presentation-related artifacts:

replace '\left(' with '('
replace '\right)' with ')'
replace '\ ' with ' '
replace '\,' with ' '
replace '\quad' with ' '
replace '\qquad' with ' '
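
The replacements above could be applied with plain string substitution, as in the sketch below. The function name `strip_presentation` is hypothetical, and collapsing repeated whitespace at the end is an extra assumption not listed above. Plain substitution is naive: for example, it would also alter a line break `\\` that happens to be followed by a space.

```python
# (old, new) pairs taken from the list of replacements above.
PRESENTATION_REPLACEMENTS = [
    (r"\left(", "("),
    (r"\right)", ")"),
    (r"\qquad", " "),
    (r"\quad", " "),
    (r"\,", " "),
    ("\\ ", " "),  # backslash followed by a space
]


def strip_presentation(expr):
    """Apply each replacement, then collapse runs of whitespace (an assumption)."""
    for old, new in PRESENTATION_REPLACEMENTS:
        expr = expr.replace(old, new)
    return " ".join(expr.split())


print(strip_presentation(r"\left( a + b \right) \quad c\, d"))
```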

Task: identify all non-math commands used in math latex.

Sources to help with parsing math latex:

within the math latex string to parse, what can be deduced about the expected context?
given other math expressions in the same paper, what would be consistent?
given the text in a paper surrounding the math expressions, what would be expected based on keywords?
given other papers in the same domain or based on citations, what would be likely?
what is statistically likely given the corpus of all articles?
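
As one illustration of the last source, corpus statistics could start from simple counts of adjacent token pairs. This is a sketch under assumptions: the tokenizer is a crude regular expression (a latex command or a single non-space character), the three-expression corpus is made up for the example, and `bigram_counts` and `next_token_probabilities` are hypothetical names.

```python
import re
from collections import Counter

# Crude tokenizer: a latex command, or any single non-whitespace character.
TOKEN_PATTERN = re.compile(r"\\[A-Za-z]+|\S")


def bigram_counts(expressions):
    """Count adjacent token pairs across all expressions."""
    counts = Counter()
    for expr in expressions:
        tokens = TOKEN_PATTERN.findall(expr)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def next_token_probabilities(counts, previous):
    """Relative frequency of each token observed directly after `previous`."""
    following = {b: n for (a, b), n in counts.items() if a == previous}
    total = sum(following.values())
    return {tok: n / total for tok, n in following.items()}


# Made-up corpus, for illustration only.
corpus = [r"E = m c^2", r"F = m a", r"p = m v"]
counts = bigram_counts(corpus)
print(next_token_probabilities(counts, "="))
print(next_token_probabilities(counts, "m"))
```

The same counts could also feed the prediction of what a user is about to enter.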

Use the Trie data structure to determine what the valid characters in the grammar should be. (Probably some subset of ASCII with some Unicode chars.)
What are the tokens/symbols of the language?
What are the common sequences of tokens?
What are the appropriate labels for the tokens?
Instead of listing 10 different relational operators each time, create a group of relational operators and reference the group.
What are some logical groupings of symbols?
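
A minimal sketch of how a trie and token groups might fit together. The trie stores known tokens, reports the characters it has seen, and finds the longest known token at a position. The group names and members in `TOKEN_GROUPS` are illustrative assumptions, not a settled vocabulary, and `Trie`, `tokenize`, and `LABEL_OF` are hypothetical names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.is_token = False  # True if the path to this node spells a token


class Trie:
    def __init__(self, tokens=()):
        self.root = TrieNode()
        for token in tokens:
            self.insert(token)

    def insert(self, token):
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.is_token = True

    def longest_match(self, text, start):
        """Longest stored token beginning at text[start], or None."""
        node, end = self.root, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_token:
                end = i + 1
        return None if end is None else text[start:end]

    def alphabet(self):
        """Every character that appears in some stored token."""
        chars, stack = set(), [self.root]
        while stack:
            node = stack.pop()
            for ch, child in node.children.items():
                chars.add(ch)
                stack.append(child)
        return chars


# Assumed groups: reference the group label instead of listing each member.
TOKEN_GROUPS = {
    "relational_operator": ["=", "<", ">", r"\leq", r"\geq", r"\neq", r"\approx"],
    "binary_operator": ["+", "-", r"\times", r"\cdot"],
    "greek_letter": [r"\alpha", r"\beta", r"\gamma"],
}
LABEL_OF = {tok: label for label, toks in TOKEN_GROUPS.items() for tok in toks}


def tokenize(expr, trie, label_of):
    """Greedy longest-match tokenization; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(expr):
        if expr[i].isspace():
            i += 1
            continue
        match = trie.longest_match(expr, i) or expr[i]
        tokens.append((match, label_of.get(match, "unknown")))
        i += len(match)
    return tokens


trie = Trie(LABEL_OF)
print(tokenize(r"\alpha \leq \beta + x", trie, LABEL_OF))
print(sorted(trie.alphabet()))
```

Tokens labeled "unknown" are the ones that still need a group, so running this over a corpus gives a to-do list for the grammar.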

Parsing a LaTeX expression should return candidate SymPy expressions, each with a probability. In the case of unambiguous matching, only one expression should match (p=1). In the case of ambiguous matching, two or more SymPy expressions should be returned, each with some probability, and the probabilities should sum to 1 (e.g., p_1 + p_2 = 1 for two candidates).
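
One possible shape for that return value is sketched below. It does not parse anything: it only shows candidates with probabilities normalized to sum to 1. The names `Candidate` and `rank_candidates` are hypothetical, the weights are invented for the example, and candidates are kept as strings that could be handed to SymPy (for instance via `sympy.sympify`) so the sketch runs without SymPy installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    sympy_source: str   # string form of a SymPy expression (assumption)
    probability: float


def rank_candidates(weights):
    """Turn non-negative weights into probabilities, most likely first."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one candidate needs a positive weight")
    candidates = [Candidate(src, w / total) for src, w in weights.items()]
    return sorted(candidates, key=lambda c: c.probability, reverse=True)


# Ambiguous: the LaTeX string "ab" could mean the product a*b or one symbol "ab".
# The weights are made up for illustration.
ambiguous = rank_candidates({"a*b": 3.0, "ab": 1.0})
# Unambiguous: a single candidate ends up with p = 1.
unambiguous = rank_candidates({"x**2": 5.0})

for c in ambiguous + unambiguous:
    print(c.sympy_source, c.probability)
```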

That is, in some sense, the same process a human goes through to decode the intended meaning of any given math expression in a scientific paper. We are looking to encode that process as a Python program.