# plan for parsing math latex expressions from arxiv

Published 2020-05-26T21:37:00.004Z by Physics Derivation Graph

The arxiv content is available through AWS S3: https://arxiv.org/help/bulk_data_s3
As an alternative to S3, arxiv points to a subset that's available without going through AWS: https://www.cs.cornell.edu/projects/kddcup/datasets.html

The value of having a large number of expressions in LaTeX is that we could use them to predict what a user wants to enter, decreasing the amount of manual entry required. Also, if a derivation contains expressions similar to those in the arxiv content, we could investigate whether the derivation is related to the arxiv paper.

### Steps for working with arxiv data

For each tex file, separate the content into three parts: the text, the math, and the latex commands.
Task: identify latex commands that alter the math latex content (e.g., \newcommand)
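
As a rough illustration of that separation step, here is a minimal sketch using regular expressions. The patterns and the function names `extract_math` and `find_macro_names` are assumptions made for this example; real arxiv sources use many more math environments, escaped dollar signs, and macro definitions with nested braces, none of which this handles.

```python
import re
from typing import List

# Assumed patterns: only equation/align environments, $$...$$, and $...$.
MATH_PATTERN = re.compile(
    r"\\begin\{(equation|align)\*?\}(?P<env>.*?)\\end\{\1\*?\}"
    r"|\$\$(?P<display>.+?)\$\$"
    r"|\$(?P<inline>.+?)\$",
    re.DOTALL,
)

# Captures only the name of each macro defined with \newcommand or \renewcommand.
NEWCOMMAND_PATTERN = re.compile(
    r"\\(?:re)?newcommand\*?\s*\{?\s*(\\[A-Za-z]+)\s*\}?"
)


def extract_math(tex: str) -> List[str]:
    """Return the math snippets found in a tex source, in document order."""
    found = []
    for match in MATH_PATTERN.finditer(tex):
        groups = (match.group("env"), match.group("display"), match.group("inline"))
        body = next(g for g in groups if g is not None)
        found.append(body.strip())
    return found


def find_macro_names(tex: str) -> List[str]:
    """Return the names of user-defined macros that may alter the math content."""
    return NEWCOMMAND_PATTERN.findall(tex)
```

The macro names found this way would still need to be expanded (or at least flagged) before the math is parsed, since a paper-specific macro such as a custom bra-ket command changes what the math string means.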

Before attempting to parse the math latex content, remove all presentation-related artifacts:
• replace '\left(' with '('
• replace '\right)' with ')'
• replace '\ ' with ' '
• replace '\,' with ' '
• replace '\quad' with ' '
• replace '\qquad' with ' '
Task: identify all non-math commands used in math latex.
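
The replacements listed above could be applied with a few lines of Python. This is a sketch only: the name `strip_presentation` is made up for this example, and collapsing repeated spaces at the end is an extra step that is not in the list. Plain string replacement is also naive: it would mangle a line break `\\` followed by a space, for instance.

```python
import re

# (pattern, replacement) pairs taken from the list above.
REPLACEMENTS = [
    (r"\left(", "("),
    (r"\right)", ")"),
    (r"\qquad", " "),
    (r"\quad", " "),
    (r"\,", " "),
    ("\\ ", " "),  # backslash-space
]


def strip_presentation(latex: str) -> str:
    """Remove presentation-only spacing and delimiter sizing from math latex."""
    for old, new in REPLACEMENTS:
        latex = latex.replace(old, new)
    # Assumption: runs of spaces carry no meaning, so collapse them.
    return re.sub(r" {2,}", " ", latex).strip()


print(strip_presentation(r"\left( a \, + \quad b \right)"))
```

A fuller version would also cover `\left[`, `\right]`, `\bigl`, `\!`, and similar commands once the second task (finding all non-math commands used in math latex) has produced a list.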

Sources to help with parsing math latex:
• within the math latex string to parse, what can be deduced about the expected context?
• given other math expressions in the same paper, what would be consistent?
• given the text in a paper surrounding the math expressions, what would be expected based on keywords?
• given other papers in the same domain or based on citations, what would be likely?
• what is statistically likely given the corpus of all articles?
• Use the Trie data structure to determine what the valid characters in the grammar should be. (Probably some subset of ASCII with some Unicode chars.)
• What are the tokens/symbols of the language?
• What are the common sequences of tokens?
• What are the appropriate labels for the tokens?
• Instead of listing 10 different relational operators each time, create a group of relational operators and reference the group.
• What are some logical groupings of symbols?
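
To make the questions about tokens, labels, and groups more concrete, here is a small sketch of a tokenizer that labels each token by group and counts which labels tend to follow each other. The token pattern, the group names, and the group members are all placeholders chosen for this example, and it uses a regular expression rather than a Trie.

```python
import re
from collections import Counter

# Assumed token shapes: \command, escaped char, single letter, integer, other symbol.
TOKEN_PATTERN = re.compile(r"\\[A-Za-z]+|\\.|[A-Za-z]|\d+|\S")

# Named groups, so a grammar rule can say "relation" instead of listing each operator.
TOKEN_GROUPS = {
    "relation": {"=", "<", ">", r"\leq", r"\geq", r"\neq", r"\approx"},
    "binary_operator": {"+", "-", "*", "/", r"\times", r"\cdot"},
    "grouping": {"(", ")", "{", "}", "[", "]"},
}


def tokenize(latex):
    """Split a math latex string into tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(latex)


def label(token):
    """Map a token to its group name, falling back to coarse categories."""
    for name, members in TOKEN_GROUPS.items():
        if token in members:
            return name
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "letter"
    if token.startswith("\\"):
        return "command"
    return "other"


def bigram_counts(expressions):
    """Count adjacent label pairs across a collection of expressions."""
    counts = Counter()
    for expr in expressions:
        labels = [label(t) for t in tokenize(expr)]
        counts.update(zip(labels, labels[1:]))
    return counts
```

Run over a large corpus, counts like these would give a first estimate of which token sequences are common, which is one input to deciding what is statistically likely.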
Parsing a LaTeX expression should return candidate SymPy expressions, each with a probability. In the case of unambiguous matching, only one expression should match (p=1). In the case of ambiguous matching, two or more SymPy expressions should be returned, each with some probability, and the probabilities should sum to 1 (e.g., p_1 + p_2 = 1).

That is, in some sense, the same process a human goes through to decode the intended meaning of any given math expression in a scientific paper. We are looking to encode that process as a Python program.
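
One possible shape for the output of such a program is sketched below. It assumes that some earlier step has already produced a non-negative score for each candidate interpretation; how those scores are computed is the open problem and is not addressed here. The names `Candidate` and `rank_candidates` are hypothetical, and candidates are held as strings to keep the sketch free of dependencies, where a real implementation would hold SymPy expressions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    expression: str     # stand-in for a SymPy expression
    probability: float  # probabilities across all candidates sum to 1


def rank_candidates(scores: Dict[str, float]) -> List[Candidate]:
    """Normalize raw scores into probabilities, most likely candidate first."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one candidate needs a positive score")
    ranked = [Candidate(expr, score / total) for expr, score in scores.items()]
    return sorted(ranked, key=lambda c: c.probability, reverse=True)


# Ambiguous input "ab": a product of two symbols, or one symbol named "ab"?
for candidate in rank_candidates({"a*b": 3.0, "Symbol('ab')": 1.0}):
    print(candidate)
```

With a single candidate the function returns it with probability 1, matching the unambiguous case.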