replacing symbols in a Sympy expression and generalizing the AST
Published 2020-05-30T20:14:00.001Z by Physics Derivation Graph
Sympy's ability to convert a Latex string to a Sympy expression is useful, but the resulting expression carries no link between the variables in the Latex string and other information about them, such as dimension.
>>> import sympy
>>> from sympy import Equality, Add, Symbol, Mul, Pow, Integral, Tuple
>>> from sympy.parsing.latex import parse_latex
First, remove all presentation-related markup from a Latex string.
Then convert a Latex string to a Sympy expression using
>>> eq = parse_latex('a + b = c')
>>> eq
Eq(a + b, c)
In this post we will replace the variables with the reference IDs for each variable while maintaining the structure of the expression.
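A minimal sketch of that replacement, assuming the string form of the expression comes from Sympy's `srepr` and that a mapping from symbol names to PDG IDs is available. The `symbol_to_id` dictionary is hypothetical, with IDs chosen to match the output shown next, and the expression is built by hand to mirror what `parse_latex('a + b = c')` returned above.

```python
from sympy import Add, Equality, Symbol, srepr

# Same structure as the parse_latex('a + b = c') result, built by hand.
eq = Equality(Add(Symbol('a'), Symbol('b')), Symbol('c'))

# Hypothetical lookup table: symbol name -> PDG reference ID.
symbol_to_id = {'a': 'pdg3291', 'b': 'pdg4942', 'c': 'pdg0021'}

# srepr gives a string that can be evaluated back into the same expression tree.
eq_str = srepr(eq)

# Swap each Symbol('name') for Symbol('pdgNNNN'). Matching the whole quoted
# token avoids touching letters that merely appear inside other names.
eq_str_with_id = eq_str
for name, pdg_id in symbol_to_id.items():
    eq_str_with_id = eq_str_with_id.replace(
        "Symbol('%s')" % name, "Symbol('%s')" % pdg_id
    )
print(eq_str_with_id)

# Alternative that skips the string round trip: structural replacement.
eq_via_xreplace = eq.xreplace(
    {Symbol(name): Symbol(pdg_id) for name, pdg_id in symbol_to_id.items()}
)
print(eq_via_xreplace)
```

The string route only needs the names used in the string (here `Equality`, `Add`, `Symbol`) to be imported before it is evaluated. The `xreplace` route avoids `eval` entirely, which may be preferable when the Latex comes from an untrusted source.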
Lastly, evaluate the string to get a Sympy expression
>>> eq_with_id = eval(eq_str_with_id)
>>> eq_with_id
Eq(pdg3291 + pdg4942, pdg0021)
This representation is useful because it separates presentation from semantic structure.
Getting the symbol list is also easy:
>>> eq_with_id.free_symbols
{pdg3291, pdg4942, pdg0021}
Example
To show why separation matters, suppose we have the Latex string
f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg
That is a challenge for Sympy's parse_latex, even though Sympy can handle semantically equivalent structures like
>>> parse_latex(r'f = \int_a^b g dg')
Eq(f, Integral(g, (g, a, b)))
If we know that x_{\rm bottom} and x_{\rm top} are each a single variable, then we can simplify the presentation string to a temporary string in which each is replaced by a dummy variable.
>>> initial_latex_str = r'f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg'
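One way to build the temporary string, sketched with the standard library only. The helper names `pick_dummy_names` and `substitute_dummies` and the `known_variables` list are assumptions made for illustration; the list stands in for whatever the PDG already knows about this expression. A raw string is used so that `\r` in `\rm` is not read as a carriage return.

```python
import string

initial_latex_str = r'f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg'

# Assumed to be known in advance (e.g. looked up in the PDG).
known_variables = [r'x_{\rm bottom}', r'x_{\rm top}']


def pick_dummy_names(latex_str, count):
    """Return `count` lowercase letters that occur nowhere in latex_str."""
    unused = [ch for ch in string.ascii_lowercase if ch not in latex_str]
    if len(unused) < count:
        raise ValueError('not enough unused single-letter names')
    return unused[:count]


def substitute_dummies(latex_str, known_variables):
    """Replace each known variable with a dummy; return the new string and the map."""
    # Longest first, so a variable that contains another is replaced whole.
    ordered = sorted(known_variables, key=len, reverse=True)
    dummies = pick_dummy_names(latex_str, len(ordered))
    dummy_to_variable = {}
    for dummy, variable in zip(dummies, ordered):
        latex_str = latex_str.replace(variable, dummy)
        dummy_to_variable[dummy] = variable
    return latex_str, dummy_to_variable


temp_latex_str, dummy_to_variable = substitute_dummies(
    initial_latex_str, known_variables
)
print(temp_latex_str)      # f = \int_{a}^{c} g dg
print(dummy_to_variable)
```

The check for unused letters is deliberately conservative: a letter is rejected even if it only appears inside a command such as `\int`, which keeps the dummy from colliding with anything in the original string.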
Algorithm for converting Latex to a semantically meaningful expression
Get a Latex string.
Clean the Latex by removing presentation syntax.
In the cleaned Latex string, identify known variables from the PDG that the Sympy parser does not handle, e.g., r_{\rm Earth}.
In the cleaned Latex string, replace each such variable with a dummy variable that does not already appear in the string, e.g., d = r_{\rm Earth}.
Parse the result: eq = parse_latex(cleaned Latex string with dummy variables).
Replace the variables and dummy variables in eq with their PDG symbol IDs.
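The last step can be sketched as follows, under some stated assumptions. The function name `replace_with_pdg_ids`, the two dictionaries, and the PDG IDs are all hypothetical. The expression is built by hand in the form that `parse_latex` returned for the integral example earlier, with dummy variables `a` and `c` standing in for `x_{\rm bottom}` and `x_{\rm top}`.

```python
from sympy import Eq, Integral, Symbol


def replace_with_pdg_ids(expr, dummy_to_variable, variable_to_id):
    """Return expr with every symbol swapped for a Symbol named by its PDG ID."""
    replacements = {}
    # atoms(Symbol) is used rather than free_symbols so that bound symbols,
    # such as the integration variable, are replaced as well.
    for sym in expr.atoms(Symbol):
        # A dummy stands for a longer Latex variable; any other symbol
        # stands for itself.
        latex_name = dummy_to_variable.get(sym.name, sym.name)
        # Raises KeyError if the variable has no PDG ID yet.
        replacements[sym] = Symbol(variable_to_id[latex_name])
    return expr.xreplace(replacements)


# Stand-in for the parsed form of the Latex string with dummy variables.
f, g, a, c = (Symbol(name) for name in 'fgac')
eq = Eq(f, Integral(g, (g, a, c)))

# Mapping produced when the dummy variables were introduced.
dummy_to_variable = {'a': r'x_{\rm bottom}', 'c': r'x_{\rm top}'}

# Hypothetical PDG symbol IDs, invented for this sketch.
variable_to_id = {
    'f': 'pdg1001',
    'g': 'pdg1002',
    r'x_{\rm bottom}': 'pdg1003',
    r'x_{\rm top}': 'pdg1004',
}

eq_with_id = replace_with_pdg_ids(eq, dummy_to_variable, variable_to_id)
print(eq_with_id)
```

Using `xreplace` keeps the tree structure intact and treats the integration variable like any other symbol, whereas `subs` would leave a bound symbol untouched.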