replacing symbols in a Sympy expression and generalizing the AST

Published 2020-05-30T20:14:00.001Z by Physics Derivation Graph

Sympy can convert a Latex string to a Sympy expression, which is useful, but the conversion does not link the variables in the Latex string to other information about them (such as dimension).

>>> import sympy
>>> from sympy import Equality, Add, Symbol, Mul, Pow, Integral, Tuple
>>> from sympy.parsing.latex import parse_latex

First, remove all presentation-related markup from a Latex string.

Then convert the Latex string to a Sympy expression using

>>> eq = parse_latex('a + b = c')

>>> eq

Eq(a + b, c)

In this post we will replace the variables with the reference IDs for each variable while maintaining the structure of the expression.

The structure of the expression is

>>> sympy.srepr(eq)

"Equality(Add(Symbol('a'), Symbol('b')), Symbol('c'))"

Since this is a string, we can replace each variable in the expression with a reference ID.

The set of variables in the expression can be accessed using

>>> set_of_symbols_in_eq = eq.free_symbols

>>> set_of_symbols_in_eq

{a, c, b}

We can then replace each variable with an ID

>>> eq_str_with_id = sympy.srepr(eq).replace("'a'","'pdg4942'").replace("'b'","'pdg3291'").replace("'c'","'pdg0021'")

>>> eq_str_with_id

"Equality(Add(Symbol('pdg4942'), Symbol('pdg3291')), Symbol('pdg0021'))"

Lastly, evaluate the string to get a Sympy expression. This works because Equality, Add, and Symbol were imported by name at the top of the session.

>>> eq_with_id = eval(eq_str_with_id)

>>> eq_with_id

Eq(pdg3291 + pdg4942, pdg0021)

This representation is useful because it separates presentation from semantic structure.

And getting the symbol list is easy:
>>> eq_with_id.free_symbols
{pdg3291, pdg4942, pdg0021}
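
The chained .replace() calls above work for this example, but each call rewrites the output of the previous one, so an ID that happens to equal another variable's name would be replaced twice. A minimal sketch of a single-pass alternative using a regular expression follows. The names rename_symbols and ids are hypothetical, introduced here for illustration; they are not part of Sympy.

```python
import re

def rename_symbols(srepr_str, name_to_id):
    """Swap Symbol names in a srepr() string in one pass (hypothetical helper)."""
    def swap(match):
        name = match.group(1)
        # names missing from the table are left unchanged
        return "Symbol('" + name_to_id.get(name, name) + "'"
    # only text of the form Symbol('...') is touched
    return re.sub(r"Symbol\('([^']+)'", swap, srepr_str)

structure = "Equality(Add(Symbol('a'), Symbol('b')), Symbol('c'))"
ids = {"a": "pdg4942", "b": "pdg3291", "c": "pdg0021"}
renamed = rename_symbols(structure, ids)
print(renamed)
```

Because every match is looked up against the original name, swapping two names (a to b and b to a) gives the expected result, which chained .replace() calls would not.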

Example

To show why separation matters, suppose we have the Latex string

f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg

That is a challenge for Sympy's parse_latex, even though Sympy can handle semantically equivalent structures like

>>> parse_latex(r'f = \int_a^b g dg')

Eq(f, Integral(g, (g, a, b)))

If we know that x_{\rm bottom} and x_{\rm top} are each a single variable, then we can simplify the presentation string to a temporary string using dummy variables

>>> # raw strings: in a plain string Python reads the \r in \rm as a carriage return
>>> initial_latex_str = r'f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg'

>>> tmp_latex_str = initial_latex_str.replace(r'x_{\rm bottom}','p').replace(r'x_{\rm top}','q')

>>> tmp_latex_str

'f = \\int_{p}^{q} g dg'

Caveat: the dummy variables (here p and q) must not already be used as variables elsewhere in initial_latex_str; otherwise two distinct variables would be merged into one.
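
One way to respect this caveat is to choose the dummy names automatically. The sketch below is an assumption about how that could be done, not something Sympy provides: pick_dummies is a hypothetical helper that strips the known variables out of the string and then hands out lowercase letters that do not occur in what is left.

```python
import string

def pick_dummies(latex_str, known_vars):
    """Map each known Latex variable to an unused letter (hypothetical helper)."""
    remainder = latex_str
    # remove the longest fragments first so one cannot clobber part of another
    for fragment in sorted(known_vars, key=len, reverse=True):
        remainder = remainder.replace(fragment, "")
    # conservative: a letter is rejected if it occurs anywhere in the remainder,
    # even inside a command name such as \int
    unused = [c for c in string.ascii_lowercase if c not in remainder]
    if len(unused) < len(known_vars):
        raise ValueError("not enough unused letters for dummy variables")
    return dict(zip(known_vars, unused))

latex = r'f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg'
dummies = pick_dummies(latex, [r'x_{\rm bottom}', r'x_{\rm top}'])
print(dummies)
```

A fuller version would probably also skip letters the parser treats specially, such as d in an integral.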

Now we can act on the tmp_latex_str as we did in the first example

>>> eq = parse_latex(tmp_latex_str)

>>> eq_str_with_id = sympy.srepr(eq).replace("'p'","'pdg4942'").replace("'q'","'pdg3291'").replace("'g'","'pdg0021'").replace("'f'","'pdg2103'")

>>> eq_with_id = eval(eq_str_with_id)

>>> eq_with_id

Eq(pdg2103, Integral(pdg0021, (pdg0021, pdg4942, pdg3291)))

Algorithm for converting Latex to a semantically meaningful expression

1. Get a Latex string.
2. Clean the Latex by removing presentation syntax.
3. In the cleaned Latex string, identify known variables from the PDG that the Sympy parser does not handle, e.g., r_{\rm Earth}.
4. In the cleaned Latex string, replace each known variable with a dummy variable, e.g. d = r_{\rm Earth}, where the dummy variable does not appear in the Latex string.
5. eq = parse_latex(cleaned latex string with dummy variables)
6. Replace variables and dummy variables in eq with the PDG symbol ID.
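
The steps above can be sketched as two small functions. This is only an illustration under stated assumptions: substitute_dummies and latex_to_pdg_expr are hypothetical names, the caller supplies the dummy letters and symbol IDs, and the cleaning step is assumed to have been done already. The second function needs Sympy together with the extra parser dependency that parse_latex relies on, so it is defined here but not called.

```python
import re

def substitute_dummies(cleaned_latex, known_vars):
    """Swap each known variable for its dummy (hypothetical helper).

    known_vars maps a Latex fragment to a (dummy, symbol_id) pair.
    Returns the temporary Latex string and a {dummy: symbol_id} table.
    """
    fragments = sorted(known_vars, key=len, reverse=True)
    remainder = cleaned_latex
    for fragment in fragments:
        remainder = remainder.replace(fragment, "")
    tmp_latex = cleaned_latex
    dummy_to_id = {}
    for fragment in fragments:
        dummy, symbol_id = known_vars[fragment]
        # enforce the caveat: the dummy must not occur outside the known variables
        if dummy in remainder:
            raise ValueError("dummy %r already appears in the Latex string" % dummy)
        tmp_latex = tmp_latex.replace(fragment, dummy)
        dummy_to_id[dummy] = symbol_id
    return tmp_latex, dummy_to_id

def latex_to_pdg_expr(cleaned_latex, known_vars, plain_ids):
    """Parse the temporary string, then swap every name for its symbol ID."""
    import sympy
    from sympy.parsing.latex import parse_latex
    tmp_latex, name_to_id = substitute_dummies(cleaned_latex, known_vars)
    name_to_id.update(plain_ids)  # IDs for ordinary variables such as f and g
    structure = sympy.srepr(parse_latex(tmp_latex))
    renamed = re.sub(
        r"Symbol\('([^']+)'",
        lambda m: "Symbol('%s'" % name_to_id.get(m.group(1), m.group(1)),
        structure,
    )
    namespace = {}
    exec("from sympy import *", namespace)  # names that srepr() output refers to
    return eval(renamed, namespace)

known = {r'x_{\rm bottom}': ('p', 'pdg4942'), r'x_{\rm top}': ('q', 'pdg3291')}
tmp, table = substitute_dummies(r'f = \int_{x_{\rm bottom}}^{x_{\rm top}} g dg', known)
print(tmp)
print(table)
```

With Sympy and its Latex parser installed, calling latex_to_pdg_expr with the same arguments plus {'f': 'pdg2103', 'g': 'pdg0021'} should follow the same route as the integral example earlier in this post.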