Published 2022-05-28T21:14:00.004Z by Physics Derivation Graph
In my previous post I outlined a sequence of steps, framed negatively in terms of how difficult each step would be. A positive framing of the same sequence follows.
Suppose we are at step 2 and everything in a document is correctly tokenized (or even just a fraction of the content). The follow-on step (3) would be to detect the definitions of those tokens from the surrounding text. For example, if the variable "a" shows up in an expression, and $a$ shows up in the text in a sentence like

"where $a$ is the number of cats in the house"

then we can deduce that "a" is defined as "number of cats in the house".
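As a minimal sketch of this definition-detection step, a regular expression can pick out the "where $x$ is ..." pattern. The pattern and function name below are illustrative assumptions, not part of the Physics Derivation Graph codebase:

```python
import re

# Hypothetical pattern for sentences of the form "where $x$ is <definition>".
# Real prose would need many more patterns ("denotes", "represents", etc.).
DEF_PATTERN = re.compile(
    r"where \$(?P<symbol>[^$]+)\$ is (?:the )?(?P<definition>[^.,;]+)"
)

def extract_definition(sentence: str):
    """Return (symbol, definition) if the sentence matches, else None."""
    match = DEF_PATTERN.search(sentence)
    if match:
        return match.group("symbol"), match.group("definition").strip()
    return None

print(extract_definition("where $a$ is the number of cats in the house"))
# -> ('a', 'number of cats in the house')
```

A production version would likely need part-of-speech tagging or a trained model rather than regexes, since defining sentences vary widely in phrasing.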
Step 4 would be to figure out whether "a" is used similarly in other papers. That would indicate a relation between papers based on the topic of their content. See, for example, https://arxiv.org/pdf/1902.00027.pdf
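One crude way to sketch this cross-paper comparison (an assumption on my part, not a method from the linked paper) is to measure word overlap between the extracted definitions of the same symbol in two documents:

```python
# Hypothetical sketch: decide whether two papers use the same symbol for a
# related concept by comparing word overlap of their extracted definitions.
def definition_overlap(def_a: str, def_b: str) -> float:
    """Jaccard similarity of the word sets of two definitions."""
    words_a = set(def_a.lower().split())
    words_b = set(def_b.lower().split())
    if not (words_a | words_b):
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

paper1_defs = {"a": "number of cats in the house"}
paper2_defs = {"a": "count of cats in a house"}
score = definition_overlap(paper1_defs["a"], paper2_defs["a"])
print(score)  # 0.5
```

Real systems would use embeddings or citation context rather than raw word overlap, but the idea is the same: shared symbol semantics hint at a topical relation.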
Another use case for tokenized text (from step 2) with some semantic meaning (from step 3) would be to validate expressions. If the expression is "a = b" and the two variables have different units, then the expression is dimensionally inconsistent and therefore wrong.
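That dimensional check can be sketched as follows. The representation of units as exponent tuples is my own simplification, assuming only three base dimensions for brevity:

```python
from typing import NamedTuple

# Hypothetical unit representation: exponents of (length, mass, time).
# A real system would cover all seven SI base dimensions.
class Units(NamedTuple):
    length: int
    mass: int
    time: int

METERS = Units(1, 0, 0)
SECONDS = Units(0, 0, 1)

def equation_is_dimensionally_valid(lhs_units: Units, rhs_units: Units) -> bool:
    """An equality only makes sense when both sides carry the same units."""
    return lhs_units == rhs_units

# "a = b" with a in meters and b in seconds fails the check.
print(equation_is_dimensionally_valid(METERS, SECONDS))  # False
print(equation_is_dimensionally_valid(METERS, METERS))   # True
```

Libraries such as SymPy's `sympy.physics.units` or Pint implement this idea in full generality; the point here is only that once tokens carry semantic meaning, consistency checks become mechanical.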