Characterizing LaTeX content in arXiv.org .tex files

Published 2020-05-31T02:23:00Z by Physics Derivation Graph

how many .tex files are there in total?
how many English words are there per file?
how many expressions are there in the whole corpus?
what is the distribution of (number of expressions) per file?
what is the distribution of (ratio of words per file to expressions per file)?
how many known LaTeX symbols appear across all the expressions?
what is the distribution of (expression length in characters)?
what is the distribution of (known symbols per expression)?
are there extremely rare character sequences, such as binary files hidden in .tex, or other anomalies?
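
The per-file questions above could be answered with something like the following Python sketch. It is a rough illustration, not a tested pipeline: the function names, the tiny KNOWN_SYMBOLS set, and the choice of math delimiters in MATH_PATTERN are all assumptions made for this example, and a regex will miss custom macros and nested environments that a real LaTeX parser would handle.

```python
import re

# Hypothetical, deliberately small symbol list; a real run would need a fuller one.
KNOWN_SYMBOLS = {"\\alpha", "\\beta", "\\frac", "\\int", "\\sum", "\\partial", "\\hbar", "\\psi"}

# Assumed delimiters: $$...$$, $...$, \[...\], \(...\), and equation/align environments.
MATH_PATTERN = re.compile(
    r"\$\$(.+?)\$\$"
    r"|(?<!\\)\$(.+?)(?<!\\)\$"
    r"|\\\[(.+?)\\\]"
    r"|\\\((.+?)\\\)"
    r"|\\begin\{(equation|align)\*?\}(.+?)\\end\{\5\*?\}",
    re.DOTALL,
)


def strip_comments(tex):
    """Remove LaTeX comments (an unescaped % through the end of the line)."""
    return re.sub(r"(?<!\\)%.*", "", tex)


def extract_expressions(tex):
    """Return the body of every math span matched by MATH_PATTERN."""
    expressions = []
    for match in MATH_PATTERN.finditer(tex):
        # Group index 4 holds the environment name; every other group is a body.
        body = next(g for i, g in enumerate(match.groups()) if g is not None and i != 4)
        expressions.append(body.strip())
    return expressions


def count_words(tex):
    """Count alphabetic tokens after removing math spans and command names."""
    text = MATH_PATTERN.sub(" ", tex)
    text = re.sub(r"\\[a-zA-Z]+\*?", " ", text)
    return len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text))


def count_known_symbols(expression):
    """Count commands in one expression that appear in KNOWN_SYMBOLS."""
    return sum(1 for cmd in re.findall(r"\\[a-zA-Z]+", expression) if cmd in KNOWN_SYMBOLS)


def count_control_characters(tex):
    """Count control characters other than tab/newline/CR, a crude binary-data flag."""
    return sum(1 for ch in tex if ord(ch) < 32 and ch not in "\t\n\r")


def characterize(tex):
    """Collect the per-file measurements for the contents of one .tex file."""
    body = strip_comments(tex)
    expressions = extract_expressions(body)
    words = count_words(body)
    return {
        "words": words,
        "expressions": len(expressions),
        "expression_lengths": [len(e) for e in expressions],
        "known_symbols_per_expression": [count_known_symbols(e) for e in expressions],
        "words_per_expression": words / len(expressions) if expressions else None,
        "control_characters": count_control_characters(tex),
    }
```

Calling characterize on the text of each file and pooling the returned values across files would give the raw numbers for the distributions listed above.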

This characterization step will be useful when comparing domains. For example, if we sample another domain (e.g., quantum mechanics), are the distributions similar or not? If the characterization looks the same, then we can expect that the techniques we develop are likely to apply to a novel corpus.
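
One possible way to put a number on "similar or not" is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distributions. The post does not name a comparison method, so this choice, the function name ks_statistic, and any threshold for calling two domains "similar" are assumptions. A minimal sketch in Python:

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples (0 to 1)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # Both empirical CDFs are step functions, so the gap can only peak at an observed value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )
```

Here sample_a and sample_b would be the same measurement (for example, expressions per file) taken from two domains. A value near 0 suggests the two distributions have a similar shape, and a value near 1 suggests they barely overlap.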

Establishing that the sample being used is representative means we can work with a smaller data set (rather than "all the .tex in arXiv"). If the shape of each distribution stops changing as more .tex files are added, that indicates the characterization has converged.
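
A convergence check along these lines could track a couple of summary statistics as the sample grows. In the Python sketch below, the use of the median and 90th percentile, the 5% tolerance, the shuffling seed, and the names running_summary and has_converged are all illustrative assumptions rather than anything prescribed here.

```python
import random
import statistics


def running_summary(values, checkpoints, seed=0):
    """Median and 90th percentile of the first k values, for each checkpoint k."""
    shuffled = list(values)
    # Shuffle so that growing prefixes behave like growing random samples.
    random.Random(seed).shuffle(shuffled)
    rows = []
    for k in checkpoints:
        if k < 2 or k > len(shuffled):
            continue  # skip checkpoints the data cannot support
        prefix = shuffled[:k]
        p90 = statistics.quantiles(prefix, n=10)[-1]
        rows.append((k, statistics.median(prefix), p90))
    return rows


def has_converged(rows, tolerance=0.05):
    """True if the last two checkpoints agree within a relative tolerance."""
    if len(rows) < 2:
        return False
    (_, med_a, p90_a), (_, med_b, p90_b) = rows[-2], rows[-1]

    def close(x, y):
        return abs(x - y) <= tolerance * max(abs(x), abs(y), 1e-12)

    return close(med_a, med_b) and close(p90_a, p90_b)
```

For example, values could be the number of expressions per file and checkpoints something like [100, 1000, 10000]. If has_converged returns True for the measurements of interest, that would be some evidence that the smaller sample is enough.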

If we find a domain that doesn't have similar distributions, then we can investigate why it is anomalous.