Characterizing LaTeX content in arXiv.org .tex files
Published 2020-05-31T02:23:00Z by Physics Derivation Graph
how many total .tex files?
how many English words per file?
how many expressions total in the corpus?
what is the distribution of (number of expressions) per file?
what is the distribution of (ratio of words per file to expressions per file)?
how many known LaTeX symbols are present across all the expressions?
what is the distribution of (expression length in characters)?
what is the distribution of (known symbols per expression)?
are there character sequences that are extremely rare, such as binary files hidden in .tex files or other anomalies?
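Several of the per-file questions above can be answered with one pass over the text of a .tex file. The following is a minimal Python sketch, not a full LaTeX parser. The names `KNOWN_SYMBOLS`, `extract_expressions`, and `characterize` are hypothetical, the symbol list is a tiny placeholder, and the regular expressions only cover `$...$`, `$$...$$`, `\[...\]`, and the `equation` and `align` environments. Escaped dollar signs, macros, and nested environments are not handled, and the word count is a rough approximation of "English words".

```python
import re

# Hypothetical, deliberately tiny list of "known" LaTeX symbols (assumption).
KNOWN_SYMBOLS = {r"\alpha", r"\beta", r"\hbar", r"\psi",
                 r"\partial", r"\frac", r"\int", r"\sum"}

# Math delimiters handled by this sketch (assumption: no escaped "\$").
MATH_PATTERN = re.compile(
    r"\$\$(.+?)\$\$"
    r"|\$(.+?)\$"
    r"|\\\[(.+?)\\\]"
    r"|\\begin\{(equation|align)\*?\}(.+?)\\end\{\4\*?\}",
    re.DOTALL,
)
COMMENT_PATTERN = re.compile(r"(?<!\\)%.*")   # unescaped % to end of line
COMMAND_PATTERN = re.compile(r"\\[A-Za-z]+")  # control words such as \alpha
WORD_PATTERN = re.compile(r"[A-Za-z]+")       # crude stand-in for "English word"


def extract_expressions(tex):
    """Return the bodies of the math expressions found in `tex`."""
    expressions = []
    for match in MATH_PATTERN.finditer(tex):
        body = (match.group(1) or match.group(2)
                or match.group(3) or match.group(5))
        expressions.append(body.strip())
    return expressions


def characterize(tex):
    """Compute simple per-file statistics for one .tex file's text."""
    # Control characters other than tab/newline/carriage return hint at
    # binary data embedded in the file.
    suspicious = sum(1 for ch in tex if ord(ch) < 32 and ch not in "\t\n\r")

    no_comments = COMMENT_PATTERN.sub("", tex)
    expressions = extract_expressions(no_comments)

    # Words are counted outside math, after dropping command names.
    prose = MATH_PATTERN.sub(" ", no_comments)
    prose = COMMAND_PATTERN.sub(" ", prose)
    words = WORD_PATTERN.findall(prose)

    known_per_expression = [
        sum(1 for cmd in COMMAND_PATTERN.findall(expr) if cmd in KNOWN_SYMBOLS)
        for expr in expressions
    ]
    return {
        "num_words": len(words),
        "num_expressions": len(expressions),
        "expression_lengths": [len(expr) for expr in expressions],
        "known_symbols_per_expression": known_per_expression,
        "suspicious_characters": suspicious,
    }
```

Calling `characterize` on the text of each file and collecting the returned dictionaries gives the raw numbers for the distributions; the words-to-expressions ratio can be derived from `num_words` and `num_expressions` for files that contain at least one expression.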
This characterization step will be useful when comparing domains.
For example, if we sample another domain (e.g., quantum mechanics),
are the distributions similar or not?
If we see the same characterization, then we can expect that the
techniques developed here are likely to apply to a novel corpus.
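One way to make "are the distributions similar or not?" concrete is to compare the same statistic (for example, expression length in characters) across two domains with a two-sample Kolmogorov-Smirnov statistic. This is one possible choice of measure, not something the characterization above prescribes, and the function name `ks_statistic` is hypothetical. The sketch returns only the statistic (the largest gap between the two empirical cumulative distributions), not a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples.

    Returns a value between 0.0 (identical empirical distributions)
    and 1.0 (no overlap at all).
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    max_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to evaluate the gap at those values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap
```

A small value for every statistic of interest would support treating the two domains as similar; what counts as "small" depends on the sample sizes and would need to be chosen for the corpus at hand.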
Establishing that the sample being used is generic means we can work
with a smaller data set (rather than "all the .tex in arXiv"). Showing
that the distribution shape does not change as more .tex files are
added indicates that the sample is large enough for the statistics to
have converged.
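The check that the distribution shape stops changing as files are added can be sketched as follows. The approach here is an assumption: build a normalized histogram of one statistic over growing prefixes of the data and measure the total variation distance between successive histograms. The function names, the bin width, and the step size are all hypothetical, and the input values are assumed to be in random file order (shuffle them first if the files are sorted by date or category).

```python
from collections import Counter


def normalized_histogram(values, bin_width):
    """Map bin index -> fraction of `values` falling in that bin."""
    counts = Counter(int(v // bin_width) for v in values)
    total = len(values)
    return {bin_index: count / total for bin_index, count in counts.items()}


def total_variation(hist_p, hist_q):
    """Total variation distance between two normalized histograms (0 to 1)."""
    bins = set(hist_p) | set(hist_q)
    return 0.5 * sum(abs(hist_p.get(b, 0.0) - hist_q.get(b, 0.0)) for b in bins)


def convergence_curve(values, step, bin_width=10):
    """Distance between histograms of successive prefixes of `values`.

    Returns a list of (prefix_size, distance_to_previous_prefix) pairs.
    """
    curve = []
    previous = None
    for end in range(step, len(values) + 1, step):
        current = normalized_histogram(values[:end], bin_width)
        if previous is not None:
            curve.append((end, total_variation(previous, current)))
        previous = current
    return curve
```

If the distances in the curve shrink toward zero and stay there as the prefix grows, adding more files is no longer changing the shape of the distribution for that statistic.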
If we find a domain that doesn't have similar distributions, then we
can investigate why it is anomalous.