literature review for using arXiv as a corpus for analysis

Published 2020-05-31T02:44:00.001Z by Physics Derivation Graph

"Towards Machine-assisted Meta-Studies: The Hubble Constant"
"an approach for automatic extraction of measured values from the astrophysical literature, using the Hubble constant for our pilot study. Our rules-based model – a classical technique in natural language processing – has successfully extracted 298 measurements of the Hubble constant, with uncertainties, from the 208,541 available arXiv astrophysics papers."
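To make "rules-based" concrete, here is a minimal sketch of the kind of pattern matching such an extraction could start from. It is not the paper's actual model: the regular expression, the name H0_PATTERN, and the function extract_h0 are hypothetical, and a real system would need many more rules (asymmetric errors, units, separate statistical and systematic uncertainties).

```python
import re

# Hypothetical rule: match statements such as "H_0 = 73.2 \pm 1.3"
# or "H0 = 67.4 ± 0.5". This is an illustrative assumption, not the
# rule set used in the paper.
H0_PATTERN = re.compile(
    r"H_?\{?0\}?\s*=\s*"              # H0, H_0 or H_{0}, then "="
    r"(?P<value>\d+(?:\.\d+)?)\s*"    # central value
    r"(?:±|\+/-|\\pm)\s*"             # plus-or-minus in text or LaTeX
    r"(?P<error>\d+(?:\.\d+)?)"       # symmetric uncertainty
)


def extract_h0(text):
    """Return a list of (value, uncertainty) pairs found in text."""
    return [
        (float(m.group("value")), float(m.group("error")))
        for m in H0_PATTERN.finditer(text)
    ]
```

For example, calling extract_h0 on the string "We find $H_0 = 73.2 \pm 1.3$ km/s/Mpc" yields one pair, 73.2 with uncertainty 1.3. Sentences that state the value in words or with asymmetric errors are missed, which is why a full rules-based model needs a much larger rule set.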

"Scienceography: the study of how science is written" (2013)
Focused on characterizing how papers are written.
Separates out packages, comments, authors, and figures in the .tex source.
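As a rough illustration of separating out parts of a .tex source, the sketch below pulls package names and comments from a LaTeX string. It is an assumption about how such a split could be done, not the method used in the paper; the names USEPACKAGE, COMMENT, and split_tex are hypothetical, and authors and figures are left out for brevity.

```python
import re

# \usepackage with an optional [options] block, then {pkg1,pkg2}
USEPACKAGE = re.compile(r"\\usepackage(?:\[[^\]]*\])?\{([^}]*)\}")
# A "%" that is not escaped as "\%" starts a comment.
# Known limitation: "\\%" (a line break followed by a comment) is missed.
COMMENT = re.compile(r"(?<!\\)%(.*)")


def split_tex(source):
    """Return (packages, comments) found in a LaTeX source string."""
    packages, comments = [], []
    for line in source.splitlines():
        comment = COMMENT.search(line)
        if comment:
            text = comment.group(1).strip()
            if text:
                comments.append(text)
            # Drop the comment so commented-out packages are not counted.
            line = line[:comment.start()]
        for match in USEPACKAGE.finditer(line):
            names = (name.strip() for name in match.group(1).split(","))
            packages.extend(name for name in names if name)
    return packages, comments
```

Working line by line keeps the sketch simple, but it ignores verbatim environments and multi-line \usepackage declarations, so counts from it should be treated as approximate.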

"Transforming the arχiv to XML" (2008)

"An Architecture for Recovering Meaning in a LaTeX to OMDoc Conversion" (2009)
Undergraduate thesis by a student of Kohlhase.
Describes a processing pipeline for converting arXiv papers to OMDoc using LaTeXML.

"Delineating Fields Using Mathematical Jargon"

"On the Use of ArXiv as a Dataset" (2019)
Primarily a characterization of arXiv.

"Plagiarism Detection in arXiv"