(June 2014) The HathiTrust Research Center (HTRC) has announced the alpha release of a new dataset, consisting of page-level features extracted from a quarter-million text volumes.
HTRC Extracted Features Dataset:
Features are data attributes defined in such a way that they can be identified by a computer and analyzed at scale. For the HTRC Feature Extraction alpha dataset, the underlying text has already been processed: headers and footers are identified, hyphenated words are rejoined, and page-level details are provided, such as term-frequency counts per section (header/body/footer) per page; occurrences of terms as different parts of speech; line counts and sentence counts; and character counts at the start or end of lines.
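To make the idea of page-level features concrete, here is a minimal Python sketch that computes a few of the feature types listed above (term counts, line counts, and counts of the characters that begin and end lines) for a single page of plain text. It is an illustration only, not HTRC's extraction code or the dataset's actual schema: the function name `page_features`, the dictionary keys, and the crude tokenizer are all assumptions, and the sketch treats the whole page as one section and skips header/footer detection, hyphen rejoining, part-of-speech tagging, and sentence counting.

```python
from collections import Counter
import re


def page_features(page_text):
    """Compute a few illustrative page-level features for one page of text.

    Hypothetical helper: the key names below are made up for this sketch
    and do not reflect the HTRC dataset's real field names.
    """
    # Keep non-empty lines only, with surrounding whitespace stripped.
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    # Crude tokenizer (an assumption): lowercase runs of letters,
    # digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", page_text.lower())
    return {
        "line_count": len(lines),
        "token_count": len(tokens),
        "term_counts": Counter(tokens),
        # First and last character of each non-empty line.
        "begin_line_chars": Counter(ln[0] for ln in lines),
        "end_line_chars": Counter(ln[-1] for ln in lines),
    }


page = "The whale rose.\nThe sea was calm,\nand the ship sailed on."
features = page_features(page)
print(features["line_count"])
print(features["term_counts"].most_common(3))
```

Because features like these are counts rather than running text, they can be aggregated across many pages or volumes without reading the pages themselves.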
Read the announcement here.