Persistent Identifier
|
doi:10.18710/FVHTFM |
Publication Date
|
2024-11-26 |
Title
| Background data for: Advancing our understanding of dispersion measures in corpus research |
Author
| Sönning, Lukas (University of Bamberg) - ORCID: 0000-0002-2705-395X |
Point of Contact
|
Sönning, Lukas (University of Bamberg) |
Description
| Dataset description
This dataset contains background data and supplementary material for Sönning (forthcoming), a study that examines the behavior of dispersion measures when applied to text-level frequency data. For the literature survey reported in that study, which examines how dispersion measures are used in corpus-based work, the dataset includes tabular files listing the 730 research articles that were examined, as well as annotations for those studies that measured dispersion in the corpus-linguistic (and lexicographic) sense. For the corpus data that were used to train the statistical model parameters underlying the simulation study reported in that paper, the dataset contains a term-document matrix for the 49,604 unique word forms (after conversion to lower case) that occur in the Brown Corpus. In addition, R scripts are included that document in detail how the Brown Corpus XML files, which are available from the Natural Language Toolkit (Bird et al. 2009; https://www.nltk.org/), were processed to produce this data arrangement. (2023-12-19)
Abstract: Related publication
This paper offers a survey of recent corpus-based work, which shows that dispersion is typically measured across the text files in a corpus. Systematic insights into the behavior of measures in such distributional settings are currently lacking, however. After a thorough discussion of six prominent indices, we investigate their behavior on relevant frequency distributions, which are designed to mimic actual corpus data. Our evaluation considers different distributional settings, i.e. various combinations of frequency and dispersion values. The primary focus is on the response of measures to relatively high and low sub-frequencies, i.e. texts in which the item or structure of interest is over- or underrepresented (if not absent). We develop a simple method for constructing sensitivity profiles, which allow us to draw instructive comparisons among measures. We observe that these profiles vary considerably across distributional settings. While D and DP appear to show the most balanced response contours, our findings suggest that much work remains to be done to understand the performance of measures on items with normalized frequencies below 100 per million words. (2023-12-19) |
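Illustration (not part of the deposited files): the following minimal Python sketch shows how two of the measures named in the keywords, Gries' DP and Juilland's D, could be computed from one row of a term-document matrix. It assumes the conventional textbook definitions of both measures, which may differ in scaling or direction from the variants used in the related publication. The function names and the toy counts are hypothetical and are not taken from the dataset or from the author's R scripts.

```python
from math import sqrt


def gries_dp(counts, part_sizes):
    """Gries' DP in its conventional form: half the sum of absolute
    differences between observed and expected proportions per text.
    0 means perfectly even dispersion; values near 1 mean clumping."""
    total = sum(counts)
    corpus_size = sum(part_sizes)
    if total == 0:
        raise ValueError("item does not occur in the corpus")
    return 0.5 * sum(
        abs(c / total - s / corpus_size)
        for c, s in zip(counts, part_sizes)
    )


def juilland_d(counts, part_sizes):
    """Juilland's D in its conventional form: 1 - CV / sqrt(n - 1),
    where CV is the coefficient of variation (population SD / mean)
    of the per-text occurrence rates. 1 means perfectly even."""
    rates = [c / s for c, s in zip(counts, part_sizes)]
    n = len(rates)
    mean = sum(rates) / n
    if mean == 0:
        raise ValueError("item does not occur in the corpus")
    sd = sqrt(sum((r - mean) ** 2 for r in rates) / n)
    return 1 - (sd / mean) / sqrt(n - 1)


# Hypothetical toy data: four texts of 2,000 words each.
part_sizes = [2000, 2000, 2000, 2000]
even = [5, 5, 5, 5]        # item spread evenly across the texts
clumped = [20, 0, 0, 0]    # item confined to a single text

dp_even = gries_dp(even, part_sizes)
dp_clumped = gries_dp(clumped, part_sizes)
d_even = juilland_d(even, part_sizes)
d_clumped = juilland_d(clumped, part_sizes)
```

In this toy setting the evenly spread item sits at the "even" end of both scales and the clumped item at the opposite end. The two measures run in opposite directions under these conventional definitions, which is one reason comparisons among them need care.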
Subject
| Arts and Humanities |
Keyword
| dispersion
corpus linguistics
methodology
corpus design
Brown Corpus
dispersion measures
lexical dispersion
word importance
vocabulary lists
word frequency lists
text-level analysis
frequency
Juilland's D
Gries' DP
DA
English |
Related Publication
| Sönning, Lukas. Forthcoming. Advancing our understanding of dispersion measures in corpus research. Corpora. |
Language
| English |
Producer
| University of Bamberg https://www.uni-bamberg.de/eng-ling/ |
Production Date
| 2023-06-28 |
Production Location
| Bamberg, Germany |
Distributor
| The Tromsø Repository of Language and Linguistics (TROLLing) https://trolling.uit.no/ |
Depositor
| Sönning, Lukas |
Deposit Date
| 2023-12-19 |
Time Period
| Start Date: 1961-01-01 ; End Date: 1961-12-31 |
Date of Collection
| Start Date: 2023-06-14 ; End Date: 2023-06-28 |
Data Type
| textual linguistic data; corpus data; observational data |
Software
| MAXQDA Plus, Version: 22.5.0
R, Version: 4.2.1 |
Data Source
| A Standard Corpus of Present-Day Edited American English, for use with Digital Computers (the Brown Corpus). 1964, 1971, 1979. Compiled by W. N. Francis and H. Kučera. Brown University. Providence, Rhode Island.
Brown Corpus XML files are available from the Natural Language Toolkit (https://www.nltk.org).
The extracted words included in the data files of this dataset represent insubstantial portions of the Brown Corpus; they do not represent coherent stretches of text. Reuse of such excerpts is permitted under exceptions in IPR and database protection regulations, such as the Norwegian Copyright Act (cf. § 24 Eneretten til databaser), the EU Database Directive (cf. Art. 8, Rights and obligations of lawful users), and fair use (cf. US Copyright Act). |