Background data for: Advancing our understanding of dispersion measures in corpus research

Version 1.0

Sönning, Lukas, 2024, "Background data for: Advancing our understanding of dispersion measures in corpus research", https://doi.org/10.18710/FVHTFM, DataverseNO, V1

Description	Dataset description: This dataset contains background data and supplementary material for Sönning (forthcoming), a study that looks at the behavior of dispersion measures when applied to text-level frequency data. For the literature survey reported in that study, which examines how dispersion measures are used in corpus-based work, it includes tabular files listing the 730 research articles that were examined as well as annotations for those studies that measured dispersion in the corpus-linguistic (and lexicographic) sense. As for the corpus data that were used to train the statistical model parameters underlying the simulation study reported in that paper, the dataset contains a term-document matrix for the 49,604 unique word forms (after conversion to lower-case) that occur in the Brown Corpus. Further, R scripts are included that document in detail how the Brown Corpus XML files, which are available from the Natural Language Toolkit (Bird et al. 2009; https://www.nltk.org/), were processed to produce this data arrangement. (2023-12-19)

Abstract (related publication): This paper offers a survey of recent corpus-based work, which shows that dispersion is typically measured across the text files in a corpus. Systematic insights into the behavior of measures in such distributional settings are currently lacking, however. After a thorough discussion of six prominent indices, we investigate their behavior on relevant frequency distributions, which are designed to mimic actual corpus data. Our evaluation considers different distributional settings, i.e. various combinations of frequency and dispersion values. The primary focus is on the response of measures to relatively high and low sub-frequencies, i.e. texts in which the item or structure of interest is over- or underrepresented (if not absent). We develop a simple method for constructing sensitivity profiles, which allow us to draw instructive comparisons among measures. We observe that these profiles vary considerably across distributional settings. While D and DP appear to show the most balanced response contours, our findings suggest that much work remains to be done to understand the performance of measures on items with normalized frequencies below 100 per million words. (2023-12-19)
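
For orientation, the two measures named in the abstract, Juilland's D and Gries' DP, can be sketched in Python from an item's text-level frequencies. This is only an illustration using the commonly cited textbook formulas; it is not taken from the dataset's R scripts, and the paper may use differently scaled variants (for example, a reversed DP). The function names and the toy counts are assumptions made for this example.

```python
from math import sqrt


def gries_dp(counts, sizes):
    """Gries' DP: half the summed absolute differences between the observed
    share of occurrences per text and the share expected from text size.
    0 = perfectly even spread, values near 1 = highly concentrated."""
    total = sum(counts)
    corpus_size = sum(sizes)
    if total == 0:
        raise ValueError("item does not occur in the corpus")
    return 0.5 * sum(
        abs(v / total - n / corpus_size) for v, n in zip(counts, sizes)
    )


def juilland_d(counts, sizes):
    """Juilland's D: 1 minus the coefficient of variation of the per-text
    rates (population SD), scaled by sqrt(k - 1) for k texts.
    1 = perfectly even spread, 0 = all occurrences in a single text."""
    rates = [v / n for v, n in zip(counts, sizes)]
    k = len(rates)
    mean = sum(rates) / k
    if mean == 0:
        raise ValueError("item does not occur in the corpus")
    sd = sqrt(sum((r - mean) ** 2 for r in rates) / k)
    return 1 - (sd / mean) / sqrt(k - 1)


# Toy example (hypothetical counts): four texts of 100 words each.
sizes = [100, 100, 100, 100]
even = [10, 10, 10, 10]    # item spread evenly across texts
clumped = [40, 0, 0, 0]    # item confined to one text

print(gries_dp(even, sizes), juilland_d(even, sizes))
print(gries_dp(clumped, sizes), juilland_d(clumped, sizes))
```

Note that the two scales run in opposite directions in this conventional form: an evenly dispersed item scores 0 on DP but 1 on D.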
Subject	Arts and Humanities
Keyword	dispersion, corpus linguistics, methodology, corpus design, Brown Corpus, dispersion measures, lexical dispersion, word importance, vocabulary lists, word frequency lists, text-level analysis, frequency, Juilland's D, Gries' DP, DA, English
Related Publication	Sönning, Lukas. Forthcoming. Advancing our understanding of dispersion measures in corpus research. Corpora.
License/Data Use Agreement	Custom Dataset Terms

Files (6)

File	Format	Size	Published	MD5	Description
00ReadMe_understanding_dispersion.txt	Plain Text	14.9 KB	Nov 26, 2024	c23972673607ac093e7294d8d54c2286	File describing the dataset
2023-12-08_dispersion_survey_all_articles.tsv	Tab-Separated Values	47.6 KB	Nov 26, 2024	3364ac343f7cd4a8237e739aaf41f7b3	Tab-delimited data table containing the 730 research articles that entered our literature survey
2023-12-09_dispersion_survey.tsv	Tab-Separated Values	4.9 KB	Nov 26, 2024	9bcf7eb3fc40c94c0649e39f8952fa4f	Tab-delimited data table containing annotations for the 38 studies in our survey that assessed dispersion
brown_dtm.tsv	Tab-Separated Values	47.8 MB	Nov 26, 2024	9078aedb23cafe3679d0b0be2807cb59	Tab-delimited document-term matrix for the word forms in the Brown Corpus
brown_tdm.tsv	Tab-Separated Values	47.8 MB	Nov 26, 2024	a86da44e3100e3d6185c4929d8bc900b	Tab-delimited term-document matrix for the word forms in the Brown Corpus
script_brown_data_retrieval.qmd	Unknown	6.1 KB	Nov 26, 2024	cc79a805eb71773729042bd7b77d5dd8	R Quarto script documenting retrieval of the data from the Brown XML files
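
The MD5 values listed for the files can be used to check a download for corruption. The following is a minimal Python sketch using the standard library's `hashlib`. The helper names (`md5_of_file`, `verify_download`) and the assumption that the files sit in the current working directory are illustrative, not part of the dataset.

```python
import hashlib

# MD5 checksums as listed on the dataset page.
EXPECTED_MD5 = {
    "00ReadMe_understanding_dispersion.txt": "c23972673607ac093e7294d8d54c2286",
    "2023-12-08_dispersion_survey_all_articles.tsv": "3364ac343f7cd4a8237e739aaf41f7b3",
    "2023-12-09_dispersion_survey.tsv": "9bcf7eb3fc40c94c0649e39f8952fa4f",
    "brown_dtm.tsv": "9078aedb23cafe3679d0b0be2807cb59",
    "brown_tdm.tsv": "a86da44e3100e3d6185c4929d8bc900b",
    "script_brown_data_retrieval.qmd": "cc79a805eb71773729042bd7b77d5dd8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so that the
    47.8 MB matrices do not have to be held in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    import os

    # Assumes the files were downloaded into the current directory.
    for name, checksum in EXPECTED_MD5.items():
        if os.path.exists(name):
            status = "OK" if verify_download(name, checksum) else "MISMATCH"
        else:
            status = "not found"
        print(f"{name}: {status}")
```

A mismatch usually points to an incomplete download or to a file that was modified after retrieval (for instance, re-saved with different line endings).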

Citation Metadata

Persistent Identifier	doi:10.18710/FVHTFM
Publication Date	2024-11-26
Title	Background data for: Advancing our understanding of dispersion measures in corpus research
Author	Sönning, Lukas (University of Bamberg) - ORCID: 0000-0002-2705-395X
Point of Contact	Sönning, Lukas (University of Bamberg)
Language	English
Producer	University of Bamberg https://www.uni-bamberg.de/eng-ling/
Production Date	2023-06-28
Production Location	Bamberg, Germany
Distributor	The Tromsø Repository of Language and Linguistics (TROLLing) https://trolling.uit.no/
Depositor	Sönning, Lukas
Deposit Date	2023-12-19
Time Period	Start Date: 1961-01-01 ; End Date: 1961-12-31
Date of Collection	Start Date: 2023-06-14 ; End Date: 2023-06-28
Data Type	textual linguistic data; corpus data; observational data
Software	MAXQDA Plus, Version: 22.5.0; R, Version: 4.2.1
Data Source	A Standard Corpus of Present-Day Edited American English, for use with Digital Computers (the Brown Corpus). 1964, 1971, 1979. Compiled by W. N. Francis and H. Kučera. Brown University. Providence, Rhode Island. Brown Corpus XML files are available from the Natural Language Toolkit (https://www.nltk.org). The extracted words included in the data files of this dataset represent insubstantial portions of the Brown Corpus; they do not represent coherent stretches of text. Reuse of such excerpts is permitted under exceptions in IPR and database protection regulations, such as the Norwegian Copyright Act (cf. § 24 Eneretten til databaser), the EU Database Directive (cf. art 8 Rights and obligations of lawful users), and Fair use (cf. US Copyright Act).

Geospatial Metadata

Geographic Coverage	United States

Dataset Terms

License/Data Use Agreement

Our Community Norms as well as good scientific practices expect that proper credit is given via citation. Please use the data citation shown on the dataset page.

Custom Dataset Terms — the following Custom Dataset Terms have been defined for this dataset.

With the exception of the tab-delimited term-document matrix brown_tdm.tsv and the tab-delimited document-term matrix brown_dtm.tsv, the dataset "Background data for: Advancing our understanding of dispersion measures in corpus research" has been marked as dedicated to the public domain, as described here: https://creativecommons.org/publicdomain/zero/1.0/.

The tab-delimited term-document matrix brown_tdm.tsv and the tab-delimited document-term matrix brown_dtm.tsv contain word forms that have been extracted from the Brown Corpus, available from the Natural Language Toolkit (https://www.nltk.org/), under limitations and exceptions to IPR and database protection regulations. The contribution of the author of the present dataset to these files, as detailed in the ReadMe file, is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, as described here: https://creativecommons.org/licenses/by/4.0/. Reusers should note that this license does not apply to the word forms extracted from the Brown Corpus.
