Title: Race and ethnicity data for first, middle, and last names

Abstract: We provide the largest compiled publicly available dictionaries of first, middle, and last names for the purpose of imputing race and ethnicity using, for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are based on the voter files of six Southern states that collect self-reported racial data upon voter registration. Our data cover a much larger scope of names than any comparable dataset, containing roughly one million first names, 1.1 million middle names, and 1.4 million surnames. Individuals are categorized into five mutually exclusive racial and ethnic groups -- White, Black, Hispanic, Asian, and Other -- and racial/ethnic counts by name are provided for every name in each dictionary. Counts can then be normalized row-wise or column-wise to obtain conditional probabilities of race given name or name given race. These conditional probabilities can then be deployed for imputation in a data analytic task for which ground-truth racial and ethnic data are not available.
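
The row-wise and column-wise normalization mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three surnames, their counts, the in-memory layout (one list of five counts per name), and the helper names `race_given_name` and `name_given_race` are hypothetical and do not come from the released dictionaries.

```python
# Hypothetical toy counts; the names and numbers are invented for illustration
# and are NOT taken from the published dictionaries.
RACES = ["White", "Black", "Hispanic", "Asian", "Other"]

counts = {
    "SMITH":  [700, 250, 20, 10, 20],
    "GARCIA": [50, 10, 920, 10, 10],
    "NGUYEN": [10, 5, 5, 970, 10],
}


def race_given_name(counts):
    """Row-wise normalization: each name's counts divided by that name's total,
    giving an estimate of P(race | name)."""
    out = {}
    for name, row in counts.items():
        total = sum(row)
        # Guard against an all-zero row to avoid division by zero.
        out[name] = [c / total for c in row] if total else [0.0] * len(row)
    return out


def name_given_race(counts):
    """Column-wise normalization: each count divided by its race column's total,
    giving an estimate of P(name | race)."""
    n = len(RACES)
    col_totals = [sum(row[j] for row in counts.values()) for j in range(n)]
    return {
        name: [row[j] / col_totals[j] if col_totals[j] else 0.0 for j in range(n)]
        for name, row in counts.items()
    }


p_race_given_name = race_given_name(counts)
p_name_given_race = name_given_race(counts)
```

In this sketch each row of `p_race_given_name` sums to one across the five groups, while each race column of `p_name_given_race` sums to one across names. Which direction is needed depends on how the downstream imputation method combines the name information with other evidence.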
Subjects: Other Statistics (stat.OT); Machine Learning (cs.LG)
Cite as: arXiv:2208.12443 [stat.OT]
  (or arXiv:2208.12443v1 [stat.OT] for this version)

Submission history

From: Evan Rosenman
[v1] Fri, 26 Aug 2022 05:27:50 GMT (4105kb,D)
