Title: Gextext: Unsupervised Knowledge Modelling in Biomedical Literature

Authors: Robert O'Shea
Abstract: PURPOSE: Literature review is a complex task, requiring the expert analysis of unstructured data. Computational automation of this process presents a valuable opportunity for high-throughput knowledge extraction and meta-analysis. Currently available methods are limited to the detection of explicit and short-context relationships. We address this challenge with Gextext, which extracts a knowledge graph of latent relationships directly from unstructured text. METHODS: Let C be a corpus of n text chunks. Let V_target be a set of query terms and V_random a random selection of terms in C. Let X indicate the occurrence of V_target and V_random in C. Gextext learns a graph G(V, E) by correlation thresholding on the covariance matrix of X, where thresholds are estimated from the correlations with randomly selected terms. Gextext was benchmarked against GloVe in tasks where embedding distance matrices were correlated against real-world similarity matrices. A general corpus was generated from 5,000 randomly selected Wikipedia articles and a biomedical corpus from 961 research papers on stroke. RESULTS: Embeddings generated by Gextext preserved relative geographical distances between countries (Gextext: rho = 0.255, p < 2.22e-16; GloVe: rho = 0.086, p = 1.859e-09) and capital cities (Gextext: rho = 0.282, p < 2.22e-16; GloVe: rho = 0.093, p = 8.0805e-11). Gextext embeddings organised drug names by shared target (Gextext: rho = 0.456, p < 2.22e-16; GloVe: rho = 0.091, p = 0.00087) and stroke phenotypes by body system (Gextext: rho = 0.446, p < 2.22e-16; GloVe: rho = 0.129, p = 1.7464e-11). CONCLUSIONS: Gextext extracts latent relationships from unstructured text, enabling fully unsupervised automation of the literature review process.
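
The METHODS step can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not say how term occurrence is detected or how the threshold is derived from the random-term correlations, so the substring matching, the binary encoding of X, and the `quantile` parameter below are all assumptions made for illustration, as are the function names `occurrence_matrix` and `learn_graph`.

```python
import numpy as np


def occurrence_matrix(chunks, terms):
    """Binary matrix X with X[i, j] = 1 if terms[j] occurs in chunks[i].

    Assumption: occurrence is a case-insensitive substring match.
    """
    lowered = [chunk.lower() for chunk in chunks]
    return np.array(
        [[1.0 if term.lower() in chunk else 0.0 for term in terms]
         for chunk in lowered]
    )


def learn_graph(chunks, target_terms, random_terms, quantile=0.95):
    """Return (edges, threshold) for a graph over the target terms.

    Assumption: the threshold is an upper quantile of the correlations
    between target terms and randomly selected terms.
    """
    target_terms = list(target_terms)
    random_terms = list(random_terms)
    X = occurrence_matrix(chunks, target_terms + random_terms)

    # Correlation between term columns; constant columns produce NaN,
    # which is treated here as zero correlation.
    with np.errstate(invalid="ignore", divide="ignore"):
        R = np.corrcoef(X, rowvar=False)
    R = np.nan_to_num(R, nan=0.0)

    k = len(target_terms)
    null_correlations = R[:k, k:]  # target-vs-random block
    threshold = float(np.quantile(null_correlations, quantile))

    # Keep an edge between two target terms only if their correlation
    # exceeds what is seen against random terms.
    edges = [
        (target_terms[i], target_terms[j], float(R[i, j]))
        for i in range(k)
        for j in range(i + 1, k)
        if R[i, j] > threshold
    ]
    return edges, threshold
```

Calling `learn_graph(chunks, target_terms, random_terms)` on a list of text chunks returns the retained edges as `(term_a, term_b, correlation)` tuples together with the estimated threshold. Raising `quantile` makes the graph sparser.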
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Cite as: arXiv:1911.02562 [cs.DL]
  (or arXiv:1911.02562v1 [cs.DL] for this version)

Submission history

From: Robert O'Shea
[v1] Wed, 6 Nov 2019 10:57:38 GMT (516kb)
[v2] Tue, 17 Dec 2019 10:15:39 GMT (379kb)