3 Wikidata and Open Citations
Jere Odell; Mairelys Lemus-Rojas; and Lucille Brys
Open Citations
Before defining the concept of “open citations,” let us pause to think about what we mean when we talk about citations in scholarly communication librarianship. A “citation” is a reference that an author makes to another work; it accounts for a source of information that the author notes, disputes, or otherwise uses in a new work. To a degree, these references also give readers enough information to find the cited source, including, perhaps, the author, date, title, publication or publisher, and the relevant page number(s).
Each of the references (or citations) in a work also forms a data element: at its most basic level, a reference becomes an item in a list. Lists of citations to and from works have long been valuable tools for readers. These have been published as “indexes” (especially for religious and common law use) for hundreds of years (Hass Weinberg, 1997). In parallel with the availability of computing technologies in the mid-twentieth century, Eugene Garfield proposed a citation index for science in 1955 (Small, 2018). In 1963, Garfield launched the Science Citation Index, the precursor for what would become a widely used suite of citation databases, the Web of Science (currently owned by Clarivate). Along with later systems, like Scopus and Dimensions, these indexes and databases provide a proprietary, closed system that enables citation metrics, citation networking, citation-based studies of disciplinary development, and citation-based rankings. This set of methods and measures is broadly known as bibliometrics, scientometrics, or informetrics (Hood & Wilson, 2001).
Although proprietary systems like Web of Science and Scopus provide database-driven citation linking, they are far from “open.” These databases are paywalled, limited to a controlled list of sources, and largely closed to contributors. In contrast, “open citations” are neither a proprietary database nor closed to new sources. In other words, any publication that meets the technical requirements can make its references available as open citations.
Open citations, however, are more than a reference list that anyone with internet access can read. They are also linked data that can be discovered and displayed by search tools. To be truly an “open citation,” the elements of a reference must be structured (in a machine-readable format), separate (online apart from the original source document), and open (available for reuse without restriction). In addition, to complete the linkages of a fully networked body of open citations, the cited and citing works must be identifiable, with metadata that can be retrieved from a persistent identifier, such as a DOI (Digital Object Identifier) or a Handle (Peroni & Shotton, 2018).
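To make these criteria concrete, the following minimal Python sketch represents a single citation as a small, machine-readable record that lives apart from the source document. The class name, the field names, the license label, the loose DOI pattern, and both DOIs are assumptions made for illustration; they are not taken from Peroni and Shotton's definition or from any real citation.

```python
import json
import re
from dataclasses import asdict, dataclass

# Loose, illustrative check for the general shape of a DOI (an assumption,
# not a full validator): "10.", a registrant code, "/", and a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


@dataclass(frozen=True)
class OpenCitation:
    """One citing-to-cited link, stored separately from the source document."""

    citing_doi: str
    cited_doi: str
    # Hypothetical label standing in for "available for reuse without restriction".
    license: str = "CC0-1.0"

    def is_identifiable(self) -> bool:
        """True when both ends of the link look like persistent identifiers."""
        return all(DOI_PATTERN.match(d) for d in (self.citing_doi, self.cited_doi))

    def to_json(self) -> str:
        """Serialize the record in a structured, machine-readable format."""
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder DOIs, invented for this example only.
citation = OpenCitation(
    citing_doi="10.1234/citing-example",
    cited_doi="10.5678/cited-example",
)
print(citation.to_json())
```

A record like this carries no bibliographic detail of its own; the author, title, and date of each work would be looked up through the identifiers, which is why the definition requires that both works be identifiable.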
Open Citations for Scholarly Communication
Citations provide authors with a method for attribution, readers with a tool for discovering related works, and researchers with an indicator of influence. As such, they are a valued feature of creating and sharing knowledge. An open access movement without attention to citations would be incomplete. Likewise, libraries that value open access to information could find reasons to participate in supporting or contributing to open citation initiatives. Libraries with scholarly communication services, however, may also find additional strategic motivations and practical applications for including a focus on open citations in their work. These may include the cost of knowledge, equity in documentation, and research metric services.
The cost of knowledge: Proprietary, closed, citation indexing and citation ranking systems are expensive on many fronts. Academic libraries and the institutions that support them purchase subscriptions to these citation databases at great cost. As with many online service providers, academic information companies now treat data about users as an add-on or even a primary source of income. As the open access publishing model grows (particularly in the for-profit publishing industry), companies are turning to citation data and citation-based rankings to shore up their profits. For example, Elsevier, one of the largest academic publishers, now brands itself not as a “publisher” but as “an information analytics business” (Aspesi & SPARC, 2019). These products (like any paywalled information source) are less likely to be accessible to scholars at smaller or less affluent institutions, and they also have downstream effects that contribute to the rising cost of knowledge. Citation-based ranking systems contribute to economies of prestige. Journals with ascending citation-based rankings increase the price of their publishing fees and their subscriptions (Guédon, 2017). At the same time, university ranking systems also gather data from proprietary citation databases and journal ranking metrics. As a result, universities that wish to rise in the rankings are likely to put direct or indirect pressure on their faculty to publish in “prestigious” or highly ranked journals. This pressure to publish in expensive, “prestigious” journals, particularly for researchers at less affluent universities, fosters incentives for gaming the system. Researchers are more likely to study topics that will get published in ranked journals and to take shortcuts to get there (Brembs, 2018). Meanwhile, those who pursue careful approaches to research at less affluent institutions or in disciplines that receive less funding may be priced out of the academic prestige economy.
This feedback loop excludes new forms of scholarship and stifles innovation in research and publishing. As our world faces global challenges to well-being (including climate change, authoritarianism, and pandemics), scientists, educators, and policy makers need access to knowledge as both consumers and producers. Exclusive, proprietary citation data systems are a barrier to developing the solutions to our global problems (Posada & Chen, 2018). In this way, closed citations are an expense that can no longer be sustained.
Equity in documentation: Because proprietary citation databases are closed and controlled, it is difficult to add a missed citation or a new source. These companies often have criteria for inclusion that exclude both new publishers and journals with infrequent publications. These companies are also more likely to include journals if they can demonstrate a strong citation record. Given that works authored by people who appear to have women’s names are less likely to be cited, this means that fields that attract women authors will struggle to get their publications indexed (Larivière & Sugimoto, 2017). The under-representation of women and scholars working on marginalized knowledge in citations has compounding effects. When authors and their works are excluded from these tools, they will be less likely to be discovered, cited, and recognized for the value of their contributions to knowledge. Their data-driven profiles will be underpopulated and sources like Wikipedia will be less likely to consider them to be notable. The social effects of the under-representation of women and marginalized scholars contribute to a culture of bias that discourages and limits full participation in knowledge production. Although open citations cannot solve these broader problems, they can be more inclusive. Editors of works in emerging disciplines, for example, can choose to make their citation data open. Likewise, library metadata specialists can participate in targeted efforts to contribute open data about women authors and their works (Lemus-Rojas & Odell, 2018). These direct efforts to address inequities in bibliographic documentation are less possible in closed citation indexes.
Research metrics services: Libraries that support research institutions are sometimes asked to assist with citation-based metrics for individuals, labs, departments, and schools. In addition to the cost, proprietary, closed, citation tools make providing these services difficult. Branded citation metrics are unique to the closed databases that feed them. In other words, a metric like a Journal Impact Factor differs from a similar metric, such as a CiteScore. They rely on different databases and use different criteria for inclusion and different equations for calculating the score. Even if a particular metric is widely used, its true meaning and value are limited to the scope of the underlying data. No single metric or metrics provider can provide a comprehensive and completely reliable metrics report for a group of scholars working in a variety of fields. The movement to create a broad corpus of open citation data has begun to surpass the data available in any single proprietary citation index (Martín-Martín, 2021). Open citation data enables new metrics tools that can, for example, better address the unique scopes of specific research fields or be more inclusive of the global research landscape. When these tools are open source (along with the citation data), they contribute to improving science policy by making bibliometric studies reproducible and easier to share with decision makers (Hutchins, 2021; Sugimoto et al., 2017).
Open Citation Projects
Early open citation efforts focused on specific disciplines. For example, CiteSeerX provides open citations in computer science and related fields and CitEc focuses on references in economics. A more generalist approach is represented by a research project started in 1999 with support from the JISC and NSF, the Open Citation project (Hitchcock et al., 2002). This three-year project provided some of the foundations for the idea of making all citations open. However, as a grant-funded research project, it could not by itself provide for long-term sustainability. A sustainable movement for open citations requires both technical infrastructure and community engagement (Shotton, 2013). Four efforts, described below, have made significant steps toward building a broad, sustainable, multidisciplinary open citation movement: the Open Citation Corpus, the Initiative for Open Citations, iCite, and Refcat.
Open Citation Corpus: In 2010, JISC funded two grants led by David Shotton to launch a repository of open citation data, the Open Citation Corpus (Shotton, 2013). The first version, a collection of 6.3 million references to 3.4 million articles, was released in 2011. Beginning in 2016, the effort expanded to include articles indexed in Europe PubMed Central supplemented with metadata from CrossRef and ORCID (Open Citations – About, n.d.). In 2018, Open Citations released the first version of a dataset based on CrossRef’s open DOI-to-DOI citations, COCI (Peroni, 2018). The CrossRef Open Citation Index has been updated regularly and now includes more than 1.2 billion citations to over 71 million works (OpenCitations – COCI, n.d.). The Open Citation Corpus provides access to its databases through SPARQL endpoints, APIs, and search interfaces.
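As a rough illustration of API access, the Python sketch below builds a request for the citations that point at a given DOI and pulls the citing DOIs out of a decoded response. The base URL, the `citations` operation, and the `citing` field name are assumptions based on how the COCI REST API has been documented; the service may have moved or changed, so check the OpenCitations site before relying on any of them. The function names are hypothetical, and `fetch_citations` is defined but not called here because it needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the COCI REST API; verify against current documentation.
COCI_API = "https://opencitations.net/index/coci/api/v1"


def citations_url(doi: str) -> str:
    """Build the URL that asks for all citations pointing at the given DOI."""
    return f"{COCI_API}/citations/{quote(doi, safe='/')}"


def citing_dois(records: list[dict]) -> list[str]:
    """Collect the citing DOI from each record (assumes a 'citing' field)."""
    return [record["citing"] for record in records if "citing" in record]


def fetch_citations(doi: str) -> list[dict]:
    """Request the citation records over the network and decode the JSON."""
    with urlopen(citations_url(doi), timeout=30) as response:
        return json.load(response)


# Builds the request URL for a DOI from this chapter's reference list
# (Shotton, 2013) without contacting the service.
print(citations_url("10.1038/502295a"))
```

Calling `citing_dois(fetch_citations("10.1038/502295a"))` would, if the assumptions above still hold, return the DOIs of works that cite that article.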
Initiative for Open Citations (I4OC): Following up on an idea presented at the 2016 Conference of the Open Access Scholarly Publishing Association (COASP), I4OC was created as an advocacy organization to build momentum among publishers and other stakeholders for open citations (I4OC, “Initiative for Open Citations”). Six organizations joined as foundational members of the initiative and announced its launch in 2017: OpenCitations, the Wikimedia Foundation, PLOS, eLife, DataCite, and the Centre for Culture and Technology at Curtin University (I4OC, 2017). Working with Open Citations as a partner, I4OC has successfully encouraged a long list of publishers to make their CrossRef citations open. Prior to the efforts of I4OC and Open Citations, fewer than 1% of journals with CrossRef DOIs provided open citations (Schiermeier, 2017). In contrast, by October 2021, 88% of journals with CrossRef DOIs provided open citations (Martín-Martín, 2021).
iCite: The U.S. National Library of Medicine (NLM) makes an open citation collection of biomedical research from PubMed Central (PMC), MEDLINE, Entrez, and CrossRef available in the iCite database. The National Institutes of Health Open Citation Collection (NIH-OCC) began with the realization that the influence of NIH-supported work could be made visible if citations to PMC articles were open. The NLM uses open citations between PMC articles and supplements these with CrossRef metadata to find citations between the larger collection of PubMed MEDLINE articles. Using machine learning, the project also makes citations open when they are found in open works discovered by Unpaywall. Altogether, as of July 2019, iCite had made over 420 million citations open (Hutchins et al., 2019).
Refcat: A project of the Internet Archive (IA), Refcat shares open citations extracted automatically from fatcat, the bibliographic catalog that supports the IA. Refcat provides over 1.3 billion open citations. Although the overlap in coverage with the Open Citation Corpus is large, Refcat provides an additional 257 million open citations with DOIs that were not previously included in Open Citations COCI data. Refcat also provides “non-traditional” citations, including over 1.3 million citations from English Wikipedia and 20 million citations from works in Open Library (Czygan et al., 2021). Because the Internet Archive includes web pages, Refcat also provides 14 million citations to archived web pages (jefferson, 2021).
The combined data sets of COCI and the NIH-OCC now include more than half of all DOI-to-DOI citations and surpass what is available from Web of Science (Martín-Martín, 2021). The addition of open citations from Refcat to this tally should make the total number of open citations greater than what can be found in proprietary databases, such as Dimensions and Scopus. As a result, citation data may be approaching a “tipping point” where the incentives for keeping citations behind a paywall are no longer meaningful (Hutchins, 2021).
How does Wikidata complement and contribute to open citation initiatives?
Wikidata provides both an interface to discover open citations and a tool for creating and using them. While the efforts described above rely heavily on machine learning and large-scale computing projects, Wikidata is open for anyone to contribute. Many articles and their references have been added by users who have created bots to harvest records from Europe PMC and other open databases, but both the Wikidata interface and tools like the Zotero Wikidata translator for QuickStatements provide ways for individuals to contribute open citation data related to a topic or project of their choice. The open nature of Wikidata (as a knowledge base that “anyone can edit”) and the wide range of both bibliographic and non-bibliographic properties in the site enable linkages across a diversity of topics and new ways of creating and displaying citation data. For example, with Wikidata, one could use P21 (the property for sex or gender) to track gender bias in citation networks or P356 (the property for DOI) to find citations to or from a specific DOI prefix. In other words, Wikidata lowers the bar for participating in customized, linked-data projects involving open citations.
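The two examples just mentioned can be sketched as SPARQL queries, built here as Python strings. This is a minimal sketch under several assumptions: besides P21 and P356, it relies on two Wikidata properties not named above, P2860 (cites work) and P50 (author); the function names are hypothetical; the endpoint URL is the general Wikidata Query Service address, which may not be the one that currently serves scholarly article items; and a broad DOI-prefix filter may time out on the public service. DOIs are upper-cased because that is how Wikidata conventionally stores them. `run_query` is defined but not called, since it needs network access.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; scholarly items may be served from a different address.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def citations_to_doi_prefix(prefix: str, limit: int = 20) -> str:
    """Query for works that cite items whose DOI (P356) starts with a prefix."""
    prefix = prefix.upper()  # Wikidata conventionally stores DOIs in upper case
    if '"' in prefix or "\\" in prefix:
        raise ValueError("prefix must not contain quotes or backslashes")
    return f"""
SELECT ?citing ?cited ?doi WHERE {{
  ?cited wdt:P356 ?doi .
  FILTER(STRSTARTS(?doi, "{prefix}"))
  ?citing wdt:P2860 ?cited .
}}
LIMIT {int(limit)}
""".strip()


def cited_author_gender_query(work_qid: str) -> str:
    """Query that tallies, by P21 value, the authors of works cited by one work."""
    if not QID_PATTERN.match(work_qid):
        raise ValueError(f"not a Wikidata item identifier: {work_qid!r}")
    return f"""
SELECT ?gender ?genderLabel (COUNT(?author) AS ?citedAuthors) WHERE {{
  wd:{work_qid} wdt:P2860 ?cited .
  ?cited wdt:P50 ?author .
  ?author wdt:P21 ?gender .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
GROUP BY ?gender ?genderLabel
""".strip()


def run_query(query: str) -> list[dict]:
    """Send a query to the endpoint and return the JSON result bindings."""
    url = f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"
    request = Request(url, headers={"User-Agent": "open-citations-sketch/0.1"})
    with urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]


# Print the first query for a DOI prefix; nothing is sent over the network.
print(citations_to_doi_prefix("10.1371/"))
```

Only authors who have their own Wikidata items with a P21 statement are counted by the second query, so any tally it returns describes the state of the data as much as the citation behavior it is meant to measure.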
Activity
- In your Wikidata Preferences, find Gadgets. Enable: “relateditems.” This gadget adds a button to the bottom of an item’s page to display inverse statements.
- After enabling the gadget, think about a department, research center, or author from your current or former university whose research may be under-represented in for-profit citation tools (like Web of Science or Scopus). Use the Wikidata search bar to try to find an author.
- If you have enabled “relateditems,” scroll to the bottom of the author’s entry and click on “show derived statements.” A list of publications (when available) will display under the label “author of.” How many works are linked to this author in Wikidata?
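The count in the last step can also be approximated with a query. The Python sketch below builds a SPARQL query for works whose author statement (P50, a property not named in the activity) points at a given item, and counts distinct works in a result set. The function names are hypothetical, the result layout assumed by `count_works` is the standard SPARQL JSON bindings format, and Q42 is only a placeholder identifier to be replaced with the author found in the activity.

```python
import re

QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def works_by_author_query(author_qid: str) -> str:
    """Build a query listing works whose author (P50) is the given item."""
    if not QID_PATTERN.match(author_qid):
        raise ValueError(f"not a Wikidata item identifier: {author_qid!r}")
    return f"""
SELECT ?work ?workLabel WHERE {{
  ?work wdt:P50 wd:{author_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def count_works(bindings: list[dict]) -> int:
    """Count distinct works in SPARQL JSON result bindings."""
    return len({binding["work"]["value"] for binding in bindings})


# Q42 is a placeholder; substitute the identifier of the author you found.
print(works_by_author_query("Q42"))
```

The query text can be pasted into the Wikidata Query Service interface. Works in which the author appears only as a name string, rather than as a linked item, will not be counted, which is one reason the number may be lower than expected.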
Additional Resources
To learn more about open citations, consider consulting the following presentations: The Wikimedia ecosystem as a key component of an open science landscape, Citations in the wild, and how we are taming them, and Unlocking references from the literature: The Initiative for Open Citations. In addition, the blog post Understanding the implications of Open Citations—How far along are we? offers a perspective on the landscape of open citations.
References
Aspesi, C. & SPARC (Scholarly Publishing and Academic Resources Coalition). (2019). The Academic Publishing Industry in 2018. Landscape analysis. https://infrastructure.sparcopen.org/landscape-analysis/the-academic-publishing-industry-in-2018
Brembs, B. (2018). Prestigious science journals struggle to reach even average reliability. Frontiers in Human Neuroscience, 12. https://www.frontiersin.org/article/10.3389/fnhum.2018.00037
Guédon, J.-C. (2017). Open access: toward the internet of the mind. https://apo.org.au/node/74479
Hass Weinberg, B. (1997). The earliest Hebrew citation indexes. Journal of the American Society for Information Science, 48(4), 318–30. https://doi.org/10.1002/(SICI)1097-4571(199704)48:4<318::AID-ASI5>3.0.CO;2-Z
Hitchcock, S., Bergmark, D., Brody, T., Gutteridge, C., Carr, L., Hall, W., Lagoze, C., & Harnad, S. (2002). Open citation linking: The way forward. D-Lib Magazine, 8(10). https://doi.org/10.1045/october2002-hitchcock
Hood, W. W., & Wilson, C. S. (2001). The literature of bibliometrics, scientometrics, and informetrics. Scientometrics, 52(2), 291–314. https://doi.org/10.1023/A:1017919924342
Hutchins, B. I. (2021). A tipping point for open citation data. Quantitative Science Studies, 2(2), 433–437. https://doi.org/10.1162/qss_c_00138
Hutchins, B. I., Baker, K. L., Davis, M. T., Diwersy, M. A., Haque, E., Harriman, R. M., Hoppe, T. A., Leicht, S. A., Meyer, P., & Santangelo, G. M. (2019). The NIH Open Citation Collection: A public access, broad coverage resource. PLOS Biology, 17(10), e3000385. https://doi.org/10.1371/journal.pbio.3000385
I4OC. (n.d.). Initiative for Open Citations. I4OC. Retrieved May 29, 2022 from https://i4oc.org/
I4OC. (2017). Initiative for Open Citations (I4OC) launches with early success. I4OC. https://i4oc.org/press.pdf
jefferson. (2021, October 19). Internet Archive releases Refcat, the IA Scholar Index of over 1.3 billion scholarly citations. Internet Archive Blogs. https://blog.archive.org/2021/10/19/internet-archive-releases-refcat-the-ia-scholar-index-of-over-1-3-billion-scholarly-citations/
Larivière, V. & Sugimoto, C. (2017, March 27). The end of gender disparities in science? If only it were true… CWTS. https://www.cwts.nl:443/blog?article=n-q2z294
Lemus-Rojas, M. & Odell, J. D. (2018). Building bridges with structured linked data at IUPUI University Library. InULA Notes: Indiana University Librarians Association, 30(2): 37–39. https://hdl.handle.net/1805/17975
Martín-Martín, A. (2021, October 27). Coverage of open citation data approaches parity with Web of Science and Scopus. OpenCitations blog. https://opencitations.wordpress.com/2021/10/27/coverage-of-open-citation-data-approaches-parity-with-web-of-science-and-scopus/
OpenCitations – About. (n.d.). OpenCitations. Retrieved May 29, 2022, from https://opencitations.net/about
OpenCitations – COCI, the OpenCitations Index of Crossref Open DOI-to-DOI citations. (n.d.). OpenCitations. Retrieved May 29, 2022, from https://opencitations.net/index/coci
Peroni, S. (2018, July 12). COCI, the OpenCitations Index of Crossref Open DOI-to-DOI References. OpenCitations blog. https://opencitations.wordpress.com/2018/07/12/coci/
Peroni, S., & Shotton, D. (2018, June 27). Open Citation: Definition. Figshare. https://doi.org/10.6084/M9.FIGSHARE.6683855
Posada, A. & Chen, G. (2018). Inequality in Knowledge Production: The Integration of Academic Infrastructure by Big Publishers. ELPUB 2018, Toronto, Canada. https://elpub.architexturez.net/doc/az-cf-188554
Schiermeier, Q. (2017). Initiative aims to break science’s citation paywall. Nature. https://doi.org/10.1038/nature.2017.21800
Shotton, D. (2013). Publishing: Open citations. Nature, 502(7471), 295–297. https://doi.org/10.1038/502295a
Small, H. (2018). Citation Indexing Revisited: Garfield’s Early Vision and Its Implications for the Future. Frontiers in Research Metrics and Analytics, 3. https://www.frontiersin.org/article/10.3389/frma.2018.00008
Sugimoto, C. R., Waltman, L., Larivière, V., van Eck, J. N., Boyack, K. W., Wouters, P., & de Rijcke, S. (2017). Open citations: A letter from the scientometric community to scholarly publishers. https://www.issi-society.org/open-citations-letter/