Title: Import, Assemble, and Deduplicate Bibliographic Datasets
Description: A critical first step in systematic literature reviews and mining of academic texts is to identify relevant texts from a range of sources, particularly databases such as 'Web of Science' or 'Scopus'. These databases often export in different formats or with different metadata tags. 'synthesisr' expands on the tools outlined by Westgate (2019) <doi:10.1002/jrsm.1374> to import bibliographic data from a range of formats (such as 'bibtex', 'ris', or 'ciw') in a standard way, and allows merging and deduplication of the resulting dataset.
Authors: Martin Westgate [aut, cre]
Maintainer: Martin Westgate <[email protected]>
License: GPL-3
Version: 0.3.0
Built: 2025-02-21 02:22:06 UTC
Source: https://github.com/mjwestgate/synthesisr
Systematic review searches include multiple databases that export results in a variety of formats, with overlap in coverage between databases. To streamline the process of importing, assembling, and deduplicating results, synthesisr recognizes bibliographic files exported from databases commonly used for systematic reviews and merges results into a standardized format.

The key task performed by synthesisr is flexible import and presentation of bibliographic data. This is typically achieved by read_refs(), which can import multiple files at once and link them together into a single data.frame (a short example follows the list below). Conversely, export is via write_refs(). Users who require more detailed control can use the following functions:
read_refs                  Read bibliographic data
write_refs                 Write bibliographic data
detect_                    Detect file attributes
parse_                     Parse a vector containing bibliographic data
clean_                     Cleaning functions for author and column names
code_lookup                A dataset of potential ris tags
bibliography               Methods for class bibliography
format_citation            Return a clean citation from a bibliography or data.frame
add_line_breaks            Set a maximum character width for strings
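For instance, a minimal sketch of import and re-export (the file names here are hypothetical):

# import and merge results exported from two databases
refs <- read_refs(
  filename = c("scopus_export.ris", "wos_export.bib"),
  return_df = TRUE
)
# write the combined dataset back out in ris format
write_refs(refs, file = "combined_results.ris", format = "ris")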
When importing from multiple databases, it is likely that there will be duplicates in the resulting dataset. The easiest way to deal with this problem in synthesisr is the deduplicate() function; but this can be risky, particularly if there are no DOIs in the dataset. To get finer control of the deduplication process, consider using the sub-functions (a sketch of this workflow follows the list):

deduplicate                Semi-automated duplicate removal
find_duplicates            Locate potentially duplicated references
extract_unique_references  Return a data.frame with only 'unique' references
review_duplicates          Manually review potential duplicates
override_duplicates        Manually override identified duplicates
fuzz_                      Fuzzy string matching c/o fuzzywuzzy
string_                    Fuzzy string matching c/o stringdist
merge_columns              Synonymous with dplyr::bind_rows
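A minimal sketch of that finer-grained workflow, assuming refs is a data.frame with a title column (the overridden group number is hypothetical):

# locate potential duplicates by fuzzy-matching titles
matches <- find_duplicates(
  refs$title,
  method = "string_osa",
  to_lower = TRUE,
  rm_punctuation = TRUE
)
# inspect the groups by eye
review_duplicates(refs$title, matches)
# mark group 2 (hypothetical) as not a true duplicate
matches <- override_duplicates(matches, overrides = 2)
# keep one entry per remaining group
unique_refs <- extract_unique_references(refs, matches = matches)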
Maintainer: Martin Westgate <[email protected]>
Authors: Eliza Grames <[email protected]>
Useful links: https://github.com/mjwestgate/synthesisr
This function takes a vector of strings and adds line breaks every n characters. Primarily built to be called internally by format_citation(), this function has been made available as it can be useful in other contexts.
add_line_breaks(x, n = 50, max_n = NULL, html = FALSE, max_time = NULL)
x: Either a string or a vector; if the vector is not of class character it will be coerced to one using as.character().
n: Numeric: the desired number of characters that should separate consecutive line breaks.
max_n: DEPRECATED: if provided, will currently overwrite n.
html: Logical: should the line breaks be specified in html?
max_time: DEPRECATED: previously the maximum amount of time (in seconds) allowed to adjust groups until character thresholds were reached. Ignored.
Line breaks are only added between words, so the value of n is actually a threshold value rather than being matched exactly.
Returns the input vector unaltered except for the addition of line breaks.
add_line_breaks(c("On the Origin of Species"), n = 10)
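The html argument switches the breaks to html line-break tags, which is useful when citations are displayed in a browser:

add_line_breaks("On the Origin of Species", n = 10, html = TRUE)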
This is a small number of standard methods for interacting with class 'bibliography'. More may be added later.
## S3 method for class 'bibliography'
summary(object, ...)
## S3 method for class 'bibliography'
print(x, n, ...)
## S3 method for class 'bibliography'
x[n]
## S3 method for class 'bibliography'
c(...)
## S3 method for class 'bibliography'
as.data.frame(x, ...)
as.bibliography(x, ...)
## S3 method for class 'bibliography'
as_tibble(x, ..., .rows, .name_repair, rownames)
object: An object of class 'bibliography'
...: Any further information
x: An object of class 'bibliography'
n: Number of items to select/print
.rows: Currently ignored
.name_repair: Currently ignored
rownames: Currently ignored
Methods for class bibliography return an object of the appropriate class: for example, a data.frame, tibble, or vector.
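A minimal sketch of these methods, assuming as.bibliography() accepts a data.frame of references:

df <- data.frame(
  title = "On the Origin of Species",
  author = "Darwin, Charles",
  year = "1859"
)
bib <- as.bibliography(df) # convert to class 'bibliography'
summary(bib)
print(bib, n = 1)
bib[1]                     # subset as for a list
as.data.frame(bib)         # and convert back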
Cleans column and author names.
clean_df(data)
clean_authors(x)
clean_colnames(x)
data: A data.frame containing bibliographic data
x: A vector of strings
Returns the input, but cleaner.
df <- data.frame(
  X..title. = c(
    "EviAtlas: a tool for visualising evidence synthesis databases",
    "revtools: An R package to support article screening for evidence synthesis",
    "An automated approach to identifying search terms for systematic reviews",
    "Reproducible, flexible and high-throughput data extraction from primary literature"
  ),
  YEAR = c("2019", "2019", "2019", "2019"),
  authors = c(
    "Haddaway et al",
    "Westgate",
    "EM Grames AND AN Stillman & MW Tingley and CS Elphick",
    "Pick et al"
  )
)
clean_df(df)

# or use sub-functions
colnames(df) <- clean_colnames(df)
# colnames(df) <- clean_colnames(colnames(df)) # also works
df$authors <- clean_authors(df$authors)
A data frame that can be used to look up common codes for different bibliographic fields across databases and merge them to a common format.
code_lookup
A data.frame with 226 obs. of 12 variables, describing:
code used in search results
the order in which to rank fields in assembled results
type of bibliographic data
description of field
bibliographic field that codes correspond to
logical: if the code is used in generic ris files
logical: if the code is used in Web of Science ris files
logical: if the code is used in PubMed ris files
logical: if the code is used in Scopus ris files
logical: if the code is used in Academic Search Premier ris files
logical: if the code is used in Ovid ris files
logical: if the code is used in synthesisr imports & exports
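For example, to inspect the lookup table directly:

head(code_lookup) # first few rows of the tag lookup table
str(code_lookup)  # 226 obs. of 12 variables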
Removes duplicates using sensible defaults
deduplicate(data, match_by, method, type = "merge", ...)
data: A data.frame containing bibliographic information.
match_by: Name of the column in data where duplicates should be sought.
method: The duplicate detection function to use; see the string_ or fuzz_ functions for options.
type: How should entries be selected? Default is "merge".
...: Arguments passed to find_duplicates().
This is a wrapper function to find_duplicates() and extract_unique_references(), which tries to choose some sensible defaults. Use with care.

Returns a data.frame containing data identified as unique.
See find_duplicates() and extract_unique_references() for underlying functions.
my_df <- data.frame(
  title = c(
    "EviAtlas: a tool for visualising evidence synthesis databases",
    "revtools: An R package to support article screening for evidence synthesis",
    "An automated approach to identifying search terms for systematic reviews",
    "Reproducible, flexible and high-throughput data extraction from primary literature",
    "eviatlas:tool for visualizing evidence synthesis databases.",
    "REVTOOLS a package to support article-screening for evidence synthsis"
  ),
  year = c("2019", "2019", "2019", "2019", NA, NA),
  authors = c("Haddaway et al", "Westgate", "Grames et al", "Pick et al", NA, NA),
  stringsAsFactors = FALSE
)

# run deduplication
dups <- find_duplicates(
  my_df$title,
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE
)
extract_unique_references(my_df, matches = dups)

# or, in one line:
deduplicate(my_df, "title",
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE
)
Bibliographic data can be stored in a number of different file types, meaning that detecting consistent attributes of those files is necessary if they are to be parsed accurately. These functions attempt to identify some of those key file attributes. Specifically, detect_parser() determines which parse_ function to use; detect_delimiter() and detect_lookup() identify different attributes of RIS files; and detect_year() attempts to fill gaps in publication years from other information stored in a tibble.
detect_parser(x)
detect_delimiter(x)
detect_lookup(tags)
detect_year(df)
x: A character vector containing bibliographic data
tags: A character vector containing RIS tags
df: A data.frame containing bibliographic data
detect_parser() and detect_delimiter() return a length-1 character; detect_year() returns a character vector listing estimated publication years; and detect_lookup() returns a data.frame.
revtools <- c(
  "",
  "PMID- 31355546",
  "VI - 10",
  "IP - 4",
  "DP - 2019 Dec",
  "TI - revtools: An R package to support article screening for evidence synthesis.",
  "PG - 606-614",
  "LID - 10.1002/jrsm.1374 [doi]",
  "AU - Westgate MJ",
  "LA - eng",
  "PT - Journal Article",
  "JT - Research Synthesis Methods",
  ""
)

# detect basic attributes of ris files
detect_parser(revtools)
detect_delimiter(revtools)

# determine which tag format to use
tags <- trimws(unlist(lapply(
  strsplit(revtools, "- "),
  function(a){a[1]}
)))
pubmed_tag_list <- detect_lookup(tags[!is.na(tags)])

# find year data in other columns
df <- as.data.frame(parse_pubmed(revtools))
df$year <- detect_year(df)
Given a list of duplicate entries and a data set, this function extracts only unique references.
extract_unique_references(data, matches, type = "merge")
data: A data.frame containing bibliographic information.
matches: A vector showing which entries in data are duplicates, as returned by find_duplicates().
type: How should entries be selected to retain? Default is "merge".
Returns a data.frame of unique references.

See find_duplicates() and deduplicate().
my_df <- data.frame(
  title = c(
    "EviAtlas: a tool for visualising evidence synthesis databases",
    "revtools: An R package to support article screening for evidence synthesis",
    "An automated approach to identifying search terms for systematic reviews",
    "Reproducible, flexible and high-throughput data extraction from primary literature",
    "eviatlas:tool for visualizing evidence synthesis databases.",
    "REVTOOLS a package to support article-screening for evidence synthsis"
  ),
  year = c("2019", "2019", "2019", "2019", NA, NA),
  authors = c("Haddaway et al", "Westgate", "Grames et al", "Pick et al", NA, NA),
  stringsAsFactors = FALSE
)

# run deduplication
dups <- find_duplicates(
  my_df$title,
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE
)
extract_unique_references(my_df, matches = dups)

# or, in one line:
deduplicate(my_df, "title",
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE
)
Identifies duplicate bibliographic entries using different duplicate detection methods.
find_duplicates(
  data,
  method = "exact",
  group_by,
  threshold,
  to_lower = FALSE,
  rm_punctuation = FALSE
)
data: A character vector containing duplicate bibliographic entries.
method: A string indicating how matching should be calculated. Either "exact" for exact matching (the default), or the name of a string_ or fuzz_ function.
group_by: An optional vector, data.frame or list containing data to use as 'grouping' variables; that is, categories within which duplicates should be sought. Defaults to NULL, in which case all entries are compared against all others. Ignored if method = "exact".
threshold: Numeric: the cutoff threshold for deciding if two strings are duplicates. Sensible values depend on the method chosen.
to_lower: Logical: should all entries be converted to lower case before calculating string distance? Defaults to FALSE.
rm_punctuation: Logical: should punctuation be removed before calculating string distance? Defaults to FALSE.
Returns a vector of duplicate matches, with attributes listing the methods used.
See string_ or fuzz_ for suitable functions to pass to method; extract_unique_references() and deduplicate() for higher-level functions.
my_df <- data.frame(
  title = c(
    "EviAtlas: a tool for visualising evidence synthesis databases",
    "revtools: An R package to support article screening for evidence synthesis",
    "An automated approach to identifying search terms for systematic reviews",
    "Reproducible, flexible and high-throughput data extraction from primary literature",
    "eviatlas:tool for visualizing evidence synthesis databases.",
    "REVTOOLS a package to support article-screening for evidence synthsis"
  ),
  year = c("2019", "2019", "2019", "2019", NA, NA),
  authors = c("Haddaway et al", "Westgate", "Grames et al", "Pick et al", NA, NA),
  stringsAsFactors = FALSE
)

# run deduplication
dups <- find_duplicates(
  my_df$title,
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE
)
extract_unique_references(my_df, matches = dups)

# or, in one line:
deduplicate(my_df, "title",
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE
)
This function takes an object of class data.frame, list, or bibliography and returns a formatted citation.
format_citation(
  data,
  details = TRUE,
  abstract = FALSE,
  add_html = FALSE,
  line_breaks = FALSE,
  ...
)
data: An object of class data.frame, list, or bibliography.
details: Logical: should identifying information such as author names & journal titles be displayed? Defaults to TRUE.
abstract: Logical: should the abstract be shown (if available)? Defaults to FALSE.
add_html: Logical: should the journal title be italicized using html codes? Defaults to FALSE.
line_breaks: Either logical, stating whether line breaks should be added, or numeric stating how many characters should separate consecutive line breaks. Defaults to FALSE.
...: Any other arguments.
Returns a string of length equal to length(data) that contains formatted citations.
roses <- c("@article{haddaway2018,
  title={ROSES RepOrting standards for Systematic Evidence Syntheses:
  pro forma, flow-diagram and descriptive summary of the plan and
  conduct of environmental systematic reviews and systematic maps},
  author={Haddaway, Neal R and Macura, Biljana and Whaley, Paul and Pullin, Andrew S},
  journal={Environmental Evidence},
  volume={7},
  number={1},
  pages={7},
  year={2018},
  publisher={Springer}
}")
tmp <- tempfile()
writeLines(roses, tmp)
citation <- read_ref(tmp)
format_citation(citation)
These functions duplicate the approach of the 'fuzzywuzzy' Python library for calculating string similarity.
fuzzdist(
  a, b,
  method = c("fuzz_m_ratio", "fuzz_partial_ratio",
             "fuzz_token_sort_ratio", "fuzz_token_set_ratio")
)
fuzz_m_ratio(a, b)
fuzz_partial_ratio(a, b)
fuzz_token_sort_ratio(a, b)
fuzz_token_set_ratio(a, b)
a: A character vector of items to match to b.
b: A character vector of items to match to a.
method: The method to use for fuzzy matching.
Returns a score of the same length as b, giving the proportional dissimilarity between a and b.
fuzz_m_ratio() is a measure of the number of letters that match between two strings. It is calculated as one minus two times the number of matched characters, divided by the number of characters in both strings.

fuzz_partial_ratio() calculates the extent to which one string is a subset of the other. If one string is a perfect subset, then this will be zero.

fuzz_token_sort_ratio() sorts the words in both strings into alphabetical order, and checks their similarity using fuzz_m_ratio().

fuzz_token_set_ratio() is similar to fuzz_token_sort_ratio(), but compares both sorted strings to each other, and to a third group made of words common to both strings. It then returns the maximum value of fuzz_m_ratio() from these comparisons.

fuzzdist() is a wrapper function, for compatibility with stringdist.
fuzzdist("On the Origin of Species", "Of the Original Specs", method = "fuzz_m_ratio")
fuzzdist("On the Origin of Species", "Of the Original Specs", method = "fuzz_m_ratio")
Takes two or more data.frames with different column names or different column orders and binds them to a single data.frame. This function is maintained for backwards compatibility, but it is synonymous with dplyr::bind_rows() and will be deprecated in future.
merge_columns(x, y)
x: Either a data.frame or a list of data.frames.
y: A data.frame, optional if x is a list.
Returns a single data.frame with all the input data frames merged.
df_1 <- data.frame(
  title = c(
    "EviAtlas: a tool for visualising evidence synthesis databases",
    "revtools: An R package to support article screening for evidence synthesis"
  ),
  year = c("2019", "2019")
)
df_2 <- data.frame(
  title = c(
    "An automated approach to identifying search terms for systematic reviews",
    "Reproducible, flexible and high-throughput data extraction from primary literature"
  ),
  authors = c("Grames et al", "Pick et al")
)
merge_columns(df_1, df_2)
Re-assign group numbers to text that was classified as duplicated but is unique.
override_duplicates(matches, overrides)
matches: Numeric: a vector of group numbers for texts that indicates duplicates and unique values, as returned by the find_duplicates() function.
overrides: Numeric: a vector of group numbers that are not true duplicates.
Returns the input matches vector with unique group numbers for members of groups that the user overrides.
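A minimal sketch (the group numbers are hypothetical):

# suppose find_duplicates() placed entries 2 and 3 in the same group,
# but manual review shows they are distinct articles
matches <- c(1, 2, 2, 4)
override_duplicates(matches, overrides = 2)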
Text in standard formats - such as imported via base::readLines() - can be parsed using the relevant parse_ function. Use detect_parser() to determine which is the most appropriate parser for your situation. Note that parse_tsv() and parse_csv() are maintained for backwards compatibility only; within read_refs() these have been replaced by vroom::vroom().
parse_bibtex(x)
parse_csv(x)
parse_tsv(x)
parse_pubmed(x)
parse_ris(x, tag_naming = "best_guess")
x: A character vector containing bibliographic information in ris format.
tag_naming: What format are ris tags in? Defaults to "best_guess"; see read_refs() for a list of options.
Returns an object of class bibliography (ris, bib, or pubmed formats) or data.frame (csv or tsv).
eviatlas <- c(
  "TY - JOUR",
  "AU - Haddaway, Neal R.",
  "AU - Feierman, Andrew",
  "AU - Grainger, Matthew J.",
  "AU - Gray, Charles T.",
  "AU - Tanriver-Ayder, Ezgi",
  "AU - Dhaubanjar, Sanita",
  "AU - Westgate, Martin J.",
  "PY - 2019",
  "DA - 2019/06/04",
  "TI - EviAtlas: a tool for visualising evidence synthesis databases",
  "JO - Environmental Evidence",
  "SP - 22",
  "VL - 8",
  "IS - 1",
  "SN - 2047-2382",
  "UR - https://doi.org/10.1186/s13750-019-0167-1",
  "DO - 10.1186/s13750-019-0167-1",
  "ID - Haddaway2019",
  "ER - "
)
detect_parser(eviatlas) # = "parse_ris"
df <- as.data.frame(parse_ris(eviatlas))
ris_out <- write_refs(df, format = "ris", file = FALSE)
Imports common bibliographic reference formats (i.e. .bib, .ris, or .txt).
read_refs(
  filename,
  tag_naming = "best_guess",
  return_df = TRUE,
  verbose = FALSE,
  locale = default_locale()
)
filename: A path to a filename or vector of filenames containing search results to import.
tag_naming: Either a length-1 character stating how ris tags should be replaced (see details for a list of options), or an object inheriting from class data.frame containing custom tag replacements.
return_df: If TRUE (the default), returns a data.frame; if FALSE, returns a list.
verbose: If TRUE, prints status updates; defaults to FALSE.
The default for argument tag_naming is "best_guess", which estimates which database has been used for ris tag replacement, then fills any gaps with generic tags. Any tags missing from the database (i.e. code_lookup) are passed unchanged. Other options are to use tags from Web of Science ("wos"), Scopus ("scopus"), Ovid ("ovid") or Academic Search Premier ("asp"). If a data.frame is given, then it must contain two columns: "code", listing the original tags in the source document, and "field", listing the replacement column/tag names. The data.frame may optionally include a third column named "order", which specifies the order of columns in the resulting data.frame; otherwise this will be taken as the row order (a sketch of such a data.frame follows the example below). Finally, passing "none" to tag_naming suppresses tag replacement.
Returns a data.frame or list of assembled search results.
litsearchr <- c("@article{grames2019,
  title={An automated approach to identifying search terms for systematic
  reviews using keyword co-occurrence networks},
  author={Grames, Eliza M and Stillman, Andrew N and Tingley, Morgan W
  and Elphick, Chris S},
  journal={Methods in Ecology and Evolution},
  volume={10},
  number={10},
  pages={1645--1654},
  year={2019},
  publisher={Wiley Online Library}
}")
tmp <- tempfile()
writeLines(litsearchr, tmp)
df <- read_refs(tmp, return_df = TRUE, verbose = TRUE)
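As described in the details above, a custom replacement table can also be supplied (the file name here is hypothetical):

# a user-defined tag replacement data.frame
my_tags <- data.frame(
  code = c("TI", "AU", "PY"),
  field = c("title", "author", "year")
)
df <- read_refs("my_export.ris", tag_naming = my_tags)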
Allows users to manually review articles classified as duplicates.
review_duplicates(text, matches)
text: A character vector of the text that was used to identify potential duplicates.
matches: Numeric: a vector of group numbers for texts that indicates duplicates and unique values, as returned by the find_duplicates() function.
Returns a data.frame of potential duplicates grouped together.
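A minimal sketch, reusing find_duplicates() to generate the matches vector:

text <- c(
  "EviAtlas: a tool for visualising evidence synthesis databases",
  "eviatlas: tool for visualizing evidence synthesis databases."
)
matches <- find_duplicates(
  text,
  method = "string_osa",
  to_lower = TRUE,
  rm_punctuation = TRUE
)
review_duplicates(text, matches)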
These functions each access a specific "method" argument provided by stringdist, and are provided for convenient calling by find_duplicates(). They do not include any new functionality beyond that given by stringdist, which you should use for your own analyses.
string_osa(a, b)
string_lv(a, b)
string_dl(a, b)
string_hamming(a, b)
string_lcs(a, b)
string_qgram(a, b)
string_cosine(a, b)
string_jaccard(a, b)
string_jw(a, b)
string_soundex(a, b)
a: A character vector of items to match to b.
b: A character vector of items to match to a.
Returns a score of the same length as b, giving the dissimilarity between a and b.
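For example, comparing two near-identical titles (lower scores indicate greater similarity):

string_osa("On the Origin of Species", "On the Origin of Specs")
string_jw("On the Origin of Species", "On the Origin of Specs")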
This function exports data.frames containing bibliographic information to either a .ris or .bib file.
write_refs(x, file, format = "ris", tag_naming = "synthesisr", write = TRUE)
write_bib(x)
write_ris(x, tag_naming = "synthesisr")
x: Either a data.frame containing bibliographic information or an object of class bibliography.
file: Filename to save to.
format: What format should the data be exported as? Options are ris or bib.
tag_naming: What naming convention should be used to write RIS files? See details for options.
write: Logical: should a file be written? If FALSE, returns a character vector instead.
This function is typically called for its side effect of writing a file in the specified location and format. If write is FALSE, it instead returns a character vector containing bibliographic information in the specified format.
eviatlas <- c(
  "TY - JOUR",
  "AU - Haddaway, Neal R.",
  "AU - Feierman, Andrew",
  "AU - Grainger, Matthew J.",
  "AU - Gray, Charles T.",
  "AU - Tanriver-Ayder, Ezgi",
  "AU - Dhaubanjar, Sanita",
  "AU - Westgate, Martin J.",
  "PY - 2019",
  "DA - 2019/06/04",
  "TI - EviAtlas: a tool for visualising evidence synthesis databases",
  "JO - Environmental Evidence",
  "SP - 22",
  "VL - 8",
  "IS - 1",
  "SN - 2047-2382",
  "UR - https://doi.org/10.1186/s13750-019-0167-1",
  "DO - 10.1186/s13750-019-0167-1",
  "ID - Haddaway2019",
  "ER - "
)
detect_parser(eviatlas) # = "parse_ris"
df <- as.data.frame(parse_ris(eviatlas))
ris_out <- write_refs(df, format = "ris", file = FALSE)