Reference analysis (Systematic Review)

Hi, everybody. I copied all the references from the articles in a systematic literature review I am conducting. My question is: is there some way to identify which articles (or citations) appear most frequently? I also saw a script that someone said she used to generate graphs of this. Is it possible?
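
For the raw frequency question, a minimal sketch (assuming the copied references are saved one per line in a hypothetical references.txt; that file name is an assumption, not something from the script below):

refs <- readLines("references.txt", encoding = "UTF-8")  # one reference per line (hypothetical file)
refs <- trimws(tolower(refs))                            # normalize case and stray whitespace
counts <- sort(table(refs), decreasing = TRUE)           # count identical reference strings
head(counts, 10)                                         # ten most frequent references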

The person's script [as an R beginner, I didn't understand whether I need to convert the PDF articles to txt; see the PDF-to-txt sketch under Step 1 below]:

# RCitation - Quick Citation Network 
# Fall 2018 
# A.R. Siders (siders@alumni.stanford.edu)

# Creates a network of the citations among a set of academic papers. 
# Rationale: If full title of Article 2 is present in text of Article 1, Article 1 cites Article 2. 
# NOTE: Will only work in fields where full, unabbreviated titles are used in reference/bibliography citation format. 
# NOTE: Will have high error rate if titles are very short or comprised of common words (e.g., paper "Vulnerability" produced many false positives). Some errors result from authors using a shortened version of a title (e.g., only text before a colon) or incorrect citations or typos. Citation networks produced are therefore approximate and to be used primarily for exploration of the data.
# NOTE: Error rate may be reduced by using only reference sections of the articles of interest, rather than full texts, but this will increase work required to prepare articles. 


# ==> FIVE STEPS TO CITATION NETWORK

# STEP 1. FORMAT INPUT
# a. Papers: Folder of papers in txt format (UTF-8) organized *in SAME ORDER* as Titles 
# b. Titles: Column of paper titles in csv spreadsheet (Column #1) *in SAME ORDER* as documents in Papers folder. The column needs a header cell, or the top title will be lost when read.
# Recommend naming all texts in Papers folder using author last name listed alphabetically. Organize Titles using same order.
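
# Optional sketch for getting papers into txt (assumption: your papers start as
# PDFs in a hypothetical "PDFs" folder; uses the pdftools package, which is not
# part of the original script):
# install.packages("pdftools")
library(pdftools)
pdf.files <- list.files("PDFs", pattern = "\\.pdf$", full.names = TRUE)
for (f in pdf.files) {
  txt <- paste(pdf_text(f), collapse = " ")  # one string per PDF
  writeLines(txt, file.path("Papers", sub("\\.pdf$", ".txt", basename(f))))
}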


# STEP 2.  PREP
# set working directory
setwd("C:\[name of working space]") # make sure \ not / in name
setwd("C:/Users/User/OneDrive/Adaptive Capacity Text Mining/Citation Network Test/CitationNetwork Test Data")
# load packages
install.packages(c("tm","plyr"))
library(tm)
library(plyr)


# STEP 3. LOAD INPUTS
# a. Papers 
papers<-Corpus(DirSource("[name of folder where papers located]"))
papers<-Corpus(DirSource("Papers"))
# b. Titles
titletable<-read.csv("[name of titles file].csv") #make sure column has a header
titletable<-read.csv("TestTitles.csv")
titles<-as.vector(titletable[,1])
# load functions at bottom of this script (below Step 5)

# sanity check: both lengths should match (and be in the same order)
length(papers)
length(titles)
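
# optional guard (not in the original script): stop here if the counts differ
stopifnot(length(papers) == length(titles))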

# STEP 4. RUN FUNCTION 

CitationNetwork<-CreateCitationNetwork(papers,titles)
# add date
currentDate <- Sys.Date()
csvFileName <- paste("CitationEdges",currentDate,".csv",sep="")
# save results
write.csv(CitationNetwork, file=csvFileName) 
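
# Optional (not in the original script): the edges table itself answers the
# "most frequent articles" question; each row is one citation, so counting
# Target values gives the most-cited papers within the set.
citation.counts <- sort(table(CitationNetwork$Target), decreasing = TRUE)
head(citation.counts, 10)   # ten most-cited titles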

  
# STEP 5. VISUALIZE NETWORK

# Install Gephi or other network visualization software and load CitationEdges.csv 
# Load list of titles or other spreadsheet as nodes to visualize network 
# Gephi available at https://gephi.org/
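
# Optional sketch (assumption: you want to stay in R instead of installing
# Gephi; uses the igraph package, which is not part of the original script):
# install.packages("igraph")
library(igraph)
g <- graph_from_data_frame(CitationNetwork[, c("Source","Target")], directed = TRUE)
sort(degree(g, mode = "in"), decreasing = TRUE)  # in-degree = times cited within the set
plot(g, vertex.label.cex = 0.6, edge.arrow.size = 0.3)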


# ===> FUNCTIONS TO LOAD 

CreateCitationNetwork<-function(papers,titles){
  # prep papers corpus
  papers<-tm_map(papers, content_transformer(tolower))
  papers<-tm_map(papers, removePunctuation)
  papers<-tm_map(papers, removeNumbers)
  papers<-tm_map(papers, stripWhitespace)
  # prep titles with the same transformations applied to the papers
  titles<-tolower(titles)
  titles<-removePunctuation(titles)
  titles<-removeNumbers(titles)  # papers had numbers removed, so titles must match
  titles<-stripWhitespace(titles)
  # create citation true/false matrix
  Cites.TF<-CiteMatrix(titles, papers)
  # format matrix into edges file 
  CitationEdges<-EdgesFormat(Cites.TF, titles)
  return(CitationEdges)
}  

# format true/false matrix into edges file (Gephi columns: Source, Target, Weight, Type)
EdgesFormat<-function(Cites.TF, titles){
  # create an empty data frame to collect edges
  edges<-data.frame(Source=character(0), Target=character(0),
                    Weight=numeric(0), Type=character(0),
                    stringsAsFactors=FALSE)
  for (i in 1:nrow(Cites.TF)){
    # for each document, run through all titles across columns
    for (j in 1:ncol(Cites.TF)){
      # for each title, see if document [row] cited that title [column];
      # skip i == j, because every paper contains its own title
      if (i!=j && Cites.TF[i,j]==TRUE){
        temp<-data.frame(Source=titles[i],  # document doing the citing
                         Target=titles[j],  # document being cited
                         Weight=1,          # the yes/no [weight]
                         Type="Directed",   # edge direction flag for Gephi
                         stringsAsFactors=FALSE)
        edges<-rbind(edges,temp)
      }
    }
  }
  return(edges)
}
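
# Illustrative shape of the returned edges table (made-up titles, for
# orientation only; not real output):
#   Source                Target                  Weight  Type
#   adaptation pathways   coastal vulnerability   1       Directed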

# Citation true/false matrix 
CiteMatrix<-function(search.vector, Ref.corpus){
  # Creates a True/False matrix: does paper [row] contain title [column]?
  citations<-data.frame(matrix(NA, nrow = length(Ref.corpus), ncol=length(search.vector)))
  # Columns are the document being cited
  colnames(citations)<-search.vector
  # Rows are the document doing the citing (papers are in the same order as titles)
  rownames(citations)<-search.vector
  # collapse each paper to a single string so titles broken across lines still match
  paper.texts<-sapply(Ref.corpus, function(doc) paste(content(doc), collapse=" "))
  for (i in 1:length(search.vector)){
    searchi<-search.vector[i]
    # fixed=TRUE searches for the literal title, not a regular expression
    citations[,i]<-grepl(searchi, paper.texts, fixed=TRUE)
  }
  return(citations)
}
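
# Quick self-test sketch (toy data, hypothetical; not from the original post):
# toy.titles <- c("first toy paper", "second toy paper")
# toy.papers <- Corpus(VectorSource(c("this text cites second toy paper", "no match here")))
# CiteMatrix(toy.titles, toy.papers)   # expect TRUE at row 1, column 2 only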
