I have created a corpus of articles in R using the text mining package. I'm having trouble removing or excluding the reference section of all the documents. Does anyone know how to do this please? Thank you in advance!!!
I think this can be done with a regular expression that eliminates all text following the reference section. But it's hard to give further advice without actually looking at your data.
Thank you, I thought so, but I have been struggling to work out how to do this. My data are standard research articles from psychology journals. I downloaded the PDFs and used the pdftools and tm packages to read the documents into R before creating a corpus. I have included the conclusion through to the first reference of one of the PDFs below. I hope this helps!
"Conclusions
The present study utilized network analysis to more precisely characterize associations between ED symptoms and multiple di- mensions of IA. Results support that feeling unsafe in one’s body may be one factor that maintains associations between IA and ED symptoms and could represent an important focus for future re- search. Results further underscore the importance of weight and shape concerns in network models of eating pathology and suggest that targeting desire to lose weight may be helpful in promoting symptom remission across ED diagnoses. Future longitudinal re- search clarifying the nature of body mistrust in EDs will be essential to appropriately inform ED interventions that may target altered IA.
References
American Psychiatric Association. (2013). Diagnostic and statistical man- ual of mental disorders (5th ed.). Washington, DC: Author."
library(tidyverse)
article <- "Conclusions\n\nThe present study utilized network analysis to more precisely characterize associations between ED symptoms and multiple di- mensions of IA. Results support that feeling unsafe in one’s body may be one factor that maintains associations between IA and ED symptoms and could represent an important focus for future re- search. Results further underscore the importance of weight and shape concerns in network models of eating pathology and suggest that targeting desire to lose weight may be helpful in promoting symptom remission across ED diagnoses. Future longitudinal re- search clarifying the nature of body mistrust in EDs will be essential to appropriately inform ED interventions that may target altered IA.\n\n
References\n\nAmerican Psychiatric Association. (2013). Diagnostic and statistical man- ual of mental disorders (5th ed.). Washington, DC: Author."
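str_extract(article, "[\\s\\S]+(?=[:space:]+References)")
The pattern "[\\s\\S]+(?=[:space:]+References)" preserves all characters before References. Because [\\s\\S]+ is greedy, the match runs up to the last occurrence of "References" that follows whitespace. You can apply this rule to multiple articles with mutate() or some sort of loop.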
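Applying the rule to several articles could look something like the sketch below. It assumes each article is held as a single string in a character vector; the helper name strip_references() and the toy texts are made up for illustration. It also falls back to the original text when no "References" heading is found, since str_extract() returns NA in that case, and trims the whitespace left in front of the heading.

```r
library(stringr)

# Hypothetical helper: keep everything before the last "References"
# that is preceded by whitespace.
strip_references <- function(text) {
  body <- str_extract(text, "[\\s\\S]+(?=\\s+References)")
  # str_extract() returns NA when nothing matches; keep the original text then
  body <- ifelse(is.na(body), text, body)
  # drop leading/trailing whitespace, including what preceded the heading
  str_trim(body)
}

# Toy stand-ins for the real articles (assumption: one string per article)
articles <- c(
  "Conclusions\n\nSome findings.\n\nReferences\n\nAuthor, A. (2013). A title.",
  "A short note with no reference list."
)

# str_extract() is vectorised, so no explicit loop is needed here
cleaned <- strip_references(articles)
```

In a data frame with a text column, the same helper should slot into mutate(text = strip_references(text)). For a tm corpus, something like tm_map(corpus, content_transformer(strip_references)) may work, but that is untested here and assumes each document's content is one string rather than one element per page. The pattern is also case-sensitive, so a heading written as "REFERENCES" would need its own pattern.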