How to extract sentences containing citations from scientific PDF articles?

alfonsorre · January 18, 2019, 4:25pm

Hi, I have 100 scientific articles in PDF format. For each PDF, I need to extract all the sentences that contain citations.
For example, if the text is the following:

social media and the broad adoption of the Web introduce feedback, reviews and user comments as consumable web content. Organizations can benefit from analyzing these user inputs to provide better services, refine product designs, improve the user experience, and manage overall organization performance. User input is often presented online and, in the case of Twitter, opinions are expressed in real time or almost in real time with the possibility of reaching a very large audience in a few seconds. Overall, as stated by Poria, Cambria and Gelbukh (2016), "the opportunity to capture the opinion of the general public ... has aroused growing interest both for the scientific community and for the business world". To analyze user input, organizations use the analysis or opinion of feelings mining tools. Sentiment analysis is defined as "the task of finding authors' opinions on specific entities" (Feldman, 2013).Examples of commercial sentiment services that offer applica- tions to process datasets of these sizes include Lexalytics, Con- verseon, and Summize ( Jansen, Zhang, Sobel, & Chowdury, 2009 ).

The output I am interested in is:

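Overall, as stated by Poria, Cambria and Gelbukh (2016), "the opportunity to capture the opinion of the general public ... has aroused growing interest both for the scientific community and for the business world".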

Sentiment analysis is defined as "the task of finding authors' opinions on specific entities" (Feldman, 2013)

Examples of commercial sentiment services that offer applica- tions to process datasets of these sizes include Lexalytics, Con- verseon, and Summize ( Jansen, Zhang, Sobel, & Chowdury, 2009 ).

After that, I need to save the output to a CSV file.

I apologize for any language errors.
Thank you all for your answers.

EconKid · January 18, 2019, 5:34pm

I see what you mean. The pdftools package can help you extract all the text from a PDF document.

Next, I think you need some regular-expression cleanup, such as removing the \n characters, and then str_split to split the text into lines, because the line is the unit in which you tidy the text.
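
As a rough illustration of this cleanup step, here is a minimal sketch. It is written in Python with the standard re module rather than R's str_split, and it assumes the PDF text has already been extracted into one string. The names clean_text and split_sentences and the sample string are made up for the example. It splits on sentence boundaries instead of lines, because the question asks for sentences.

```python
import re


def clean_text(raw):
    """Undo line-break hyphenation, then flatten newlines into single spaces."""
    # "applica-\ntions" -> "applications". Caveat: a genuine hyphenated
    # compound split at a line end (e.g. "real-\ntime") loses its hyphen too.
    joined = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def split_sentences(text):
    """Naive split after '.', '!' or '?' when a capital letter follows.

    The whitespace is optional so that "2013).Examples" is still split.
    Abbreviations such as "U.S." will be split incorrectly.
    Needs Python 3.7+ because the separator may be an empty match.
    """
    return re.split(r"(?<=[.!?])\s*(?=[A-Z])", text)


# Made-up sample that imitates the line breaks of extracted PDF text.
raw = (
    "Sentiment analysis is a broad\nfield (Feldman, 2013).Examples of "
    "services that offer applica-\ntions include Lexalytics. Twitter is fast."
)

sentences = split_sentences(clean_text(raw))
for sentence in sentences:
    print(sentence)
```

A dedicated sentence tokenizer would probably cope better with abbreviations and initials; the regular expression above is only meant to show the idea.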

Finally, use a regular expression to keep only the lines that contain a citation. Those are the three steps I have in mind.
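
The filtering step, plus the CSV export asked for in the question, could look roughly like the sketch below. It is in Python (re and csv from the standard library), not R. The citation pattern is an assumption: it treats any parenthesis containing a four-digit year as an author-year citation, so numeric styles such as [12] are missed and a parenthesised year that is not a citation would still match. The function names, the sample sentences and the file label article_01.pdf are hypothetical.

```python
import csv
import io
import re

# Assumed pattern for author-year citations: a parenthesis that contains a
# four-digit year, e.g. "(2016)", "(Feldman, 2013)" or "( Jansen, ..., 2009 )".
CITATION = re.compile(r"\([^()]*\b(?:19|20)\d{2}[a-z]?\b[^()]*\)")


def citation_sentences(sentences):
    """Keep only the sentences that contain an author-year citation."""
    return [s for s in sentences if CITATION.search(s)]


def write_csv(sentences, handle, source=""):
    """Write one row per sentence to an open text file or buffer."""
    writer = csv.writer(handle)
    writer.writerow(["source", "sentence"])
    for sentence in sentences:
        writer.writerow([source, sentence])


# Shortened sample sentences based on the text in the question.
sentences = [
    "User input is often presented online.",
    'Sentiment analysis is defined as "the task of finding authors\' '
    'opinions on specific entities" (Feldman, 2013).',
    "Examples include Lexalytics, Converseon, and Summize "
    "( Jansen, Zhang, Sobel, & Chowdury, 2009 ).",
]
hits = citation_sentences(sentences)

# The buffer stands in for open("citations.csv", "w", newline="").
buffer = io.StringIO()
write_csv(hits, buffer, source="article_01.pdf")  # hypothetical file label
print(buffer.getvalue())
```

For the 100 articles, the same two functions could be called once per file, with the file name passed as source, so that every row records which PDF the sentence came from.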
