I have to do a network analysis starting from a collection of approximately 90 different CSV files; each file refers to an institute, and every row represents a scientific article published by that institute.
It's organized like this (simplified):
Article.Code | Authors | Pages | Research.Categories | Year | Citations | etc…
The aim is to study the degree of collaboration among institutes based on the number of articles published together. Since every article has a unique identification code, finding two rows in two different files with the same Article.Code means that the two institutes collaborated on the publication and the studies behind that article.
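Just to make that core idea concrete, here is a minimal sketch for only two institutes, using dplyr/readr and hypothetical file names:

```r
library(readr)
library(dplyr)

a <- read_csv("InstituteA.csv")   # hypothetical file names
b <- read_csv("InstituteB.csv")

# Rows whose Article.Code appears in both files are joint publications
shared <- inner_join(a, b, by = "Article.Code", suffix = c(".A", ".B"))
nrow(shared)   # number of articles the two institutes published together
```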
Since this is my first time ever using R, I've run into several problems trying to achieve:
- A table that contains the number of articles published in common between each pair of institutes by year, with the total number of articles broken down by Research Area (see the last sketch at the end of this post):
InstituteName1 | InstituteName2 | Year | #ArticlesInCommon | ResearchArea1 | ResearchArea2 | …
- Another issue I've encountered is how to define the Research Area (5 in total) of an article, since every article can have a combination of different research categories (here's a link for more context: Web of Science Core Collection Help). One possible mapping is sketched right after this list.
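Since the category-to-area mapping is something only you can define, here is just one possible sketch. It assumes Research.Categories is a single string with categories separated by "; ", that the lookup table below (whose rows are purely illustrative) is filled in by hand, and that all files have already been stacked into one data frame articles (the next sketch shows that step):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Illustrative category -> area lookup; fill in one row per WoS category
area_lookup <- tribble(
  ~Category,          ~ResearchArea,
  "Physics, Applied", "Physical Sciences",
  "Biochemistry",     "Life Sciences & Biomedicine",
  "Economics",        "Social Sciences"
)

article_areas <- articles %>%
  separate_rows(Research.Categories, sep = ";\\s*") %>%   # one category per row
  left_join(area_lookup, by = c("Research.Categories" = "Category")) %>%
  distinct(Article.Code, ResearchArea)   # an article can belong to several areas
```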
I'm pretty sure I have to add a new column to each file and fill it with the name of the institute. The final goal is to have a graph of the network of collaborations between institutes and then analyze it; I've already seen that R offers packages that allow this.
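That step doesn't have to be done by hand, file by file. A sketch, assuming every CSV sits in a folder called data/ and is named after its institute (both assumptions on my part):

```r
library(readr)
library(dplyr)
library(purrr)

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

articles <- files %>%
  set_names() %>%                            # use each path as the element's name
  map_dfr(read_csv, .id = "Institute") %>%   # stack all files, keeping the source
  mutate(Institute = tools::file_path_sans_ext(basename(Institute)))
```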
Since there are approximately 90 institutes, if I want to analyze every single pairwise collaboration, I'd have to process (and repeat the steps in R for) about 4,000 connections:
C(90,2) = 90!/((90-2)!·2!) = (90·89)/2 = 4005
If I spend just 3 minutes per pair doing the steps and processing the data in R, I'll spend about 200 hours! I'm sure there exists some way to do it more efficiently and faster :')
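There is: with everything in one data frame, a single self-join on Article.Code produces every collaborating pair at once, so there is no need to loop over the 4005 combinations. A sketch, assuming the articles data frame from above (and, for the Research Area columns, the article_areas table from the earlier sketch):

```r
library(dplyr)

inst_articles <- distinct(articles, Article.Code, Institute, Year)

collab <- inst_articles %>%
  inner_join(distinct(articles, Article.Code, Institute),
             by = "Article.Code", suffix = c("1", "2")) %>%
  filter(Institute1 < Institute2) %>%   # keep each unordered pair once, drop self-pairs
  count(Institute1, Institute2, Year, name = "ArticlesInCommon")

# A weighted, undirected collaboration network for the analysis step
library(igraph)
edges <- count(collab, Institute1, Institute2,
               wt = ArticlesInCommon, name = "weight")
g <- graph_from_data_frame(edges, directed = FALSE)
```

To get the per-Research-Area columns of the target table, you could join article_areas onto inst_articles before counting and then spread the areas into columns with tidyr::pivot_wider().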