Count co-publications in a large dataset

I have a dataframe with more than 1 million rows.
The first column contains the article IDs, the second the universities, and the third the scientific fields of the article. For example:

I want to know the number of collaborations (co-authorships) each university pair has in each scientific field. For example, how many co-authored articles do Utrecht University and Ziekenhuis Groep Twente have in the scientific field of circular society?

In short, I want to create a new dataframe with 4 columns:
ScientificField, University_a, University_b and n_collaborations

Thanks!

How do you know if an article was co-authored?

Here is some code that I think does what you want with the data set you provided. I don't know how well it will handle a much larger data set.

library(dplyr)
library(purrr)

df <- data.frame(
  ArticleID = c("000-071-424-063-821", "000-071-424-063-821", "000-071-424-063-821", "000-071-424-063-821",
                "000-071-424-063-821", "000-071-424-063-821", "000-071-424-063-821", "000-071-424-063-821",
                "000-187-478-700-827", "000-187-478-700-827", "000-187-478-700-827",
                "000-261-613-036-686", "000-261-613-036-686", "000-261-613-036-686", "000-261-613-036-686"),
  University = c("Radboud University Nijmegen", "Radboud University Nijmegen", "Maastricht University", "Maastricht University",
                 "Elisabeth-TweeSteden Ziekenhuis", "Elisabeth-TweeSteden Ziekenhuis", "Ziekenhuis Groep Twente",
                 "Ziekenhuis Groep Twente", "Utrecht University", "Utrecht University", "Utrecht University",
                 "University of Groningen", "University of Groningen", "University of Groningen", "University of Groningen"),
  ScientificField = c("Urology", "preventive health", "Urology", "preventive health", "Urology",
                      "preventive health", "Urology", "preventive health", "Management, Monitoring, Policy and Law",
                      "Geography, Planning and Development", "circular society", "Biomaterials", "Ceramics and Composites",
                      "Metals and Alloys", "Biomedical Engineering")
)
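
# Self-join on article and field. The inequality keeps each pair of universities
# once per article and field (University.x sorts after University.y).
Collab <- inner_join(df, df, by = join_by("ArticleID", "ScientificField", "University" > "University"))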

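# For one university, count its collaborations with every other university, per field.
CountFunc <- function(Uni) {
  tmp <- Collab |> filter(University.x == Uni | University.y == Uni) |>
    # Uni_b is whichever side of the pair is not Uni
    mutate(Uni_b = ifelse(University.x == Uni, University.y, University.x)) |>
    group_by(ScientificField, Uni_b) |>
    summarize(n_collab = n(), .groups = "drop") |>
    mutate(Uni_b = as.character(Uni_b)) # avoids an error if there are no rows
  tmp$Uni_a <- Uni
  return(tmp)
}

# Run the count for every university and stack the results.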
Univs <- unique(df$University)
AllCounts <- map(Univs, CountFunc) |> list_rbind()
AllCounts
#> # A tibble: 24 × 4
#>    ScientificField   Uni_b                           n_collab Uni_a             
#>    <chr>             <chr>                              <int> <chr>             
#>  1 Urology           Elisabeth-TweeSteden Ziekenhuis        1 Radboud Universit…
#>  2 Urology           Maastricht University                  1 Radboud Universit…
#>  3 Urology           Ziekenhuis Groep Twente                1 Radboud Universit…
#>  4 preventive health Elisabeth-TweeSteden Ziekenhuis        1 Radboud Universit…
#>  5 preventive health Maastricht University                  1 Radboud Universit…
#>  6 preventive health Ziekenhuis Groep Twente                1 Radboud Universit…
#>  7 Urology           Elisabeth-TweeSteden Ziekenhuis        1 Maastricht Univer…
#>  8 Urology           Radboud University Nijmegen            1 Maastricht Univer…
#>  9 Urology           Ziekenhuis Groep Twente                1 Maastricht Univer…
#> 10 preventive health Elisabeth-TweeSteden Ziekenhuis        1 Maastricht Univer…
#> # ℹ 14 more rows

Created on 2023-10-18 with reprex v2.0.2
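
A possible shortcut, in case the per-university loop becomes slow or the doubled rows are unwanted: the self-join already holds each unordered pair once per article and field, so it can be counted directly. The sketch below is not part of the reprex above. It uses a small made-up data frame (`toy`, with hypothetical universities "Uni A" to "Uni C"), and it assumes that repeated article/university/field rows should count only once, which is why it calls `distinct()` first. Drop that step if repeats are meaningful in your data.

```r
library(dplyr)

# Hypothetical toy data, not the poster's data: one row per
# article x university x field, with one deliberately repeated row.
toy <- data.frame(
  ArticleID       = c("a1", "a1", "a1", "a2", "a2", "a2", "a3"),
  University      = c("Uni A", "Uni B", "Uni B", "Uni A", "Uni B", "Uni C", "Uni A"),
  ScientificField = c("Urology", "Urology", "Urology", "Urology", "Urology",
                      "Urology", "Biomaterials")
)

# Assumption: a repeated article/university/field row is the same
# collaboration, so keep each combination once.
toy_unique <- distinct(toy, ArticleID, University, ScientificField)

# Self-join on article and field; the inequality keeps each unordered
# pair of universities once (University.x sorts after University.y).
pairs <- inner_join(
  toy_unique, toy_unique,
  by = join_by(ArticleID, ScientificField, University > University)
)

# One row per field and pair, with the number of shared articles.
pair_counts <- pairs |>
  rename(University_a = University.y, University_b = University.x) |>
  count(ScientificField, University_a, University_b, name = "n_collaborations")

pair_counts
```

The result has one row per pair and field with the four columns asked for (ScientificField, University_a, University_b, n_collaborations). University_a is always the name that sorts first, so each pair appears once rather than in both directions.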

Thank you so much!
It worked perfectly!

How long did it take on the million-row data set?

Less than 1 minute. Very quick.
