To start: this was fun for me. Thanks for this cool problem!
You can use the parse() function to convert a script into an expression. Then you can walk through that expression, collect any functions called, flatten subexpressions, and repeat until there's nothing left to flatten.
We can identify expressions because they have a length: the number of subexpressions and tokens they contain. Tokens are the smallest units of a language. For example, 1 + 2 has three tokens: 1, +, and 2.
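For instance, here's that structure on a tiny piece of code (just an illustration, not part of the final function):

e <- parse(text = "1 + 2")[[1]]  # the first (and only) expression in the "script"
length(e)   # 3
as.list(e)  # `+`, 1, 2 -- the function name comes first
is.call(e)  # TRUE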
get_calls <- function(filepath) {
  code <- parse(file = filepath)
  tokens <- as.list(code)
  calls <- c()
  while (TRUE) {
    any_unpacked <- FALSE
    for (ii in seq_along(tokens)) {
      part <- tokens[[ii]]
      # Calls always have the function name as the first element
      if (is.call(part)) {
        fun_token <- part[[1]]
        calls <- c(calls, deparse(fun_token))
      }
      # Expressions have a length
      if (length(part) > 1) {
        tokens[[ii]] <- as.list(part)
        any_unpacked <- TRUE
      }
    }
    # Flatten the newly unpacked sublists back into one list of parts
    tokens <- unlist(tokens)
    if (!any_unpacked) break
  }
  unique(calls)
}
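To see what the unpacking step does, here's as.list() applied by hand to a nested call (again, just an illustration):

e <- quote(print(sum(1:10)))
as.list(e)  # print, sum(1:10)
# sum(1:10) is still a call, so the next pass unpacks it into sum and 1:10,
# and the pass after that unpacks 1:10 into `:`, 1, and 10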
Here it is run against an example script, ~/example.R:
# ~/example.R
library(dplyr)
iris_plot <- iris %>%
  mutate(id = sample(c(1:10, 99), n(), replace = TRUE)) %>%
  rename_all(tolower) %>%
  rename_all(stringr::str_replace, pattern = ".", replacement = "_")
p <- print
p("Hello, world!")
getFunction("message")("Hello, again!")
The result:
get_calls("~/example.R")
# [1] "library" "<-"
# [3] "p" "getFunction(\"message\")"
# [5] "%>%" "getFunction"
# [7] "rename_all" "mutate"
# [9] "sample" "c"
# [11] "n" ":"
Where the function fails:
- Functions passed as objects (it didn't pick up tolower or stringr::str_replace; see the snippet after this list)
- Functions going by other names (it didn't pick up print, only p)
- Functions retrieved dynamically (it didn't pick up message)
- Probably a bunch of other edge cases
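A quick look at why the first case slips through (an illustration, not a fix):

e <- quote(rename_all(iris, tolower))
is.call(e[[3]])    # FALSE: tolower is a bare symbol here, not a call,
is.symbol(e[[3]])  # TRUE   so get_calls() never deparses it as a function name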
Running the script and looking at what ends up in its environment would only find functions defined there, not the ones it uses. But I do like the idea of running the script to create the rat's nest of environments; then maybe we could pair up parsed expressions with the environments they're run in.
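To show what I mean, a minimal sketch of that environment approach (assuming dplyr and stringr are installed so the example script actually runs):

env <- new.env()
sys.source("~/example.R", envir = env)           # run the script in its own environment
Filter(is.function, mget(ls(env), envir = env))  # only p shows up; print, mutate, etc. don't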
Definitely a lot of ways to approach this.