Strava Data - R Views Submission

duringju211 · September 20, 2021, 3:38pm

Category: Other
Repo: GitHub - duju211/pin_strava

I am an avid runner and cyclist. For the past couple of years, I have been
recording almost all my activities with some kind of GPS device.

I record my runs with a Garmin device and my bike rides with a Wahoo
device. Both accounts get synchronized with my Strava account. I figured
that it would be nice to directly access my data from my Strava account.

In the following text, I describe the process of getting the data into
R.

In this analysis, the following packages are used:

library(tarchetypes)
library(conflicted)
library(tidyverse)
library(lubridate)
library(jsonlite)
library(targets)
library(httr)
library(pins)
library(fs)

conflict_prefer("filter", "dplyr")

Data

The whole data pipeline is implemented with the help of the targets
package. The targets documentation explains the package and its
functionality in more detail.

Target Plan

The manifest of the target plan looks like this:

name	command	pattern	cue_mode
my_app	define_strava_app()	NA	thorough
my_endpoint	define_strava_endpoint()	NA	thorough
act_col_types	list(moving = col_logical(), velocity_smooth = col_number(), grade_smooth = col_number(), distance = col_number(), altitude = col_number(), heartrate = col_integer(), time = col_integer(), lat = col_number(), lng = col_number(), cadence = col_integer(), watts = col_integer())	NA	thorough
my_sig	define_strava_sig(my_endpoint, my_app)	NA	always
df_act_raw	read_all_activities(my_sig)	NA	thorough
df_act	pre_process_act(df_act_raw, athlete_id)	NA	thorough
act_ids	pull(distinct(df_act, id))	NA	thorough
df_meas	read_activity_stream(act_ids, my_sig)	map(act_ids)	never
df_meas_all	bind_rows(df_meas)	NA	thorough
df_meas_wide	meas_wide(df_meas_all)	NA	thorough
df_meas_pro	meas_pro(df_meas_wide)	NA	thorough
gg_meas	vis_meas(df_meas_pro)	NA	thorough
gg_meas_save	save_gg_meas(gg_meas)	NA	thorough

The most important targets of the plan are described in detail in the
following subsections.
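
As a rough sketch, a few rows of this manifest could be written as tar_target() calls in a _targets.R file as shown below. The target names and commands come from the manifest; the layout of the actual file in the repository is an assumption. tar_target() only records the command, so nothing is requested from Strava when the plan is defined.

```r
library(targets)

# Sketch only: three of the manifest rows expressed as tar_target() calls.
# The commands are stored unevaluated, so no Strava request is made here.
plan <- list(
  tar_target(my_app, define_strava_app()),
  # cue mode "always": re-run this target on every tar_make()
  tar_target(
    my_sig, define_strava_sig(my_endpoint, my_app),
    cue = tar_cue(mode = "always")),
  # dynamic branching: one branch per activity id; with cue mode "never"
  # a branch is skipped once it has been built successfully
  tar_target(
    df_meas, read_activity_stream(act_ids, my_sig),
    pattern = map(act_ids),
    cue = tar_cue(mode = "never"))
)
```

In a real _targets.R file the list would be the last expression of the script, so that tar_make() can pick it up.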

OAuth Dance from R

To get access to your Strava data from R, you have to create a Strava
API application. How to do this is described in the Strava developer
documentation.

The Strava API requires a so-called OAuth dance. How this can be done
from within R is described in the following section.

Create an OAuth Strava app:

name	command	pattern	cue_mode
my_app	define_strava_app()	NA	thorough

define_strava_app <- function() {
  oauth_app(
    appname = "r_api",
    key = Sys.getenv("STRAVA_KEY"),
    secret = Sys.getenv("STRAVA_SECRET"))
}

You can find your STRAVA_KEY and STRAVA_SECRET values in the
Strava API settings after you have created your own personal API
application. The name of the application is chosen during creation. In my
case I named it r_api.
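
Because Sys.getenv() silently returns an empty string for an unset variable, it can help to check the credentials before starting the OAuth dance. The helper below is a hypothetical addition, not part of the original pipeline; only the two variable names are taken from the code above.

```r
# Hypothetical helper (not part of the original pipeline): stop early
# if the Strava credentials are not available as environment variables.
check_strava_env <- function(vars = c("STRAVA_KEY", "STRAVA_SECRET")) {
  values <- Sys.getenv(vars, unset = "")
  unset_vars <- vars[values == ""]
  if (length(unset_vars) > 0) {
    stop("Missing environment variable(s): ",
         paste(unset_vars, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}
```

One common way to provide the values is a line per variable in the user's .Renviron file, which R reads at startup.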

Define an endpoint:

name	command	pattern	cue_mode
my_endpoint	define_strava_endpoint()	NA	thorough

define_strava_endpoint <- function() {
  oauth_endpoint(
    request = NULL,
    authorize = "https://www.strava.com/oauth/authorize",
    access = "https://www.strava.com/oauth/token")
}

The authorize argument is the authorization URL, and the access
argument is the URL used to exchange the authorization code for an
access token.
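
To illustrate what the authorization URL is used for, the sketch below builds the kind of URL the browser is sent to during the OAuth dance. The query parameter names follow the usual OAuth 2.0 authorization code flow; the client id and redirect URI are made-up placeholders. httr assembles the real URL internally when the token is requested, so this is only an approximation.

```r
library(httr)

# Illustration only: assemble an authorization URL by hand.
# client_id and redirect_uri are made-up placeholder values.
auth_url <- modify_url(
  "https://www.strava.com/oauth/authorize",
  query = list(
    client_id = "12345",
    response_type = "code",
    redirect_uri = "http://localhost:1410/",
    scope = "activity:read_all"))

auth_url
```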

The final step is authentication. Before the following steps can be
executed, the user has to authorize the app in the web browser.

name	command	pattern	cue_mode
my_sig	define_strava_sig(my_endpoint, my_app)	NA	always

define_strava_sig <- function(endpoint, app) {
  oauth2.0_token(
    endpoint, app,
    scope = "activity:read_all,activity:read,profile:read_all",
    type = NULL, use_oob = FALSE, as_header = FALSE,
    use_basic_auth = FALSE, cache = FALSE)
}

The information in my_sig can now be used to access Strava data. Set
the cue_mode of the target to ‘always’ so that the user has to
authenticate on every run and all following API calls are executed with
an up-to-date authorization token.

Activities

We are now authenticated and can directly access Strava data. First,
load an overview table of all available activities. Because the total
number of activities is unknown, use a while loop that requests one page
at a time and stops as soon as a page contains no more activities.

name	command	pattern	cue_mode
df_act_raw	read_all_activities(my_sig)	NA	thorough

read_all_activities <- function(sig) {
  activities_url <- parse_url(
    "https://www.strava.com/api/v3/athlete/activities")

  act_vec <- vector(mode = "list")
  # Dummy one-row tibble so that the loop body runs at least once
  df_act <- tibble::tibble(init = "init")
  i <- 1L

  # Request page after page until an empty page comes back
  while (nrow(df_act) != 0) {
    r <- activities_url %>%
      modify_url(
        query = list(
          access_token = sig$credentials$access_token[[1]],
          page = i)) %>%
      GET()

    # Fail early on HTTP errors instead of parsing an error body
    stop_for_status(r)

    df_act <- content(r, as = "text") %>%
      fromJSON(flatten = TRUE) %>%
      as_tibble()
    if (nrow(df_act) != 0)
      act_vec[[i]] <- df_act
    i <- i + 1L
  }

  # Combine all pages and parse the start date
  act_vec %>%
    bind_rows() %>%
    mutate(start_date = ymd_hms(start_date))
}

The resulting data frame consists of one row per activity:

## # A tibble: 592 x 60
##    resource_state name  distance moving_time elapsed_time total_elevation~ type 
##             <int> <chr>    <dbl>       <int>        <int>            <dbl> <chr>
##  1              2 "Wee~  55641          9037        10610             923. Ride 
##  2              2 "Bre~  43892.         5594         5721             606  Ride 
##  3              2 "Rad~  18244.         3454        35536             300. Ride 
##  4              2 "Sta~  28529.         4214        11105             370  Ride 
##  5              2 "Abe~     26.5           3            3               0  Ride 
##  6              2 "Men~  32175.         9077        16895             996  Ride 
##  7              2 "Boz~  29377.         5411        12836             310  Ride 
##  8              2 "Mit~   8328          1173         1259              13  Ride 
##  9              2 "Fah~  36886.         5578        11141             468  Ride 
## 10              2 "Mor~   6552.         2693         2823             104. Run  
## # ... with 582 more rows, and 53 more variables: workout_type <int>, id <dbl>,
## #   external_id <chr>, upload_id <dbl>, start_date <dttm>,
## #   start_date_local <chr>, timezone <chr>, utc_offset <dbl>,
## #   start_latlng <list>, end_latlng <list>, location_city <lgl>,
## #   location_state <lgl>, location_country <chr>, start_latitude <dbl>,
## #   start_longitude <dbl>, achievement_count <int>, kudos_count <int>,
## #   comment_count <int>, athlete_count <int>, photo_count <int>, ...
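
The paging logic can be tried out without a Strava account. The sketch below is a variation of the loop in read_all_activities() that uses repeat and a stand-in for the HTTP request; fetch_page, fake_fetch and fake_activities are made-up names and data for illustration only.

```r
# Sketch of the paging pattern with a stand-in for the HTTP request.
# `fetch_page` is a hypothetical function that returns a data frame for a
# page number and a zero-row data frame once all pages are exhausted.
read_all_pages <- function(fetch_page) {
  pages <- list()
  i <- 1L
  repeat {
    page <- fetch_page(i)
    if (nrow(page) == 0) break  # no more activities to read
    pages[[i]] <- page
    i <- i + 1L
  }
  do.call(rbind, pages)
}

# Fake API: five made-up activities served in pages of two
fake_activities <- data.frame(
  id = 1:5,
  type = c("Ride", "Run", "Ride", "Ride", "Run"))

fake_fetch <- function(page, per_page = 2L) {
  start <- (page - 1L) * per_page + 1L
  if (start > nrow(fake_activities)) return(fake_activities[0, ])
  end <- min(start + per_page - 1L, nrow(fake_activities))
  fake_activities[start:end, ]
}

all_activities <- read_all_pages(fake_fetch)
nrow(all_activities)
```

The loop only learns that it is done by requesting one page too many, which is why the function above always makes one request more than there are non-empty pages.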

Preprocess activities. Make sure that all id columns are represented as
characters and improve the column names:

name	command	pattern	cue_mode
df_act	pre_process_act(df_act_raw, athlete_id)	NA	thorough

pre_process_act <- function(df_act_raw, athlete_id) {
  df_act_raw %>%
    mutate(
      # Store every column whose name contains "id" as character
      across(contains("id"), as.character),
      `athlete.id` = athlete_id)
}

Extract all ids of the activities:

name	command	pattern	cue_mode
act_ids	pull(distinct(df_act, id))	NA	thorough

Measurements

Read the ‘stream’ data from Strava. A ‘stream’ is a nested list (json
format) with all available measurements of the corresponding activity.

To get all available variables and turn the result into a data frame,
define a helper function. This function takes an id of an activity and
an authentication token, which we have created earlier.

name	command	pattern	cue_mode
df_meas	read_activity_stream(act_ids, my_sig)	map(act_ids)	never

read_activity_stream <- function(id, sig) {
  act_url <- parse_url(stringr::str_glue(
    "https://www.strava.com/api/v3/activities/{id}/streams"))
  access_token <- sig$credentials$access_token[[1]]

  # Comma-separated list of stream types, built without line breaks
  # or spaces inside the string
  keys <- paste(
    c("distance", "time", "latlng", "altitude", "velocity_smooth",
      "heartrate", "cadence", "watts", "temp", "moving", "grade_smooth"),
    collapse = ",")

  r <- modify_url(
    act_url,
    query = list(access_token = access_token, keys = keys)) %>%
    GET()

  # Fail early on HTTP errors
  stop_for_status(r)

  # One row per stream type, tagged with the activity id
  fromJSON(content(r, as = "text"), flatten = TRUE) %>%
    as_tibble() %>%
    mutate(id = id)
}

The target is defined with dynamic branching, which maps over all
activity ids. Set the cue mode to never so that each branch runs
exactly once and already downloaded activities are not requested again.
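
To make the nested structure of a stream more concrete, the sketch below parses a small made-up JSON string with the same fields that appear in the output further down. The values are invented; only the field names are taken from this post.

```r
library(jsonlite)

# Made-up response in the shape of a streams reply: one object per
# stream type, each with its own "data" array.
json <- '[
  {"type": "time", "data": [0, 1, 2],
   "series_type": "distance", "original_size": 3, "resolution": "high"},
  {"type": "latlng", "data": [[48.1, 9.0], [48.2, 9.1], [48.3, 9.2]],
   "series_type": "distance", "original_size": 3, "resolution": "high"}
]'

streams <- fromJSON(json, flatten = TRUE)

streams$type            # one row per stream type
streams$data[[1]]       # plain vector for "time"
dim(streams$data[[2]])  # two-column matrix for "latlng"
```

The data column is a list column because its elements differ in shape, which is why the measurements have to be reshaped and unnested in later steps.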

Bind the single targets into one data frame:

name	command	pattern	cue_mode
df_meas_all	bind_rows(df_meas)	NA	thorough

The data now has one row per measurement series:

## # A tibble: 4,724 x 6
##    type            data              series_type original_size resolution id    
##    <chr>           <list>            <chr>               <int> <chr>      <chr> 
##  1 moving          <lgl [9,045]>     distance             9045 high       59753~
##  2 latlng          <dbl [9,045 x 2]> distance             9045 high       59753~
##  3 velocity_smooth <dbl [9,045]>     distance             9045 high       59753~
##  4 grade_smooth    <dbl [9,045]>     distance             9045 high       59753~
##  5 distance        <dbl [9,045]>     distance             9045 high       59753~
##  6 altitude        <dbl [9,045]>     distance             9045 high       59753~
##  7 time            <int [9,045]>     distance             9045 high       59753~
##  8 moving          <lgl [5,607]>     distance             5607 high       59472~
##  9 latlng          <dbl [5,607 x 2]> distance             5607 high       59472~
## 10 velocity_smooth <dbl [5,607]>     distance             5607 high       59472~
## # ... with 4,714 more rows

Turn the data into a wide format so that every activity is one row
again:

name	command	pattern	cue_mode
df_meas_wide	meas_wide(df_meas_all)	NA	thorough

meas_wide <- function(df_meas) {
  pivot_wider(df_meas, names_from = type, values_from = data)
}

## # A tibble: 592 x 14
##    series_type original_size resolution id         moving        latlng velocity_smooth
##    <chr>               <int> <chr>      <chr>      <list>        <list> <list>         
##  1 distance             9045 high       5975328478 <lgl [9,045]> <dbl ~ <dbl [9,045]>  
##  2 distance             5607 high       5947271836 <lgl [5,607]> <dbl ~ <dbl [5,607]>  
##  3 distance             3460 high       5944515311 <lgl [3,460]> <dbl ~ <dbl [3,460]>  
##  4 distance             4234 high       5936333308 <lgl [4,234]> <dbl ~ <dbl [4,234]>  
##  5 distance                4 high       5936332751 <lgl [4]>     <dbl ~ <dbl [4]>      
##  6 distance             9151 high       5937299994 <lgl [9,151]> <dbl ~ <dbl [9,151]>  
##  7 distance             5551 high       5882522249 <lgl [5,551]> <dbl ~ <dbl [5,551]>  
##  8 distance             1211 high       5882541097 <lgl [1,211]> <dbl ~ <dbl [1,211]>  
##  9 distance             2186 high       5852593500 <lgl [2,186]> <dbl ~ <dbl [2,186]>  
## 10 distance             2733 high       5843110297 <lgl [2,733]> <dbl ~ <dbl [2,733]>  
## # ... with 582 more rows, and 7 more variables: grade_smooth <list>,
## #   distance <list>, altitude <list>, time <list>, heartrate <list>,
## #   cadence <list>, watts <list>
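
The following toy example, with made-up data, shows what pivot_wider() does with a list column. An activity that lacks a stream type ends up with NULL in the corresponding column.

```r
library(tibble)
library(tidyr)

# Made-up long-format streams: activity "a" has time and heartrate,
# activity "b" was recorded without heart rate data.
df_long <- tibble(
  id = c("a", "a", "b"),
  type = c("time", "heartrate", "time"),
  data = list(0:2, c(120L, 125L, 130L), 0:1))

df_wide <- pivot_wider(df_long, names_from = type, values_from = data)

df_wide$heartrate  # second element is NULL because "b" has no heartrate
```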

Preprocess and unnest the data. The column latlng needs special
attention, because it contains latitude and longitude information.
Separate the two measurements before unnesting all list columns.

name	command	pattern	cue_mode
df_meas_pro	meas_pro(df_meas_wide)	NA	thorough

meas_pro <- function(df_meas_wide) {
  df_meas_wide %>%
    mutate(
      lat = map_if(
        .x = latlng, .p = ~ !is.null(.x), .f = ~ .x[, 1]),
      lng = map_if(
        .x = latlng, .p = ~ !is.null(.x), .f = ~ .x[, 2])) %>%
    select(-c(latlng, original_size, resolution)) %>%
    unnest(where(is_list))
}

After this step every row is one point in time and every column is (if
present) a measurement at this point in time.

## # A tibble: 2,111,660 x 13
##    series_type id    moving velocity_smooth grade_smooth distance altitude  time
##    <chr>       <chr> <lgl>            <dbl>        <dbl>    <dbl>    <dbl> <int>
##  1 distance    5975~ FALSE              0            1.5      0       570.     0
##  2 distance    5975~ TRUE               0            2        6.7     570.     1
##  3 distance    5975~ TRUE               0            2.2     13.5     571.     2
##  4 distance    5975~ TRUE               0            2.9     20.5     571.     3
##  5 distance    5975~ TRUE               6.9          2.1     27.6     571      4
##  6 distance    5975~ TRUE               6.9          1.4     34.7     571.     5
##  7 distance    5975~ TRUE               7            1.4     41.5     571.     6
##  8 distance    5975~ TRUE               7            0.7     48.4     571.     7
##  9 distance    5975~ TRUE               7            0.7     55.3     571.     8
## 10 distance    5975~ TRUE               6.9          1.5     62.2     571.     9
## # ... with 2,111,650 more rows, and 5 more variables: heartrate <int>,
## #   cadence <int>, watts <int>, lat <dbl>, lng <dbl>
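
The latitude and longitude split in meas_pro() can be checked in isolation. The sketch below uses a made-up latlng list with one two-column matrix and one NULL entry, as would occur for an activity without position data.

```r
library(purrr)

# Made-up latlng column: a 2 x 2 matrix for an activity with GPS data
# and NULL for an activity without any position data.
latlng <- list(
  matrix(c(48.1, 48.2, 9.0, 9.1), ncol = 2),
  NULL)

# Take column 1 (latitude) and column 2 (longitude) only where data exists
lat <- map_if(latlng, ~ !is.null(.x), ~ .x[, 1])
lng <- map_if(latlng, ~ !is.null(.x), ~ .x[, 2])

lat[[1]]  # 48.1 48.2
lng[[1]]  # 9.0 9.1
```

Elements that fail the predicate are passed through unchanged, so the NULL entry stays NULL instead of causing a subsetting error.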

Visualisation

Visualize the final data by displaying the geospatial information in the
data. Every facet is one activity. Keep the rest of the plot as minimal
as possible.

name	command	pattern	cue_mode
gg_meas	vis_meas(df_meas_pro)	NA	thorough

vis_meas <- function(df_meas_pro) {
  df_meas_pro %>%
    filter(!is.na(lat)) %>%
    ggplot(aes(x = lng, y = lat)) +
    geom_path() +
    facet_wrap(~ id, scales = "free") +
    theme(
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      legend.position = "bottom",
      panel.background = element_blank(),
      panel.border = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      plot.background = element_blank(),
      strip.text = element_blank())
}

This is a submission to the R Views Call for Documentation. For more information see rviews.rstudio.com.