i.e., the tags list is one column in the csv. How is this best managed with R, as the number of tags is ont always the same (so splitting in columns might not fit well), and I would like to be able to filet rows based on tags, eg only keep rows having tag1, or only keep rows with tag1 but not tag2.
Is there an easy way to handle this with R/RStdio ?
A simple approach would be to create new columns for each tag and fill those columns with 1 or 0 if that particular tag is contained in the original tag column. If you many tags, this approach could get somewhat cumbersome, but you should be able to abstract it into a function if you need.
Do I understand correctly that this function would take the third column of the csv, split it on ,, then iterate over each tag extracted to set columns to 0 or 1? I'm an R beginner, so if there are interesting pointers to do that, I'm interested.
That's what I was thinking, but then I started playing around a little I discovered the separate_rows() function from the tidyr package which essentially does all the work for you! From the documentation:
If a variable contains observations with multiple delimited values, this separates the values and places each one in its own row.
It's nearly working @mfherman!
The problem I encounter occurs at the spread step because (I think) I have pairs of identical rows in the original csv, and I get:
The form of your data frame all depends on what you need to do with. If you gave a little more detail about the final output/product you are looking for, maybe I would have some other ideas. separate would work if you don't need each tag to be in its own corresponding column. With your example data, you would get something like this:
library(tidyverse)
df <- tribble(
~code, ~value, ~tags,
"20180901","1.234,45", "tag1,tag2,tag3",
"20180905","43,50", "tag6,tag2"
)
separate(df, tags, sep = ",", into = c("t1", "t2", "t3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
#> # A tibble: 2 x 5
#> code value t1 t2 t3
#> <chr> <chr> <chr> <chr> <chr>
#> 1 20180901 1.234,45 tag1 tag2 tag3
#> 2 20180905 43,50 tag6 tag2 <NA>
I kept things simple in the beginning to focus on the problem, but here are some more details @mfherman.
I actually work on a csv file tracking my expenses, each row having six columns:
date
category (I don't use it, it is always Other)
amount
currency (always €)
note (I use that as a category, so notes are from a limited, but possibly growing set)
tags (variable number of tags from a possibly growing set)