Generate a data frame from many xml files

richardo · June 27, 2018, 7:44am

Hi everyone!

I am trying to retrieve some xml data with Swedish election statistics and create a data frame in R out of them, but I'm not that familiar with xml files and struggle to get the result I want. Any help would be greatly appreciated.

The data is published by the Swedish Election Authority as a zipped folder with many xml files. The folder contains files for each of the 290 municipalities (files with 4-digit codes) and each election type, where the final letter in the filename indicates the type of election: R=national parliament, L=county council, K=municipal council (for county councils there are only 289 municipalities). The folder also contains 3 XML files for total results at the national level, one for each of the three election types. I've managed to read the files into R with the following code:

library(xml2)
library(tidyverse)

tf <- tempfile(tmpdir = tdir <- tempdir())
download.file("https://data.val.se/val/val2014/valnatt/valnatt.zip", tf)
xml_files <- unzip(tf, exdir = tdir)

The XML files with municipal data have the following structure (lines deleted for clarity):

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/html"?>
<!DOCTYPE VAL PUBLIC "-//Valmyndigheten//DTD Valresultat parti kommun 1.5//SV" "http://www.val.se/dtd/resultat/parti_kommun_1_5.dtd">
<VAL TILLFÄLLE="Allmänna val 14 september 2014" FILNAMN="valnatt_0114R.xml" RAPPORTERING="VALNATTSRAPPORTERING" VALTYP="Riksdagsval" VALDAG="20140914" VALDAG_FGVAL="20100919" TID_RAPPORT="20140916105203">
  <PARTI FÖRKORTNING="M" BETECKNING="Moderaterna" FÄRG="#66BEE6" />
  <KOMMUN KOD="0114" NAMN="Upplands Väsby" TYP="Summering" KLARA_VALDISTRIKT="22" ALLA_VALDISTRIKT="22" RÖSTER="23638" RÖSTER_FGVAL="22215" TID_RAPPORT="20140914230336" MODNR="117144935">
    <GILTIGA PARTI="M" RÖSTER="6748" RÖSTER_FGVAL="8201" PROCENT="28,5" PROCENT_FGVAL="36,9" PROCENT_ÄNDRING="-8,4"/>
    <GILTIGA PARTI="C" RÖSTER="901" RÖSTER_FGVAL="891" PROCENT="3,8" PROCENT_FGVAL="4,0" PROCENT_ÄNDRING="-0,2"/>
    <KRETS_KOMMUN KOD="011401" NAMN="Norra valkretsen" TYP="Summering" KLARA_VALDISTRIKT="12" ALLA_VALDISTRIKT="12" RÖSTER="11907" RÖSTER_FGVAL="11202" TID_RAPPORT="20140914222651" MODNR="117118974">
      <GILTIGA PARTI="M" RÖSTER="3083" RÖSTER_FGVAL="3860" PROCENT="25,9" PROCENT_FGVAL="34,5" PROCENT_ÄNDRING="-8,6"/>
      <GILTIGA PARTI="C" RÖSTER="440" RÖSTER_FGVAL="431" PROCENT="3,7" PROCENT_FGVAL="3,8" PROCENT_ÄNDRING="-0,2"/>
      <VALDISTRIKT KOD="01140212" NAMN="Smedby Södra" RÖSTER="1201" RÖSTER_FGVAL="1186" TID_RAPPORT="20140914230336" MODNR="117144935">
         <GILTIGA PARTI="M" RÖSTER="227" RÖSTER_FGVAL="336" PROCENT="18,9" PROCENT_FGVAL="28,3" PROCENT_ÄNDRING="-9,4"/>
         <GILTIGA PARTI="C" RÖSTER="35" RÖSTER_FGVAL="17" PROCENT="2,9" PROCENT_FGVAL="1,4" PROCENT_ÄNDRING="+1,5"/>
         <GILTIGA PARTI="FP" RÖSTER="43" RÖSTER_FGVAL="61" PROCENT="3,6" PROCENT_FGVAL="5,1" PROCENT_ÄNDRING="-1,6"/>
         <ÖVRIGA_GILTIGA RÖSTER="20" RÖSTER_FGVAL="10" PROCENT="1,7" PROCENT_FGVAL="0,8" PROCENT_ÄNDRING="+0,8"/>
         <OGILTIGA TEXT="BLANK" RÖSTER="12" RÖSTER_FGVAL="13" PROCENT="1,0" PROCENT_FGVAL="1,1" PROCENT_ÄNDRING="-0,1"/>
         <OGILTIGA TEXT="OG" RÖSTER="13" RÖSTER_FGVAL="1" PROCENT="1,1" PROCENT_FGVAL="0,1" PROCENT_ÄNDRING="+1,0"/>
         <VALDELTAGANDE RÖSTBERÄTTIGADE="1551" RÖSTBERÄTTIGADE_KLARA_VALDISTRIKT_FGVAL="1546" SUMMA_RÖSTER="1226" SUMMA_RÖSTER_FGVAL="1200" PROCENT="79,0" PROCENT_FGVAL="77,6" PROCENT_ÄNDRING="+1,4"/>
      </VALDISTRIKT>
   </KRETS_KOMMUN>

Now, I would like for each file to get the data at the VALDISTRIKT nodes and below and create a data frame. For a specific file, I have managed to get the data I want into two separate data frames with the following code, where data frame top includes the information directly after VALDISTRIKT (i.e. KOD, NAMN,...,MODNR), and the data frame below includes the information in the nodes below VALDISTRIKT.

## Parse the 4th file in the folder (first file with municipal data reg. municipal election)
t <- read_xml(xml_files[4])
top <- xml_find_all(t, "//VALDISTRIKT")
top <- top %>% 
        map(xml_attrs) %>% 
        map_df(~as.list(.))

below <- xml_find_all(t, "//VALDISTRIKT/*")
below <- below %>% 
    map(xml_attrs) %>% 
    map_df(~as.list(.))

However, I would like to combine them into one dataset, where the information in top becomes variables whose values for a VALDISTRIKT are repeated (filled) for all rows that contain information from the nodes belonging to that specific VALDISTRIKT. Following my example xml structure above, I would like the variable KOD to have the value "01140212" for all the rows in the data frame that hold information from the nodes below that VALDISTRIKT, and so on. I've seen some answers on SO on this, but they refer to a simpler structure and I can't get them to work with my data.

In a second step, I would like to combine the information from all the files that have a 4 digit code and 'K' as the last letter in the file name. I assume I should make a function and use purrr in some way to read each file and append them into one data frame.

Any help or suggestions on where to find further information about how to do this (or suggestions of a better approach) would be greatly appreciated.

Updated with a reprex example below

library(xml2)
library(tidyverse)

# Make a temporary file (tf) and a temporary folder (tdir)
tf <- tempfile(tmpdir = tdir <- tempdir())

## Download the zip file 
download.file("https://data.val.se/val/val2014/valnatt/valnatt.zip", tf)

## Unzip it in the temp folder
xml_files <- unzip(tf, exdir = tdir)

## Parse the 4th file in the folder (first file with municipal data reg. municipal election)
t <- read_xml(xml_files[4])

top <- xml_find_all(t, "//VALDISTRIKT")
top %>% map(xml_attrs) %>% 
        map_df(~as.list(.))
#> # A tibble: 22 x 6
#>    KOD      NAMN          RÖSTER    RÖSTER_FGVAL    TID_RAPPORT    MODNR  
#>    <chr>    <chr>         <chr>     <chr>           <chr>          <chr>  
#>  1 01140206 Apoteksskogen 926       845             20140914222638 117118~
#>  2 01140223 Brunnby       939       784             20140914232815 117167~
#>  3 01140203 Folkparken    1376      1304            20140914232840 117168~
#>  4 01140204 Fysingen      827       800             20140914221446 117110~
#>  5 01140218 Korpkulla     1036      894             20140914231809 117158~
#>  6 01140102 Mälaren       948       911             20140914222351 117117~
#>  7 01140121 Nedra Runby   918       841             20140914221617 117112~
#>  8 01140313 Odenslunda    969       964             20140914230555 117147~
#>  9 01140211 Skälby        1022      909             20140914224426 117130~
#> 10 01140205 Vilunda       1180      1159            20140914232556 117165~
#> # ... with 12 more rows


below <- xml_find_all(t, "//VALDISTRIKT/*")
below %>% map(xml_attrs) %>% 
    map_df(~as.list(.))
#> # A tibble: 330 x 11
#>    PARTI RÖSTER    RÖSTER_FGVAL    PROCENT PROCENT_FGVAL PROCENT_ÄNDRING  
#>    <chr> <chr>     <chr>           <chr>   <chr>         <chr>            
#>  1 M     136       187             14,7    22,1          -7,4             
#>  2 C     14        17              1,5     2,0           -0,5             
#>  3 FP    42        63              4,5     7,5           -2,9             
#>  4 KD    44        47              4,8     5,6           -0,8             
#>  5 S     363       358             39,2    42,4          -3,2             
#>  6 V     113       62              12,2    7,3           +4,9             
#>  7 MP    97        64              10,5    7,6           +2,9             
#>  8 SD    66        46              7,1     5,4           +1,7             
#>  9 FI    0         <NA>            0,0     <NA>          <NA>             
#> 10 PP    7         <NA>            0,8     <NA>          <NA>             
#> # ... with 320 more rows, and 5 more variables: TEXT <chr>,
#> #   RÖSTBERÄTTIGADE <chr>,
#> #   RÖSTBERÄTTIGADE_KLARA_VALDISTRIKT_FGVAL <chr>,
#> #   SUMMA_RÖSTER <chr>, SUMMA_RÖSTER_FGVAL <chr>

Created on 2018-06-27 by the reprex package (v0.2.0).

All the best,
Richard

mara · June 27, 2018, 11:05am

It's hard to envision exactly what you're describing without seeing a sample of the two (three?) data frames. Could you please turn this into a self-contained reprex (short for reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ.

richardo · June 27, 2018, 11:42am

Sorry, I will update the question. Richard

mara · June 27, 2018, 11:43am

No problem! You've got most of the pieces there, it'll just be much easier to troubleshoot with the input and output visible!

richardo · June 27, 2018, 11:58am

Added a reprex, hopefully it's possible to understand what I would like to do, but let me know if I need to clarify further.

richardo · June 27, 2018, 5:27pm

After searching more on SO I found a solution for the first step using lapply in this post https://stackoverflow.com/questions/34273132/r-how-to-convert-xml-to-dataframe-in-r-with-the-correct-structure/34273941#34273941. Using my data, it would look something like this for one file.

library(xml2)
library(tidyverse)

# Make a temporary file (tf) and a temporary folder (tdir)
tf <- tempfile(tmpdir = tdir <- tempdir())

## Download the zip file 
download.file("https://data.val.se/val/val2014/valnatt/valnatt.zip", tf)

## Unzip it in the temp folder
xml_files <- unzip(tf, exdir = tdir)

## Parse the 4th file in the folder (first file with municipal data reg. municipal election)
t <- read_xml(xml_files[4])

# bind the data.frames built in the iterator together
df <- bind_rows(lapply(xml_find_all(t, "//VALDISTRIKT"), function(x) {
    
    # extract the attributes from the parent tag as a data.frame
    parent <- data.frame(as.list(xml_attrs(x)), stringsAsFactors=FALSE)
    
    # make a data.frame out of the attributes of the kids
    kids <- bind_rows(lapply(xml_children(x), function(x) as.list(xml_attrs(x))))
    
    # combine them
    cbind.data.frame(parent, kids, stringsAsFactors=FALSE)
    
}))
head(df, 16)
#>         KOD          NAMN  RÖSTER  RÖSTER_FGVAL    TID_RAPPORT     MODNR
#> 1  01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 2  01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 3  01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 4  01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 5  01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 6  01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 7  01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 8  01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 9  01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 10 01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 11 01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 12 01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 13 01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 14 01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 15 01140206 Apoteksskogen     926           845 20140914222638 117118920
#> 16 01140223       Brunnby     939           784 20140914232815 117167932
#>    PARTI  RÖSTER  RÖSTER_FGVAL PROCENT PROCENT_FGVAL  PROCENT_ÄNDRING
#> 1      M     136           187    14,7          22,1             -7,4
#> 2      C      14            17     1,5           2,0             -0,5
#> 3     FP      42            63     4,5           7,5             -2,9
#> 4     KD      44            47     4,8           5,6             -0,8
#> 5      S     363           358    39,2          42,4             -3,2
#> 6      V     113            62    12,2           7,3             +4,9
#> 7     MP      97            64    10,5           7,6             +2,9
#> 8     SD      66            46     7,1           5,4             +1,7
#> 9     FI       0          <NA>     0,0          <NA>             <NA>
#> 10    PP       7          <NA>     0,8          <NA>             <NA>
#> 11    VB      39          <NA>     4,2          <NA>             <NA>
#> 12  <NA>       5             1     0,5           0,1             +0,4
#> 13  <NA>       4             8     0,4           0,9             -0,5
#> 14  <NA>      13             0     1,4           0,0             +1,4
#> 15  <NA>    <NA>          <NA>    64,8          64,6             +0,1
#> 16     M     302           299    32,2          38,1             -6,0
#>     TEXT   RÖSTBERÄTTIGADE   RÖSTBERÄTTIGADE_KLARA_VALDISTRIKT_FGVAL
#> 1   <NA>              <NA>                                      <NA>
#> 2   <NA>              <NA>                                      <NA>
#> 3   <NA>              <NA>                                      <NA>
#> 4   <NA>              <NA>                                      <NA>
#> 5   <NA>              <NA>                                      <NA>
#> 6   <NA>              <NA>                                      <NA>
#> 7   <NA>              <NA>                                      <NA>
#> 8   <NA>              <NA>                                      <NA>
#> 9   <NA>              <NA>                                      <NA>
#> 10  <NA>              <NA>                                      <NA>
#> 11  <NA>              <NA>                                      <NA>
#> 12  <NA>              <NA>                                      <NA>
#> 13 BLANK              <NA>                                      <NA>
#> 14    OG              <NA>                                      <NA>
#> 15  <NA>              1456                                      1320
#> 16  <NA>              <NA>                                      <NA>
#>     SUMMA_RÖSTER  SUMMA_RÖSTER_FGVAL
#> 1           <NA>                <NA>
#> 2           <NA>                <NA>
#> 3           <NA>                <NA>
#> 4           <NA>                <NA>
#> 5           <NA>                <NA>
#> 6           <NA>                <NA>
#> 7           <NA>                <NA>
#> 8           <NA>                <NA>
#> 9           <NA>                <NA>
#> 10          <NA>                <NA>
#> 11          <NA>                <NA>
#> 12          <NA>                <NA>
#> 13          <NA>                <NA>
#> 14          <NA>                <NA>
#> 15           943                 853
#> 16          <NA>                <NA>

Created on 2018-06-27 by the reprex package (v0.2.0).

But I assume I could use purrr instead of lapply? And the second step, i.e. get information from all the 290 files with a filename with 4 digits in it and the letter 'K' at the end and append them to a data frame, still eludes me. Any help would be greatly appreciated.

cderv · June 28, 2018, 6:33am

Here is one way to do it with the tidyverse; I changed a few things along the way. If you just want to replace lapply, use purrr::map; map_dfr also does the bind_rows for you. Be careful: base R is more permissive than the tidyverse. For example, your parent and kids tables seem to share some column names, which is not good!

library(xml2)
library(tidyverse)

# Make a temporary file (tf) and a temporary folder (tdir)
tf <- tempfile(tmpdir = tdir <- tempdir())

## Download the zip file 
download.file("https://data.val.se/val/val2014/valnatt/valnatt.zip", tf)

## Unzip it in the temp folder
xml_files <- unzip(tf, exdir = tdir)

## Parse the 4th file in the folder (first file with municipal data reg. municipal election)
t <- read_xml(xml_files[4])

df <- xml_find_all(t, "//VALDISTRIKT") %>% 
  map_dfr(~ {
    # extract the attributes from the parent tag as a data.frame
    parent <- xml_attrs(.x) %>% enframe() %>% spread(name, value)
    # make a data.frame out of the attributes of the kids
    kids <- xml_children(.x) %>% map_dfr(~ as.list(xml_attrs(.x)))
    # combine them (bind_cols does not repeat parent rows)
    cbind.data.frame(parent, kids) %>% set_tidy_names() %>% as_tibble() 
  })
#> New names:
#> RÖSTER -> RÖSTER..4
#> RÖSTER_FGVAL -> RÖSTER_FGVAL..5
#> RÖSTER -> RÖSTER..8
#> RÖSTER_FGVAL -> RÖSTER_FGVAL..9
df
#> # A tibble: 330 x 17
#>    KOD    MODNR   NAMN     RÖSTER..4    RÖSTER_FGVAL..5  TID_RAPPORT PARTI
#>    <chr>  <chr>   <chr>    <chr>        <chr>            <chr>       <chr>
#>  1 01140~ 117118~ Apoteks~ 926          845              2014091422~ M    
#>  2 01140~ 117118~ Apoteks~ 926          845              2014091422~ C    
#>  3 01140~ 117118~ Apoteks~ 926          845              2014091422~ FP   
#>  4 01140~ 117118~ Apoteks~ 926          845              2014091422~ KD   
#>  5 01140~ 117118~ Apoteks~ 926          845              2014091422~ S    
#>  6 01140~ 117118~ Apoteks~ 926          845              2014091422~ V    
#>  7 01140~ 117118~ Apoteks~ 926          845              2014091422~ MP   
#>  8 01140~ 117118~ Apoteks~ 926          845              2014091422~ SD   
#>  9 01140~ 117118~ Apoteks~ 926          845              2014091422~ FI   
#> 10 01140~ 117118~ Apoteks~ 926          845              2014091422~ PP   
#> # ... with 320 more rows, and 10 more variables: RÖSTER..8 <chr>,
#> #   RÖSTER_FGVAL..9 <chr>, PROCENT <chr>, PROCENT_FGVAL <chr>,
#> #   PROCENT_ÄNDRING <chr>, TEXT <chr>, RÖSTBERÄTTIGADE <chr>,
#> #   RÖSTBERÄTTIGADE_KLARA_VALDISTRIKT_FGVAL <chr>,
#> #   SUMMA_RÖSTER <chr>, SUMMA_RÖSTER_FGVAL <chr>

Created on 2018-06-28 by the reprex package (v0.2.0).

About step 2

You can use fs to manipulate files and folders and stringr to detect your string pattern. Get the list of the files in your directory and keep only those whose name is valnatt_, then 4 digits, then the letter K, using a regex: "valnatt_\\d{4}K\\.xml$"

library(xml2)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.4.4
#> Warning: package 'tidyr' was built under R version 3.4.4
#> Warning: package 'purrr' was built under R version 3.4.4
#> Warning: package 'dplyr' was built under R version 3.4.4
#> Warning: package 'stringr' was built under R version 3.4.4

# Make a temporary file (tf) and a temporary folder (tdir)
tf <- tempfile(tmpdir = tdir <- tempdir())

## Download the zip file 
download.file("https://data.val.se/val/val2014/valnatt/valnatt.zip", tf)

## Unzip it in the temp folder
xml_files <- unzip(tf, exdir = tdir)

files_to_import <- fs::dir_ls(tdir) %>%
  str_subset(pattern = "valnatt_\\d{4}K\\.xml$")  # \\. matches a literal dot
head(files_to_import)
#> [1] "C:/Users/chris/AppData/Local/Temp/Rtmpo5NbU2/valnatt_0114K.xml"
#> [2] "C:/Users/chris/AppData/Local/Temp/Rtmpo5NbU2/valnatt_0115K.xml"
#> [3] "C:/Users/chris/AppData/Local/Temp/Rtmpo5NbU2/valnatt_0117K.xml"
#> [4] "C:/Users/chris/AppData/Local/Temp/Rtmpo5NbU2/valnatt_0120K.xml"
#> [5] "C:/Users/chris/AppData/Local/Temp/Rtmpo5NbU2/valnatt_0123K.xml"
#> [6] "C:/Users/chris/AppData/Local/Temp/Rtmpo5NbU2/valnatt_0125K.xml"
length(files_to_import)
#> [1] 290

Created on 2018-06-28 by the reprex package (v0.2.0).
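
The file-name filter can also be checked on its own, without downloading the archive. The sketch below uses base R's grepl on a few file names; only valnatt_0114K.xml and valnatt_0114R.xml follow names seen in this thread, the others are made up for illustration. The dot is escaped so that it only matches a literal ".".

```r
# File names to test the pattern against. The last three are hypothetical.
candidates <- c(
  "valnatt_0114K.xml",      # municipal council, municipality 0114
  "valnatt_0114R.xml",      # national parliament
  "valnatt_0114L.xml",      # county council (assumed name)
  "valnatt_00K.xml",        # fewer than four digits (assumed name)
  "valnatt_0114K.xml.bak"   # does not end in .xml (assumed name)
)

# \\d{4} requires four digits, \\. is a literal dot, $ anchors at the end
pattern <- "valnatt_\\d{4}K\\.xml$"

selected <- candidates[grepl(pattern, candidates)]
print(selected)
```

Only the first name should survive the filter.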

You can then use purrr::map and friends on this vector, or use fs::dir_map(), but you'll have less control over the result.

Hope it helps.
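
As a minimal sketch of that last step, assuming every file has the VALDISTRIKT layout shown earlier in the thread: wrap the per-file logic in a function and let map_dfr bind the results. The helper name read_valdistrikt, the DISTRIKT_ prefix, the NOD and FIL columns, and the tiny inline XML files are all made up for illustration; they only stand in for the real downloaded files.

```r
library(xml2)
library(purrr)
library(dplyr)

# Hypothetical helper: one row per child node of each VALDISTRIKT,
# with the district's own attributes repeated on every row.
read_valdistrikt <- function(path) {
  doc <- read_xml(path)
  map_dfr(xml_find_all(doc, "//VALDISTRIKT"), function(district) {
    # one-row tibble of the parent's attributes; the prefix avoids
    # name clashes with child attributes of the same name
    parent <- as_tibble(as.list(xml_attrs(district)))
    names(parent) <- paste0("DISTRIKT_", names(parent))
    # one row per child node, keeping the tag name in NOD
    kids <- map_dfr(xml_children(district), function(kid) {
      as_tibble(c(list(NOD = xml_name(kid)), as.list(xml_attrs(kid))))
    })
    # repeat the single parent row once per child, then glue side by side
    bind_cols(parent[rep(1, nrow(kids)), ], kids)
  })
}

# Tiny stand-in files (simplified, partly invented content)
demo_dir <- file.path(tempdir(), "valnatt_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(
  '<VAL><KOMMUN KOD="0114"><VALDISTRIKT KOD="01140206" NAMN="Apoteksskogen"><GILTIGA PARTI="M" PROCENT="14,7"/><GILTIGA PARTI="C" PROCENT="1,5"/></VALDISTRIKT></KOMMUN></VAL>',
  file.path(demo_dir, "valnatt_0114K.xml")
)
writeLines(
  '<VAL><KOMMUN KOD="0115"><VALDISTRIKT KOD="01150101" NAMN="Demo"><GILTIGA PARTI="M" PROCENT="30,0"/><OGILTIGA TEXT="BLANK" PROCENT="1,0"/></VALDISTRIKT></KOMMUN></VAL>',
  file.path(demo_dir, "valnatt_0115K.xml")
)
writeLines("<VAL/>", file.path(demo_dir, "valnatt_0114R.xml"))  # should be skipped

files <- list.files(demo_dir, pattern = "valnatt_\\d{4}K\\.xml$", full.names = TRUE)

# Name the vector so that .id records which file each row came from
all_rows <- map_dfr(set_names(files, basename(files)), read_valdistrikt, .id = "FIL")
print(all_rows)
```

With the real data, files_to_import from the reprex above would take the place of files. Attributes that only some child nodes have (such as TEXT) are filled with NA on the other rows.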

richardo · June 28, 2018, 7:38am

Thank you so much for your help, it is very appreciated! Just a final question on the last line of your comment, so I don't misunderstand: do you mean that I could either use purrr::map or fs::dir_map, but I have less control over the results if I use the latter?

cderv · June 28, 2018, 12:21pm

Yes. I meant that in purrr you have map, but also map_df and others.

fs::dir_map applies a function to each file and returns a list, so you'll have to bind the results afterwards. Moreover, it is applied to all the files, so I am not sure you can filter efficiently.

Just test and you'll see. In the end, it is not so important. purrr::map will work.
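
A small sketch of what that looks like in practice, assuming fs is installed. The folder and its three placeholder files are made up for illustration. Because dir_map calls the function on every entry, the filter has to live inside the function, and the resulting list has to be combined by hand.

```r
library(fs)

# Hypothetical demo folder with three placeholder files
demo_dir <- file.path(tempdir(), "dir_map_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (f in c("valnatt_0114K.xml", "valnatt_0114R.xml", "valnatt_0115K.xml")) {
  writeLines("<VAL/>", file.path(demo_dir, f))
}

pattern <- "valnatt_\\d{4}K\\.xml$"

# dir_map() visits every entry, so unwanted files are rejected inside the function
per_file <- dir_map(demo_dir, function(path) {
  if (!grepl(pattern, path)) return(NULL)
  data.frame(file = basename(path), stringsAsFactors = FALSE)
})
n_visited <- length(per_file)   # one list element per entry, wanted or not

# The result is a plain list: drop the NULLs and bind the rest yourself
per_file <- Filter(Negate(is.null), per_file)
combined <- do.call(rbind, per_file)
print(combined)
```

Filtering the vector of paths first and then mapping over it, as in the reprex above, keeps those two concerns separate.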

richardo · June 28, 2018, 2:44pm

I understand. Thank you once again, I tried using purrr::map and it worked fine.

cderv · June 29, 2018, 5:40am

Glad I could help!

If this works for you, can you mark your topic as solved so that everyone knows it has been answered?
It will then appear as solved in search and in the topic list.

Thanks!

richardo · June 29, 2018, 5:54am

I marked it as solved as soon as I read your first reply, let me know if I did not do that correctly.

Thanks once again!