Hi everyone!
I am trying to retrieve some xml data with Swedish election statistics and create a data frame in R out of them, but I'm not that familiar with xml files and struggle to get the result I want. Any help would be greatly appreciated.
The data is published by the Swedish Election Authority as a zipped folder with many xml files. The folder contains files for each of the 290 municipalities (files with 4 digit codes) and each election type, where the final letter in the filename indicates the type of election: R=national parliament, L=county council, K=municipal council (for county councils there is only 289 municipalities). The folder also contains 3 XML files for total results at the national level for each of the three election types. I've managed to read the files into R with the following code:
library(xml2)
library(tidyverse)
tf <- tempfile(tmpdir = tdir <- tempdir())
download.file("https://data.val.se/val/val2014/valnatt/valnatt.zip", tf)
xml_files <- unzip(tf, exdir = tdir)
The XML files with municipal data have the following structure (lines deleted for clarity):
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/html"?>
<!DOCTYPE VAL PUBLIC "-//Valmyndigheten//DTD Valresultat parti kommun 1.5//SV" "http://www.val.se/dtd/resultat/parti_kommun_1_5.dtd">
<VAL TILLFÄLLE="Allmänna val 14 september 2014" FILNAMN="valnatt_0114R.xml" RAPPORTERING="VALNATTSRAPPORTERING" VALTYP="Riksdagsval" VALDAG="20140914" VALDAG_FGVAL="20100919" TID_RAPPORT="20140916105203">
<PARTI FÖRKORTNING="M" BETECKNING="Moderaterna" FÄRG="#66BEE6" />
<KOMMUN KOD="0114" NAMN="Upplands Väsby" TYP="Summering" KLARA_VALDISTRIKT="22" ALLA_VALDISTRIKT="22" RÖSTER="23638" RÖSTER_FGVAL="22215" TID_RAPPORT="20140914230336" MODNR="117144935">
<GILTIGA PARTI="M" RÖSTER="6748" RÖSTER_FGVAL="8201" PROCENT="28,5" PROCENT_FGVAL="36,9" PROCENT_ÄNDRING="-8,4"/>
<GILTIGA PARTI="C" RÖSTER="901" RÖSTER_FGVAL="891" PROCENT="3,8" PROCENT_FGVAL="4,0" PROCENT_ÄNDRING="-0,2"/>
<KRETS_KOMMUN KOD="011401" NAMN="Norra valkretsen" TYP="Summering" KLARA_VALDISTRIKT="12" ALLA_VALDISTRIKT="12" RÖSTER="11907" RÖSTER_FGVAL="11202" TID_RAPPORT="20140914222651" MODNR="117118974">
<GILTIGA PARTI="M" RÖSTER="3083" RÖSTER_FGVAL="3860" PROCENT="25,9" PROCENT_FGVAL="34,5" PROCENT_ÄNDRING="-8,6"/>
<GILTIGA PARTI="C" RÖSTER="440" RÖSTER_FGVAL="431" PROCENT="3,7" PROCENT_FGVAL="3,8" PROCENT_ÄNDRING="-0,2"/>
<VALDISTRIKT KOD="01140212" NAMN="Smedby Södra" RÖSTER="1201" RÖSTER_FGVAL="1186" TID_RAPPORT="20140914230336" MODNR="117144935">
<GILTIGA PARTI="M" RÖSTER="227" RÖSTER_FGVAL="336" PROCENT="18,9" PROCENT_FGVAL="28,3" PROCENT_ÄNDRING="-9,4"/>
<GILTIGA PARTI="C" RÖSTER="35" RÖSTER_FGVAL="17" PROCENT="2,9" PROCENT_FGVAL="1,4" PROCENT_ÄNDRING="+1,5"/>
<GILTIGA PARTI="FP" RÖSTER="43" RÖSTER_FGVAL="61" PROCENT="3,6" PROCENT_FGVAL="5,1" PROCENT_ÄNDRING="-1,6"/>
<ÖVRIGA_GILTIGA RÖSTER="20" RÖSTER_FGVAL="10" PROCENT="1,7" PROCENT_FGVAL="0,8" PROCENT_ÄNDRING="+0,8"/>
<OGILTIGA TEXT="BLANK" RÖSTER="12" RÖSTER_FGVAL="13" PROCENT="1,0" PROCENT_FGVAL="1,1" PROCENT_ÄNDRING="-0,1"/>
<OGILTIGA TEXT="OG" RÖSTER="13" RÖSTER_FGVAL="1" PROCENT="1,1" PROCENT_FGVAL="0,1" PROCENT_ÄNDRING="+1,0"/>
<VALDELTAGANDE RÖSTBERÄTTIGADE="1551" RÖSTBERÄTTIGADE_KLARA_VALDISTRIKT_FGVAL="1546" SUMMA_RÖSTER="1226" SUMMA_RÖSTER_FGVAL="1200" PROCENT="79,0" PROCENT_FGVAL="77,6" PROCENT_ÄNDRING="+1,4"/>
</VALDISTRIKT>
</KRETS_KOMMUN>
Now, I would like for each file to get the data at the VALDISTRIKT nodes and below and create a data frame. For a specific file, I have managed to get the data I want into two separate data frames with the following code, where data frame top
includes the information directly after VALDISTRIKT (i.e. KOD, NAMN,...,MODNR), and the data frame below
includes the information in the nodes below VALDISTRIKT.
## Parse the 4th file in the folder (first file with municipal data reg. municipal election)
t <- read_xml(xml_files[4])
top <- xml_find_all(t, "//VALDISTRIKT")
top <- top %>%
map(xml_attrs) %>%
map_df(~as.list(.))
below <- xml_find_all(t, "//VALDISTRIKT/*")
below <- below %>%
map(xml_attrs) %>%
map_df(~as.list(.))
However, I would like to combine them into one dataset, where the information in top
becomes variables where the information for a VALDISTRIKT is repeated (filled) for all row that contains information from the nodes that belong to that specific VALDISTRIKT. Following my example xml structure above, I would like the variable KOD to have the value "01140212" for all the rows in the data frame that have information that is located in nodes below, and so on. I've seen some answers on SO on this, but they refer to a simpler structure and I can't get them to work with my data.
In a second step, I would like to combine the information from all the files that have a 4 digit code and 'K' as the last letter in the file name. I assume I should make a function and use purrr in some way to read each file and append them into one data frame.
Any help or suggestions on where to find further information about how to do this (or suggestions of a better approach) would be greatly appreciated.
Updated with a reprex example below
library(xml2)
library(tidyverse)
# Make a temporary file (tf) and a temporary folder (tdir)
tf <- tempfile(tmpdir = tdir <- tempdir())
## Download the zip file
download.file("https://data.val.se/val/val2014/valnatt/valnatt.zip", tf)
## Unzip it in the temp folder
xml_files <- unzip(tf, exdir = tdir)
## Parse the 4th file in the folder (first file with municipal data reg. municipal election)
t <- read_xml(xml_files[4])
top <- xml_find_all(t, "//VALDISTRIKT")
top %>% map(xml_attrs) %>%
map_df(~as.list(.))
#> # A tibble: 22 x 6
#> KOD NAMN `RÖSTER` `RÖSTER_FGVAL` TID_RAPPORT MODNR
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 01140206 Apoteksskogen 926 845 20140914222638 117118~
#> 2 01140223 Brunnby 939 784 20140914232815 117167~
#> 3 01140203 Folkparken 1376 1304 20140914232840 117168~
#> 4 01140204 Fysingen 827 800 20140914221446 117110~
#> 5 01140218 Korpkulla 1036 894 20140914231809 117158~
#> 6 01140102 Mälaren 948 911 20140914222351 117117~
#> 7 01140121 Nedra Runby 918 841 20140914221617 117112~
#> 8 01140313 Odenslunda 969 964 20140914230555 117147~
#> 9 01140211 Skälby 1022 909 20140914224426 117130~
#> 10 01140205 Vilunda 1180 1159 20140914232556 117165~
#> # ... with 12 more rows
below <- xml_find_all(t, "//VALDISTRIKT/*")
below %>% map(xml_attrs) %>%
map_df(~as.list(.))
#> # A tibble: 330 x 11
#> PARTI `RÖSTER` `RÖSTER_FGVAL` PROCENT PROCENT_FGVAL `PROCENT_ÄNDRIN~
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 M 136 187 14,7 22,1 -7,4
#> 2 C 14 17 1,5 2,0 -0,5
#> 3 FP 42 63 4,5 7,5 -2,9
#> 4 KD 44 47 4,8 5,6 -0,8
#> 5 S 363 358 39,2 42,4 -3,2
#> 6 V 113 62 12,2 7,3 +4,9
#> 7 MP 97 64 10,5 7,6 +2,9
#> 8 SD 66 46 7,1 5,4 +1,7
#> 9 FI 0 <NA> 0,0 <NA> <NA>
#> 10 PP 7 <NA> 0,8 <NA> <NA>
#> # ... with 320 more rows, and 5 more variables: TEXT <chr>,
#> # `RÖSTBERÄTTIGADE` <chr>,
#> # `RÖSTBERÄTTIGADE_KLARA_VALDISTRIKT_FGVAL` <chr>,
#> # `SUMMA_RÖSTER` <chr>, `SUMMA_RÖSTER_FGVAL` <chr>
Created on 2018-06-27 by the reprex package (v0.2.0).
All the best,
Richard