I am seeking a way to filter a dataset - keeping just few columns and rows - which is very big (french population census) before importing to a R dataframe.
R cannot directly import the data set given its volume (> 250 mo)...
Basically, i'm looking for an Extract-Transform-Load solution with R (which I love !).
One way to filter by rows in Pandas is to use boolean expression . We first create a boolean variable by taking the column of interest and checking if its value equals to the specific value that we want to select/keep. For example, let us filter the dataframe or subset the dataframe based on year's value 2002..
Edit: I forgot to mention that the R package duckdb works together with DBI and that the database can work in-memory without a separate server.
Edit: below an example of what I meant to do with selection and filtering during loading.
However I do not know at the exact moment that this is done (because I use only a small input file).
library(duckdb)
#> Loading required package: DBI
# original file (note long filename split over two lines)
# https://www.dnb.nl/statistieken/data-zoeken/#/details/aandelenbeursindices/dataset/
# 71497a3a-391b-41e3-a858-0d1cc6475208/resource/64131a7e-3eaa-45f1-a915-b47dbf05b517
# in local file the column "Periode " was changed to "Periode"
filename <- r'(D:\data\R\RStudio_Community\Aandelenbeursindices per dag 2021H2.csv)'
con <- dbConnect(duckdb())
str(dbGetInfo(con))
#> List of 5
#> $ dbname : chr ":memory:"
#> $ db.version: chr "0.3.2-dev1"
#> $ username : logi NA
#> $ host : logi NA
#> $ port : logi NA
sql <- glue::glue(
"CREATE TABLE WW AS SELECT \"Soort Index\", \"Periode\", \"waarde\" FROM \'{filename}\' a WHERE a.\"Periode\" = '2021-07-01' ;")
print(sql)
#> CREATE TABLE WW AS SELECT "Soort Index", "Periode", "waarde" FROM 'D:\data\R\RStudio_Community\Aandelenbeursindices per dag 2021H2.csv' a WHERE a."Periode" = '2021-07-01' ;
dbExecute(con,sql)
#> [1] 9
DBI::dbListTables(con)
#> [1] "ww"
df1<-DBI::dbReadTable(con,"WW")
str(df1)
#> 'data.frame': 9 obs. of 3 variables:
#> $ Soort.index: chr "AEX-index " "Midkap-index " "All-share index " "Financiële instellingen " ...
#> $ Periode : Date, format: "2021-07-01" "2021-07-01" ...
#> $ waarde : num 731 1050 1042 308 34634 ...
dbDisconnect(con)
Created on 2021-11-27 by the reprex package (v2.0.0)
It's weird because, normally when you use :: method you don't need to download en load the package, right ?
Anyway, it's work fine for me, here my code (sorry for comments in French...) :
# Import des données
# Installation et chargement du package dédié à l'import
install.packages("readr")
library(readr)
# Fonction pour créer un sous-ensemble du jeu de données d'origine (ici nommé x)
f = function(x, pos)
subset(x,
REGION == "93" & NAF08 !="ZZZZZ",
select = c(REGION, IPONDI,DIPL,EMPL,NAF08,SEXE,STAT,TACTD16,TP,AGED))
# Application de la fonction f() et stockage du sous-ensemble généré dans une table R (rp2018PacaV2)
## On va appliquer la fonction sur des portions (chunks) du jeu de données d'origine, petit à petit (par lot de 10 000 lignes)
## avant d'importer les données filtrées.
rp2018PacaV2=read_csv2_chunked(
file ="C:/Users/fabien/Desktop/Import_parties/sources/FD_INDREG_2018.csv", # Chemin d'accès aux données sources
DataFrameCallback$new(f), # Appel de la fonction créée plus haut
chunk_size = 10000, # Nombre de lignes à traiter pour chaque répétition de la fonction f()
col_names = TRUE) # La première ligne contient les noms des colonnes
# Export des données de sortie au format Excel
## Installation et chargement du paquet
install.packages("openxlsx")
library(openxlsx)
## Création d'un fichier et d'un classeur Excel vide
fileName = "rp2018PacaV2.xlsx"
excel <- createWorkbook(fileName)
## Création d'une feuille Excel vide
firstSheet <- "Feuille1"
## Ajout de la feuille dans le classeur
addWorksheet(excel, firstSheet)
## Ajout des données de la table R dans le classeur
writeData(excel, sheet = 1, rp2018PacaV2)
## Sauvegarde des données ajoutées dans le fichier Excel
saveWorkbook(excel, file = fileName, overwrite = TRUE)
That's great, happy you have a solution.
Just to clarify. when you use :: to access a function from a package, you get that function and it can use internal function calls from that same package, but you still can't. you would need to use :: in your own code to access the other function like readr::DataFrameCallback, however, your solution of loading the entire library is a great solution. library(readr) then your code neednt use :: unless there is some namespace ambiguity to clarify.
Worth noting that the arrow package is relatively new but is oriented around solving problems like this. Not sure whether it is a perfect fit for your environment or not. Just wanted to mention in case it is ever useful for you or others!
I think the data.table library would be useful. The fread function allows a user to specify which columns to import via the select parameter. Also, users can specify the number of threads to make import more efficient.
I haven't previously filtered records prior to importing, however, data.table is so efficient that filtering post import shouldn't be an issue. And thanks to recent updates you can use dplyr syntax with data.tabels