Apply R script over chunks or batches taken from an SQL database

August · March 16, 2022, 5:13pm

Hi All,

I am having a problem with batch processing, namely I don't know how to do it outside of sprak, and as my organization doesn't have cluster set up I don't think I can use it. I certainly cannot transfer the data over.

My main problem is that my script is set up for doing nested tables for a batch of the item groups (n = 20) which are pulled into R from SQL via odbc. These groups are used to select item ids in order to get the related time-series data. The format after the querey would be item_grup as the nested tables row ids and the time series data as dataframes nested in the rows for the relevant item groups.

The script then goes through a considerable amount of wrangling and calculations, which is all processed in parallel before finally creating some metrics for each item group that are then fed back into the database. While I attempt to reduce memory demand at each stage, the nested table still reaches around 6gb of ram for a nested table of just 10 item groups. So at 20 we are about at the maximum here for local processing, and I need to work through 2000 active groups.

What I need to do is apply this process/script to several thousand item groups. Which I am not sure how to do, I was thinking I could get the item groups database and add create grouping numbers for each chunk of item groups, to be processed, and then apply the script to each chunk of item groups. But this last part is what I don't have a clue on how to do, and I'm not sure how to make the script work through chunks of item groups iteratively.

I'm not used to putting things into production as most my work is experiments which developers put into practice, so I would appreciate any ideas, thanks for reading.

image below is basic structure of nested table before wrangling.
image (19)

John_Rambo · March 16, 2022, 9:46pm

I use something like this on Ubuntu (shell-script), to trigger an R-Script subsequently...where 'i' could be an argument given to the r-script and therefore to choose a specific SQL-Query.


FROMHERE=1
    array=( ChunkA ChunkB )
    for ((c=FROMHERE; c<=1; c++))
    do  
        cycleSamples=$(( c ))
        echo $cycleSamples
        for i in "${array[@]}"
        do
            Rscript --vanilla scoring.R $i $cycleSamples
        done
    done

system · April 6, 2022, 9:46pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.