I have been trying to figure out how to generate the correct data structure for input data to a keras LSTM in R.
My current workflow has been to generate the data in R, export it as a CSV, read it into Python, and reshape it there before feeding it to the model. Since Keras is now available in R, I'd like to drop the Python steps entirely.
The input to an LSTM needs to be a 3-dimensional array, with the dimensions being: training sample, time step, and feature. Here is a toy example for a dataset with 3 samples, each with 4 time steps and 2 features.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
repr <- tibble(
  id = c(rep('a', 4), rep('b', 4), rep('c', 4)),
  time_step = rep(1:4, 3),
  feature_1 = seq(from = 1, to = 12) / 10,
  feature_2 = seq(from = 1, to = 12) / 100
)
repr
#> # A tibble: 12 x 4
#>    id    time_step feature_1 feature_2
#>    <chr>     <int>     <dbl>     <dbl>
#>  1 a             1       0.1      0.01
#>  2 a             2       0.2      0.02
#>  3 a             3       0.3      0.03
#>  4 a             4       0.4      0.04
#>  5 b             1       0.5      0.05
#>  6 b             2       0.6      0.06
#>  7 b             3       0.7      0.07
#>  8 b             4       0.8      0.08
#>  9 c             1       0.9      0.09
#> 10 c             2       1        0.1
#> 11 c             3       1.1      0.11
#> 12 c             4       1.2      0.12
repr[, 3:4]
#> # A tibble: 12 x 2
#>    feature_1 feature_2
#>        <dbl>     <dbl>
#>  1       0.1      0.01
#>  2       0.2      0.02
#>  3       0.3      0.03
#>  4       0.4      0.04
#>  5       0.5      0.05
#>  6       0.6      0.06
#>  7       0.7      0.07
#>  8       0.8      0.08
#>  9       0.9      0.09
#> 10       1        0.1
#> 11       1.1      0.11
#> 12       1.2      0.12
repr[, 3:4] %>% write.csv(file = 'reprex.csv', row.names = FALSE)
In Python, I could execute the following and use the result as training input. Notice the simple reshape operation, which turns the two feature columns into the appropriate 3-dimensional input with dimensions [3 samples] x [4 time steps] x [2 features]:
from numpy import genfromtxt

m = genfromtxt('reprex.csv', delimiter=',', skip_header=1)
m
# array([[0.1 , 0.01],
# [0.2 , 0.02],
# [0.3 , 0.03],
# [0.4 , 0.04],
# [0.5 , 0.05],
# [0.6 , 0.06],
# [0.7 , 0.07],
# [0.8 , 0.08],
# [0.9 , 0.09],
# [1. , 0.1 ],
# [1.1 , 0.11],
# [1.2 , 0.12]])
m.reshape(3, 4, 2)
# array([[[0.1 , 0.01],
# [0.2 , 0.02],
# [0.3 , 0.03],
# [0.4 , 0.04]],
#
# [[0.5 , 0.05],
# [0.6 , 0.06],
# [0.7 , 0.07],
# [0.8 , 0.08]],
#
# [[0.9 , 0.09],
# [1. , 0.1 ],
# [1.1 , 0.11],
# [1.2 , 0.12]]])
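For the toy example, the R object I am after is that same 3 x 4 x 2 array. Built by hand, purely to make the goal concrete (target is just a name I made up), it would be:
# Build the desired 3 x 4 x 2 array by hand for the toy example,
# one sample at a time ("target" is a placeholder name)
target <- array(0, dim = c(3, 4, 2))
target[1, , ] <- as.matrix(repr[repr$id == 'a', 3:4])
target[2, , ] <- as.matrix(repr[repr$id == 'b', 3:4])
target[3, , ] <- as.matrix(repr[repr$id == 'c', 3:4])
# target[1, , ] is the 4 x 2 time-step-by-feature matrix for sample 'a',
# matching m.reshape(3, 4, 2)[0] from the Python snippet above.
One assignment (or loop iteration) per sample clearly works, but it seems clunky for tens of thousands of samples, which is why I am hoping there is a cleaner reshape.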
The R Keras examples at https://cran.rstudio.com/web/packages/keras/vignettes/sequential_model.html (under Stacked LSTM for sequence classification) have a hint about how I might proceed.
x_train <- array(runif(1000 * timesteps * data_dim), dim = c(1000, timesteps, data_dim))
Would I somehow need to stack all of the feature columns of the data frame into one long vector and then call array(..., dim = c(num_samples, num_timesteps, num_features))? Is there a sensible way to unlist the data frame that puts the elements in the proper order?
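To make the question concrete, the kind of thing I have been imagining is sketched below (x_vec and x_guess are just scratch names). I suspect the ordering comes out wrong because array() fills its first dimension fastest, while my rows are ordered one sample at a time:
# Sketch of what I have been imagining: stack the feature columns into one
# long vector and hand it to array() -- I don't think the fill order is right
x_vec <- repr %>%
  select(feature_1, feature_2) %>%
  unlist()
x_guess <- array(x_vec, dim = c(3, 4, 2))
x_guess[1, , ]
# This slice does not reproduce the 4 x 2 matrix for sample 'a', because
# array() fills column-major while the data frame is ordered sample by sample.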
In my actual dataset, I have some tens of thousands of samples, only 4 time steps, and a few dozen features.
Thanks much for any help.