I start with a data.table of 1M rows:
> str(pcapData2)
Classes ‘data.table’ and 'data.frame': 1000000 obs. of 26 variables:
...
$ ToTpf : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
...
$ RemAdr : chr "10.201.138.166" "10.201.141.167" "10.201.138.166" "10.201.136.199" ...
...
$ LocalCt.Bin : Factor w/ 7 levels "2023-03-14 15:25:00.003021",..: 1 1 1 1 1 1 1 1 1 1 ...
...
$ serverPort : int 60069 60009 60083 60085 60081 60081 60081 60081 60081 60081 ...
- attr(*, ".internal.selfref")=<externalptr>
where serverPort
contain some NA
s.
I then select and filter from this:
pd2sfdf =
pcapData2 %>%
select( LocalCt.Bin, ToTpf, RemAdr, serverPort ) %>%
filter( !(serverPort %in% c(20, 21, 23, 25, 26, 53, 69)) )
to get:
> str(pd2sfdf)
Classes ‘data.table’ and 'data.frame': 995488 obs. of 4 variables:
$ LocalCt.Bin: Factor w/ 7 levels "2023-03-14 15:25:00.003021",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ToTpf : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ RemAdr : chr "10.201.138.166" "10.201.141.167" "10.201.138.166" "10.201.136.199" ...
$ serverPort : int 60069 60009 60083 60085 60081 60081 60081 60081 60081 60081 ...
- attr(*, ".internal.selfref")=<externalptr>
which contains 6xNAs:
> pd2sfdf %>% filter(is.na(serverPort))
LocalCt.Bin ToTpf RemAdr serverPort
1: 2023-03-14 15:25:30.003021 FALSE 10.27.186.35 NA
2: 2023-03-14 15:25:30.003021 TRUE 10.27.186.35 NA
3: 2023-03-14 15:25:45.003021 FALSE 10.128.0.129 NA
4: 2023-03-14 15:25:45.003021 TRUE 10.128.0.129 NA
5: 2023-03-14 15:26:00.003021 TRUE 10.242.98.72 NA
6: 2023-03-14 15:26:00.003021 FALSE 10.242.98.72 NA
However, if I write the original data.table to an sqlite database:
dbcon <- dbConnect( RSQLite::SQLite(), dbName )
dbWriteTable( dbcon, "pcapData2", pcapData2, overwrite = TRUE )
dbDisconnect( dbcon )
And then read it with:
dbcon <- dbConnect( RSQLite::SQLite(), dbName )
pd2sdb =
tbl(dbcon, "pcapData2") %>%
select( LocalCt.Bin, ToTpf, RemAdr, serverPort ) %>%
#filter( !(serverPort %in% c(20, 21, 23, 25, 26, 53, 69)) ) %>%
collect()
Whilst it is fine WITHOUT the filter():
> str(pd2sdb)
tibble [1,000,000 × 4] (S3: tbl_df/tbl/data.frame)
$ LocalCt.Bin: chr [1:1000000] "2023-03-14 15:25:00.003021" "2023-03-14 15:25:00.003021" "2023-03-14 15:25:00.003021" "2023-03-14 15:25:00.003021" ...
$ ToTpf : int [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
$ RemAdr : chr [1:1000000] "10.201.138.166" "10.201.141.167" "10.201.138.166" "10.201.136.199" ...
$ serverPort : int [1:1000000] 60069 60009 60083 60085 60081 60081 60081 60081 60081 60081 ...
> pd2sdb %>% filter(is.na(serverPort))
# A tibble: 6 × 4
LocalCt.Bin ToTpf RemAdr serverPort
<chr> <int> <chr> <int>
1 2023-03-14 15:25:30.003021 0 10.27.186.35 NA
2 2023-03-14 15:25:30.003021 1 10.27.186.35 NA
3 2023-03-14 15:25:45.003021 0 10.128.0.129 NA
4 2023-03-14 15:25:45.003021 1 10.128.0.129 NA
5 2023-03-14 15:26:00.003021 1 10.242.98.72 NA
6 2023-03-14 15:26:00.003021 0 10.242.98.72 NA
If I add in the filter(), I get:
> str(pd2sfdb)
tibble [995,482 × 4] (S3: tbl_df/tbl/data.frame)
$ LocalCt.Bin: chr [1:995482] "2023-03-14 15:25:00.003021" "2023-03-14 15:25:00.003021" "2023-03-14 15:25:00.003021" "2023-03-14 15:25:00.003021" ...
$ ToTpf : int [1:995482] 1 1 1 1 1 1 1 1 1 1 ...
$ RemAdr : chr [1:995482] "10.201.138.166" "10.201.141.167" "10.201.138.166" "10.201.136.199" ...
$ serverPort : int [1:995482] 60069 60009 60083 60085 60081 60081 60081 60081 60081 60081 ...
where all the NAs are gone:
> pd2sfdb %>% filter(is.na(serverPort))
# A tibble: 0 × 4
# ℹ 4 variables: LocalCt.Bin <chr>, ToTpf <int>, RemAdr <chr>, serverPort <int>
The filter generates the SQL:
SELECT `LocalCt.Bin`, `ToTpf`, `RemAdr`, `serverPort`
FROM `pcapData2`
WHERE (NOT((`serverPort` IN (20.0, 21.0, 23.0, 25.0, 26.0, 53.0, 69.0))))
So I'm suspecting the WHERE as being the culprit for dropping the NAs. Yet, if I test it with
df = tibble::tribble(
~A, ~B,
1, as.integer(1),
2, as.integer(2),
3, NA_integer_,
4, NA_integer_
)
dbcon <- dbConnect( RSQLite::SQLite(), dbName )
dbWriteTable( dbcon, "test", df, overwrite = TRUE )
tbl(dbcon, "test") %>%
filter( !(A %in% c(2, 5)) ) %>%
head(5000) %T>% explain() %>%
collect()
I get
> str(x)
tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
$ A: num [1:3] 1 3 4
$ B: int [1:3] 1 NA NA
and
SELECT *
FROM `test`
WHERE (NOT((`A` IN (2.0, 5.0))))
LIMIT 5000
So, all in all, I'm very confused about where my NAs are going