data.table::fwrite

I have a file that is 940,000 rows and 47 variables.

The file was 330MB to start the day. Later today I noticed it grew to over 900MB.

I grew worried, so I had a look, but didn't notice anything obvious. I have since opened it with

data.table::fread and saved it with data.table::fwrite over and over. The file first grew to 1.7GB, then to 3GB, then to 6GB, and I don't know why.

The version of RStudio I have is:

RStudio 2023.06.2+561 "Mountain Hydrangea" Release (de44a3118f7963972e24a78b7a1ad48b4be8a217, 2023-08-22) for macOS
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) RStudio/2023.06.2+561 Chrome/110.0.5481.208 Electron/23.3.0 Safari/537.36

I am running macOS 11.5.2.

Cheers,

Investigate whether it's any particular column(s):

library(lobstr)
library(tidyverse)

mem_my_table <- function(x){
  xn <- names(x)
  names(xn) <- xn                  # a named vector, so map() keeps the column names
  map(xn, \(v) obj_size(x[[v]]))   # in-memory size of each column
}

mem_my_table(ChickWeight)

Thank you nirgrahamuk,

When I open the 900MB file and save it, it goes to 1.7GB.

I ran your function on both, thank you. There is only one variable that differs: r_name.

I summed the in-memory sizes across the variables: the 900MB file came to 403MB, the 1.7GB file to 470MB.

Cheers,

The next time I opened and saved it, the file went to 3GB, the sum went to 600MB, and r_name alone went to 279MB.

I have no idea why.

Cheers,

Investigate the contents of r_name; what's in there?
Numeric? (unlikely?)
Strings? Factors? Some exotic type?

I might pull both versions of the column into memory and try waldo::compare().

You might also learn something by reading and writing files with only the r_name column (for ease) and comparing the raw text content of the csv between the two versions.
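
A minimal sketch of that comparison, assuming the two saved versions are still on disk (the file names here are placeholders):

library(data.table)

v1 <- fread("h_td_900mb.csv", select = "r_name")   # the 900MB version
v2 <- fread("h_td_1700mb.csv", select = "r_name")  # the 1.7GB version

# Show the first few differences between the two columns
waldo::compare(v1$r_name, v2$r_name, max_diffs = 10)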

r_name is a character column. I have just checked: 1,285 rows have an r_name longer than 500,000 characters. But how does that happen, and how does it keep multiplying when you save the file?
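
A check like that, assuming the table is called h_td, might look like:

# Count rows whose r_name has ballooned past 500,000 characters
sum(nchar(h_td$r_name) > 500000, na.rm = TRUE)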

[Screenshot comparison of the two versions: Screen Shot 2024-02-24 at 1.06.22 am]

Unbelievable, a bit of a nightmare. I tried it again and the file size kept increasing. This time it was a different variable: tr_info.

It had over 4 million characters in one field by the time I had opened and saved it four more times. I don't know how it happens or why it multiplies. I am about to go to bed, but I have filtered those rows out and am back at 284MB for the file. I will see if it multiplies tomorrow.

Cheers,

This is becoming bizarre.

What does the original .csv look like if you open it in a text editor?

What happens if you subset the data, say to maybe 20,000 rows?

Next question: if the data is not confidential, can you park it, or a subset of it, somewhere like Dropbox or MediaFire so that someone here can see if they can duplicate the problem?

Thanks for the response.

Yes, I can do all that. The file looks like the following inside an editor.

I have a 100MB file that I can upload somewhere. It seems to add/double the number of quotation marks inside fields every time I open and save it.

I've been using the following to open it:

h_td <- data.table::fread("/Users/Update DB/Active/h_td.csv")

and to save it, the following (so I have a saveRDS version too):

saveRDS(a, file = "/Users/Update DB/h_td")
data.table::fwrite(a, "/Users/Update DB/h_td.csv", row.names = FALSE)
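
For context, these are the quote-handling settings in play on that round trip, spelled out explicitly (same paths as above; all of these are the documented defaults):

h_td <- data.table::fread("/Users/Update DB/Active/h_td.csv", quote = "\"")

data.table::fwrite(a, "/Users/Update DB/h_td.csv",
                   quote = "auto",      # quote fields containing sep, quote, or newline
                   qmethod = "double")  # escape an embedded " by writing ""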

The RDS file contains the following.

[Screenshot: Screen Shot 2024-02-24 at 7.46.19 am]

I switched to data.table::fwrite and fread a few weeks ago; before that I was using write.csv and read.csv.

Please tell me if anyone would like a copy.

Cheers,

The original file would have had the following in that field. At some point the quotation marks started multiplying; I don't know when. Maybe from the beginning?

Cheers,

An easy fix for me now, without knowing why the double quotation marks keep doubling, is to do a str_replace_all on r_name and tr_info.
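
A minimal sketch of that workaround, assuming the table is called a and the stray double quotes can simply be dropped:

library(stringr)

# Strip all double quotation marks from the two affected columns
a$r_name  <- str_replace_all(a$r_name,  fixed('"'), "")
a$tr_info <- str_replace_all(a$tr_info, fixed('"'), "")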

Single quotation marks don't seem to double.

Cheers,

I'll take one. I'll send you an email address, but a better idea is to upload it to a file-sharing site so that anyone can have a look at it.

The file looks like the following inside an editor.

I got the file you sent. How did you create it? It has the runs of """"""" in r_name that I do not see in your screenshot.

Did this data come from your original .csv file?

I can confirm the doubling in size (sizes in MB):

Original: 94.8
First save: 184.9
Second save: 364.9

Initial data.table object size: 85,976,800 bytes.

If I read in the data with read.csv() I get an object size of 46,589,104 bytes, and no increase in file size when I save it.

The data.table object is 1.845 times larger than the data.frame.

If I do the same with another reasonably sized .csv file I get a negligible difference in size.

There is something very wrong with that file, but it's far beyond my ability to figure it out.

How was the original .csv created?

The data came from the original file.

I reduced the number of rows in the file, essentially filtering out the r_names with the longest nchar. If you look above at the screenshot I provided, there is one in the middle: Seppelt ##################.

I also took a screenshot of the race name; it's Seppelt "The Drives". From there, every time I have bound new rows to it, it has been doubling the number of double quotation marks until they dominated the file size.

I take it this happens with double quotation marks whenever you save, whether with data.table, RDS, or write.csv? They just double?

Ah, I see. I only saved as .csv, but I am getting the doubling.

If you want to really confuse yourself try this:

DT <- fread("long_char.csv", drop = "r_name")  # read everything except the r_name column
object.size(DT)                                # in-memory size without it
fwrite(DT, "long2.csv")                        # write it straight back out

The .csv file size drops to 12.5 MB.

My best guess is that you'll need to come up with a strategy to pre-process the data as it currently exists, because some character is messing up how it is being read and/or written.
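
One possible, deliberately blunt, pre-processing pass on the raw text, assuming you've decided the runs of quotes are garbage rather than legitimate CSV escaping (file names are placeholders):

raw <- readLines("h_td.csv")
cleaned <- gsub('"{2,}', '"', raw)   # collapse any run of quotes down to one
writeLines(cleaned, "h_td_clean.csv")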

A few other things to consider/investigate:

  • Are there backticks ` in the file?
  • Are there mismatched single quotes that look like double quotes?
  • Are there mismatched double quotes?
  • Are there mismatched commas?
  • Have you tried recreating the phenomenon using a csv with one row of data? Or even one row and one column of data (if it's one column in particular causing the issue)? Exploring that may give you insight into where things are happening; see the sketch after this list.
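
In that spirit, here is a one-row sketch of a mechanism that would produce exactly this kind of doubling. It assumes the quote un-escaping is skipped on read (forced here with quote = ""); whether that is what is actually happening with this file is a guess:

library(data.table)

x <- data.table(r_name = 'SEPPELT "THE DRIVES" HANDICAP')
fwrite(x, "step1.csv")    # written as "SEPPELT ""THE DRIVES"" HANDICAP"

# If the escaped quotes are ever read back literally instead of being
# un-doubled...
y <- fread("step1.csv", quote = "")

fwrite(y, "step2.csv")    # ...every quote is doubled again on the next write
readLines("step2.csv")    # the run of quotation marks has grown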

I am pretty sure it is the r_name column that is the villain (see my last post), but I currently only have a rather under-powered laptop, which makes investigating the problem close to impossible.

The key may be how the .csv file was generated.

It started out like the following and multiplied.

I just recreated the file without removing the double quotations. I now make sure I remove them.

structure(list(track_n = c("Main", "Main", "Main", "Main", "Main", 
"Main", "Main", "Main", "Main"), tr_ty = c("Turf", "Turf", "Turf", 
"Turf", "Turf", "Turf", "Turf", "Turf", "Turf"), r_tr_cond = c("Good 4", 
"Good 4", "Good 4", "Good 4", "Good 4", "Good 4", "Good 4", "Good 4", 
"Good 4"), r_sdist = c(600L, 600L, 600L, 600L, 600L, 600L, 600L, 
600L, 600L), r_tm = c("Electronic", "Electronic", "Electronic", 
"Electronic", "Electronic", "Electronic", "Electronic", "Electronic", 
"Electronic"), r_no = 1:9, date = structure(c(17901, 17901, 17901, 
17901, 17901, 17901, 17901, 17901, 17901), class = "Date"), br_tr_race = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), venue = c("Ascot", "Ascot", 
"Ascot", "Ascot", "Ascot", "Ascot", "Ascot", "Ascot", "Ascot"
), state = c("WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", 
"WA"), r_name = c("CROWN PERTH PLATE", "TABTOUCH - BETTER YOUR BET HANDICAP", 
"MUMM CHAMPAGNE HANDICAP", "AMELIA PARK HANDICAP", "FURPHY-SUMMER SCORCHER", 
"MRS MAC'S-LA TRICE CLASSIC", "ASCEND SALES HANDICAP", "TABTOUCH-PERTH CUP", 
"SEPPELT \"THE DRIVES\" HANDICAP"), m_type = c("TAB", "TAB", 
"TAB", "TAB", "TAB", "TAB", "TAB", "TAB", "TAB"), m_type_l = c("Metro", 
"Metro", "Metro", "Metro", "Metro", "Metro", "Metro", "Metro", 
"Metro"), pm = c(70000, 70000, 70000, 70000, 1e+05, 150000, 70000, 
5e+05, 50000), r_dist = c(1000L, 1600L, 1600L, 2200L, 1000L, 
1800L, 1100L, 2400L, 1200L), r_age = c("2", "3", "3U", "3U", 
"3U", "3U", "3U", "3U", "3U"), wgt_cond = c("SWP", "HCP", "HCP", 
"HCP", "QTY", "SWP", "HCP", "QTY", "HCP"), r_class = c("~", "~", 
"BM72", "BM66", "LISTED", "GRP3", "BM66", "GRP2", "1MW"), r_hm = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_), r_time = c(58.96, 
96.35, 96.58, 138.5, 57.74, 110.7, 64.97, 148.13, 70.85), r_stime = c(34.56, 
35.61, 35.42, 36.27, 33.09, 35.13, 34.8, 36.52, 35.16), r_id = 5134154:5134162, 
    time = c("12:08PM", "12:46PM", "1:26PM", "2:06PM", "2:43PM", 
    "3:18PM", "3:55PM", "4:35PM", "5:15PM"), m_id = c(4018099L, 
    4018099L, 4018099L, 4018099L, 4018099L, 4018099L, 4018099L, 
    4018099L, 4018099L), day = c("Saturday", "Saturday", "Saturday", 
    "Saturday", "Saturday", "Saturday", "Saturday", "Saturday", 
    "Saturday"), rail = c("Out 5 metres.", "Out 5 metres.", "Out 5 metres.", 
    "Out 5 metres.", "Out 5 metres.", "Out 5 metres.", "Out 5 metres.", 
    "Out 5 metres.", "Out 5 metres."), tr_type = c("Turf", "Turf", 
    "Turf", "Turf", "Turf", "Turf", "Turf", "Turf", "Turf"), 
    tr_cond = c("Good", "Good", "Good", "Good", "Good", "Good", 
    "Good", "Good", "Good"), weather = c("Fine", "Fine", "Fine", 
    "Fine", "Fine", "Fine", "Fine", "Fine", "Fine"), penetrometer = c(6.4, 
    6.4, 6.4, 6.4, 6.4, 6.4, 6.4, 6.4, 6.4), tr_info = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), dual_tr = c("N", "N", "N", "N", "N", "N", "N", "N", "N"
    )), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))
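
For reference, that last r_name value is exactly the kind of field CSV escapes by doubling the embedded quotes, and a healthy round trip should give the original string back. A quick sanity check, as a sketch:

library(data.table)

one_row <- data.table(r_name = 'SEPPELT "THE DRIVES" HANDICAP')
fwrite(one_row, "one_row.csv")

# On disk the field should read: "SEPPELT ""THE DRIVES"" HANDICAP"
readLines("one_row.csv")

# A healthy round trip returns the original string unchanged
identical(fread("one_row.csv")$r_name, one_row$r_name)  # expected TRUE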

Yeah, when I was provided with the mem_my_table() function above, I added up all the column sizes and they didn't add up to the size of the file. But I was opening, adding to, and saving this file 4-7 times a week, and this weekend it just got out of control with how many double quotation marks must have been in it.

Yes, I do not have any idea what is happening. But to return to an earlier question: "How was that file created?"

My guess is that there is something wonky in how it is generated, but I have no idea what. Is it from a database, some kind of web-scraping, or what?