In memory inflate method fails even though writing same data to disk and unzipping is successful

scott-uses-git · November 15, 2024, 12:54pm

Hey folks,

I have some compressed data in zip format, downloaded from a database as blobs (using varchar in SQL). Each blob is a zip archive containing a single file named "data.bin".

I am downloading this data in large batches (~100K), but each blob is relatively small (~100 bytes compressed, ~1000 bytes uncompressed).

I am trying to decompress the data in memory, since writing/unzipping/reading all files to/from disk is quite slow and inefficient.

However, I am not able to find any method in R that can accomplish this, despite the data being easy to unzip once written to disk.

Here is some code showing the problem; the first part runs successfully, whereas the second part does not work for any of the shown methods.

z is a raw vector of bytes containing one entry of compressed data

# Write to disk, unzip, read from disk (works)
t <- tempfile(fileext = ".zip")
writeBin(z, t)
zip::unzip(zipfile = t, "data.bin", exdir = dirname(t))
d <- file.path(dirname(t), "data.bin")
x <- readr::read_file_raw(d)

# Unzip in memory (Does not work)
zip::inflate(z)
memDecompress(z, "gzip")
memDecompress(z, "bzip2")
zlib::decompress(z)

I was using Claude AI to try to debug this and ended up with a function that dumps some useful info about the zip file header.

analyze_zip_structure <- function(bytes) {
    # Helper function to convert bytes to unsigned integer (little endian)
    bytes_to_int <- function(bytes) {
        sum(as.integer(bytes) * 256^(0:(length(bytes)-1)))
    }
    
    # Print first 50 bytes in decimal for analysis
    cat("First 50 bytes (as integers):\n")
    cat(as.integer(bytes[1:min(50, length(bytes))]), "\n\n")
    
    # Verify ZIP signature
    if (!identical(bytes[1:4], as.raw(c(0x50, 0x4B, 0x03, 0x04)))) {
        stop("Not a valid ZIP file signature")
    }
    
    # Parse key ZIP header fields
    cat("ZIP Local File Header Analysis:\n")
    cat("--------------------------------\n")
    cat("Signature (as integers):", as.integer(bytes[1:4]), "\n")
    cat("Version needed:", bytes_to_int(bytes[5:6]), "\n")
    cat("Flags (as integers):", as.integer(bytes[7:8]), "\n")
    cat("Compression method:", bytes_to_int(bytes[9:10]), "\n")
    cat("Last mod time (as integers):", as.integer(bytes[11:12]), "\n")
    cat("Last mod date (as integers):", as.integer(bytes[13:14]), "\n")
    cat("CRC-32 (as integers):", as.integer(bytes[15:18]), "\n")
    cat("Compressed size:", bytes_to_int(bytes[19:22]), "\n")
    cat("Uncompressed size:", bytes_to_int(bytes[23:26]), "\n")
    cat("Filename length:", bytes_to_int(bytes[27:28]), "\n")
    cat("Extra field length:", bytes_to_int(bytes[29:30]), "\n")
    
    # Get filename
    filename_length <- bytes_to_int(bytes[27:28])
    if (filename_length > 0) {
        filename <- rawToChar(bytes[31:(30 + filename_length)])
        cat("Filename:", filename, "\n")
    }
    
    # Calculate where compressed data should start
    header_end <- 30 + filename_length + bytes_to_int(bytes[29:30])
    cat("\nHeader ends at byte:", header_end, "\n")
    cat("Total file size:", length(bytes), "\n")
    
    return(header_end + 1)  # Return start position of compressed data
}

analyze_zip_structure(z)

First 50 bytes (as integers):
80 75 3 4 20 0 0 8 8 0 36 183 104 89 131 32 64 7 99 0 0 0 86 4 0 0 8 0 0 0 100 97 116 97 46 98 105 110 237 145 235 10 128 32 12 133 191 65 143 217 

ZIP Local File Header Analysis:
--------------------------------
Signature (as integers): 80 75 3 4 
Version needed: 20 
Flags (as integers): 0 8 
Compression method: 8 
Last mod time (as integers): 36 183 
Last mod date (as integers): 104 89 
CRC-32 (as integers): 131 32 64 7 
Compressed size: 99 
Uncompressed size: 1110 
Filename length: 8 
Extra field length: 0 
Filename: data.bin 

Header ends at byte: 38 
Total file size: 213
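
Reading the dump above: the 30-byte fixed header plus the 8-byte filename ends at byte 38, so the compressed payload should occupy bytes 39 to 137 (99 bytes). The remaining 76 bytes would then be the central directory record (46 + 8 bytes) and the end-of-central-directory record (22 bytes). Here is a minimal sketch of that arithmetic, assuming a single-entry archive whose sizes are stored in the local header. `le_int` and `zip_payload_range` are hypothetical helper names, not part of any package.

```r
# Little-endian raw bytes -> number (same idea as bytes_to_int in the post)
le_int <- function(b) sum(as.integer(b) * 256^(seq_along(b) - 1))

# Hypothetical helper: byte range of the first entry's compressed payload
zip_payload_range <- function(bytes) {
  stopifnot(identical(bytes[1:4], as.raw(c(0x50, 0x4b, 0x03, 0x04))))
  name_len  <- le_int(bytes[27:28])
  extra_len <- le_int(bytes[29:30])
  comp_size <- le_int(bytes[19:22])
  start <- 30 + name_len + extra_len + 1
  c(start = start, end = start + comp_size - 1)
}

# Local header rebuilt from the first 38 integers printed above
hdr <- c(as.raw(c(80, 75, 3, 4, 20, 0, 0, 8, 8, 0, 36, 183, 104, 89,
                  131, 32, 64, 7, 99, 0, 0, 0, 86, 4, 0, 0, 8, 0, 0, 0)),
         charToRaw("data.bin"))
rng <- zip_payload_range(hdr)
rng
```

For this header the range should come out as start = 39 and end = 137.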

Here's what I think is going on (I could be wrong).

There is a difference in the zip format depending on whether data is zipped on disk vs. in memory.
Zipping on disk adds a header and tail to the data, whereas zipping in memory does not.

The data I am getting from the database into memory includes the header and tail (i.e. it was zipped on disk).

So in the code below, I need to figure out how to unzip z2 in memory.

x <- charToRaw("this is a test string")

# Zip in memory
z1 <- zip::deflate(x)$output
# Unzip in memory
rawToChar(zip::inflate(z1)$output)

# Write to disk and zip
writeBin(x, "test.bin")
zip::zip("test.zip", "test.bin")
# Read zip binary data into memory - how to unzip z2 in memory???
z2 <- readBin("test.zip", "raw", n = 1e6)
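
One way to check the guess above is to look at the leading bytes: a zip archive starts with the local file header signature `50 4B 03 04` ("PK"), while a gzip file starts with `1F 8B`. The sketch below is only a rough classifier under that assumption. `detect_container` is a hypothetical name, and anything it does not recognise is reported as "unknown" rather than guessed.

```r
# Hypothetical helper: classify a raw vector by its leading magic bytes
detect_container <- function(bytes) {
  zip_magic  <- as.raw(c(0x50, 0x4b, 0x03, 0x04))  # "PK\003\004"
  gzip_magic <- as.raw(c(0x1f, 0x8b))
  if (length(bytes) >= 4 && identical(bytes[1:4], zip_magic)) {
    "zip"
  } else if (length(bytes) >= 2 && identical(bytes[1:2], gzip_magic)) {
    "gzip"
  } else {
    "unknown"
  }
}

detect_container(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)))
```

Applied to `z2`, this should report "zip", which would explain why functions expecting a bare compressed stream reject it.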

I would greatly appreciate any help figuring this out. I've searched all over the internet and have not found a solution.

Thanks,
Scott

Gabor · November 15, 2024, 8:22pm

Also posted at r-lib/zip on GitHub, Issue #118: "In memory inflate method fails even though writing same data to disk and unzipping is successful".

AlexisW · November 15, 2024, 8:31pm

I think it's due to the difference between zip and gzip. The functions you've been using are for gzip.

# some random data
raw_vec <- runif(100, 1, 16) |> as.raw()


# Prepare file names
name_raw <- tempfile()
name_zip <- paste0(name_raw, ".zip")
name_gzip <- paste0(name_raw, ".gz")



# save file as

# --> raw
writeBin(raw_vec, name_raw)

# --> zip
zip::zip(zipfile = name_zip, files = basename(name_raw), root = dirname(name_raw))

# --> gzip
R.utils::gzip(filename = name_raw, remove = FALSE)

list.files(dirname(name_raw))
#> [1] "file1a902b784b74"    "file1a9043d555f"     "file1a9043d555f.gz" 
#> [4] "file1a9043d555f.zip" "file1a904d362483"    "file1a90633c1d7e"



# Read raw
readBin(name_raw, what = "raw", n = 100) |> all.equal(raw_vec)
#> [1] TRUE


# Read zip
con <- unz(name_zip, filename = basename(name_raw), open = "rb") 

readBin(con, what = "raw", n = 1000) |>
  all.equal(raw_vec)
#> [1] TRUE

close(con)



# Read gzip

readBin(name_gzip, what = "raw", n = 1000) |>
  memDecompress(type = "gzip") |>
  all.equal(raw_vec)
#> [1] TRUE



con <- gzfile(name_gzip, open = "rb")
readBin(con, what = "raw", n = 1000) |>
  all.equal(raw_vec)
#> [1] TRUE

close(con)


# worked in a previous version, unsure what I'm doing wrong now
readBin(name_gzip, what = "raw", n = 1000) |>
  zip::inflate(size = 1000) |>
  all.equal(raw_vec)
#> Error in zip::inflate(readBin(name_gzip, what = "raw", n = 1000), size = 1000): Input data is ivalid

Created on 2024-11-15 with reprex v2.1.0

AlexisW · November 16, 2024, 4:23am

Also, thinking about it a bit, since you're importing from a database I think you might be able to download directly into a duckdb database, which is then easy to query from R.

Or even more directly, if it's stored as a VARCHAR it doesn't have to become a BLOB; if you can send SQL requests to that database, can't you just send SQL queries from R to get the VARCHAR data needed?

scott-uses-git · November 16, 2024, 11:33am

Thanks for linking this. I went ahead and closed the issue; I think this forum is a better place for the question.

Thanks for this response. However, in my original post, I tried both zip::inflate(z) and memDecompress(z, "gzip"), and neither of these works.

AlexisW:

I think it's due to the difference between zip and gzip. The functions you've been using are for gzip.

Maybe I miscommunicated something. The database I am querying stores this data as a character string of hex bytes. I am pulling it using dbplyr; I figured out that I have to wrap the column in varchar() to convert it to a blob, otherwise the character string gets truncated and doesn't pull all of the raw byte data. Maybe there is a better way to handle this. Regardless, the data is stored as compressed bytes, so there's no getting around the need to decompress the data after the query.

I did find a Stack Overflow thread with pretty much the same issue, but no real solution.

I was messing around with connections based on this thread to try to unzip in memory. Here's what I came up with. It still doesn't work: I can't read from the unzipped connection. I also have no idea whether this approach would even unzip in memory or whether it's writing to disk first.

closeAllConnections()

con <- gzcon(rawConnection(z))
x <- unz(con, "data.bin", open="")
readBin(x, "raw", 1e4)

Thanks,
Scott

scott-uses-git · November 16, 2024, 12:50pm

Here is an actual sample of my data:

z <- as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x00, 0x08, 0x08, 0x00, 0x2e, 0x3c, 0x31, 0x59, 0x74, 0x84, 0xc3, 0x10, 0x4f, 0x00, 0x00, 0x00, 0x56, 0x04, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x64, 0x61, 0x74, 0x61, 0x2e, 0x62, 0x69, 0x6e, 0xab, 0x8e, 0x61, 0x60, 0x60, 0x60, 0x64, 0x04, 0x11, 0x0c, 0x10, 0x00, 0xa3, 0xf1, 0xb1, 0xe9, 0xcd, 0xc7, 0xc6, 0x06, 0xd1, 0x2c, 0x40, 0xf2, 0xdf, 0x7f, 0x46, 0x06, 0x10, 0xfc, 0x03, 0xa5, 0x41, 0xf0, 0x37, 0x12, 0xfb, 0x17, 0x12, 0x1b, 0x04, 0x7f, 0xd2, 0x98, 0xff, 0x0b, 0x87, 0x3b, 0x60, 0xee, 0x03, 0xb9, 0x97, 0x9a, 0x80, 0x91, 0x48, 0x31, 0x4a, 0xcc, 0xa3, 0x36, 0x20, 0xd7, 0x0e, 0xe4, 0x74, 0x4a, 0x8c, 0x39, 0x8c, 0x0c, 0xf4, 0xf1, 0xcf, 0x28, 0x40, 0x05, 0x00, 0x50, 0x4b, 0x01, 0x02, 0x33, 0x00, 0x14, 0x00, 0x00, 0x08, 0x08, 0x00, 0x2e, 0x3c, 0x31, 0x59, 0x74, 0x84, 0xc3, 0x10, 0x4f, 0x00, 0x00, 0x00, 0x56, 0x04, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x64, 0x61, 0x74, 0x61, 0x2e, 0x62, 0x69, 0x6e, 0x50, 0x4b, 0x05, 0x06, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x36, 0x00, 0x00, 0x00, 0x75, 0x00, 0x00, 0x00, 0x00, 0x00))

writeBin(z, "my-blob-data.zip")
identical(z, readBin("my-blob-data.zip", "raw", 1e4))

zip::unzip("my-blob-data.zip")
x <- readBin("data.bin", "raw", 1e4)

And, to clarify, here's what the data looks like as it's stored in the database

"504B03041400000808002E3C31597484C3104F0000005604000008000000646174612E62696EAB8E616060606404110C1000A3F1B1E9CDC7C606D12C40F2DF7F460610FC03A541F03712FB17121B047FD298FF0B873B60EE03B9979A809148314ACCA33620D70EE4744A8C398C0CF4F1CF28400500504B010233001400000808002E3C31597484C3104F00000056040000080000000000000000000000000000000000646174612E62696E504B0506000000000100010036000000750000000000"

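Since the database holds the archive as a string of hex digits, one option is to convert that string to a raw vector directly in R. A minimal sketch, assuming the string contains only hex digit pairs with no prefix or separators; `hex_to_raw` is a hypothetical helper name.

```r
# Hypothetical helper: "504B0304..." -> raw vector, two hex digits per byte
hex_to_raw <- function(hex) {
  stopifnot(nchar(hex) %% 2 == 0)
  starts <- seq(1, nchar(hex), by = 2)
  as.raw(strtoi(substring(hex, starts, starts + 1), base = 16L))
}

# The first four bytes of the stored string are the zip signature
sig <- hex_to_raw("504B0304")
sig
```

For a batch of ~100K strings this could be applied with `lapply()`, giving a list of raw vectors, one per blob.
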
scott-uses-git · November 19, 2024, 4:22pm

Would it be possible to edit the header and footer of the compressed data so that it is compatible with standard decompression methods? Or is the compression format completely incompatible?

Here was my attempt. It doesn't work; I'm pretty far out of my depth here trying to edit raw byte streams to work with decompression algorithms.

standardize_zip_bytes <- function(compressed_bytes) {
  # Extract CRC32 from original ZIP header (bytes 15-18)
  crc32 <- compressed_bytes[15:18]
  
  # Extract uncompressed length from original ZIP header (bytes 23-26)
  uncompressed_length_bytes <- compressed_bytes[23:26]
  
  # Standard gzip header (1F 8B 08 00 followed by 4 bytes timestamp)
  gzip_header <- as.raw(c(0x1F, 0x8B, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00))
  
  # Extract the compressed data
  compressed_data <- compressed_bytes[38:length(compressed_bytes)]
  
  # Combine header, compressed data, and footer with extracted CRC and length
  standardized_bytes <- c(gzip_header, compressed_data, crc32, uncompressed_length_bytes)
  
  return(standardized_bytes)
}

g <- standardize_zip_bytes(z) 

writeBin(g, "gzip-test.gz")
zlib::validate_gzip_file("gzip-test.gz")

memDecompress(g, "gzip")
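
For what it's worth, three details in the attempt above look like plausible reasons for the failure: a gzip header is 10 bytes (magic, method, flags, 4-byte timestamp, extra flags, OS) rather than 8; the payload starts at byte 39, not 38; and the slice runs to the end of the archive, so it also picks up the central directory. Below is a minimal sketch of the same idea with those points changed. It assumes a single entry compressed with deflate (method 8) whose CRC and sizes are stored in the local header (flag bit 3 not set), as in the sample. `le_int`, `zip_entry_to_gzip` and `inflate_zip_entry` are hypothetical names, and reading through `gzcon(rawConnection(...))` is one choice among several.

```r
le_int <- function(b) sum(as.integer(b) * 256^(seq_along(b) - 1))

# Hypothetical helper: re-wrap the first zip entry as a gzip member (RFC 1952)
zip_entry_to_gzip <- function(zip_bytes) {
  stopifnot(identical(zip_bytes[1:4], as.raw(c(0x50, 0x4b, 0x03, 0x04))))
  flags  <- as.integer(le_int(zip_bytes[7:8]))
  method <- le_int(zip_bytes[9:10])
  # Assumptions: deflate (method 8), and CRC/sizes present in the local
  # header (general-purpose flag bit 3, the data descriptor bit, not set)
  stopifnot(method == 8, bitwAnd(flags, 8L) == 0L)

  start <- 30 + le_int(zip_bytes[27:28]) + le_int(zip_bytes[29:30]) + 1
  end   <- start + le_int(zip_bytes[19:22]) - 1
  stopifnot(length(zip_bytes) >= end)

  gzip_header <- as.raw(c(0x1f, 0x8b, 0x08, 0x00,  # magic, CM = deflate, FLG
                          0x00, 0x00, 0x00, 0x00,  # MTIME (unset)
                          0x00, 0xff))             # XFL, OS = unknown
  c(gzip_header,
    zip_bytes[start:end],   # raw deflate payload only
    zip_bytes[15:18],       # CRC-32, already little-endian
    zip_bytes[23:26])       # uncompressed size (ISIZE), little-endian
}

# Hypothetical helper: decompress the re-wrapped entry without touching disk
inflate_zip_entry <- function(zip_bytes) {
  n <- le_int(zip_bytes[23:26])  # uncompressed size from the local header
  con <- gzcon(rawConnection(zip_entry_to_gzip(zip_bytes)))
  on.exit(close(con))
  readBin(con, what = "raw", n = n)
}
```

With `z` as defined in the earlier sample, `x <- inflate_zip_entry(z)` should return a raw vector of length 1110, which can be compared against the `data.bin` read from disk using `identical()`.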

Gabor · November 19, 2024, 4:57pm

My understanding is that it is a completely different format.
