If you open the docx with a decompression program such as 7-Zip (yes, a .docx file is just a zipped folder of XML files), you can pinpoint the exact source of the difference. In your example document, you will find that all the files are identical except for docProps/core.xml. If you open that XML file, it will look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<cp:coreProperties xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties">
<dc:title>Test Reproducibility</dc:title>
<dc:creator>My Name</dc:creator>
<cp:keywords/>
<dcterms:created xsi:type="dcterms:W3CDTF">2020-12-18T02:03:14Z</dcterms:created>
<dcterms:modified xsi:type="dcterms:W3CDTF">2020-12-18T02:03:14Z</dcterms:modified>
</cp:coreProperties>
If you compare the core.xml of test and test1, you will see that the only difference is in the creation and modification dates. These are not the same as the dates in the filesystem: if you use touch (on a UNIX system), you change the modification date at the filesystem level, but not the one saved inside the file. The only way around this would be for pandoc to provide an option to lie about that date, but I find it doubtful, and a quick search didn't show me anything.
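To see the dates stored inside the file without unzipping it by hand, something like the following base R sketch might help. The helper names read_core_xml and extract_core_dates are made up for this illustration, the paths are placeholders, and the regular expression is a shortcut that assumes core.xml looks like the example above (a real XML parser would be more robust).

```r
# Read docProps/core.xml straight out of the .docx archive (no manual unzip).
# `docx_path` is a placeholder for the path to your own file.
read_core_xml <- function(docx_path) {
  con <- unz(docx_path, "docProps/core.xml")
  on.exit(close(con))
  paste(readLines(con, warn = FALSE), collapse = "\n")
}

# Pull the text of the dcterms:created and dcterms:modified elements.
# Returns NA for an element that is not found.
extract_core_dates <- function(xml_text) {
  get_one <- function(tag) {
    pattern <- sprintf("<dcterms:%s[^>]*>([^<]*)</dcterms:%s>", tag, tag)
    hit <- regmatches(xml_text, regexec(pattern, xml_text))[[1]]
    if (length(hit) == 2) hit[2] else NA_character_
  }
  c(created = get_one("created"), modified = get_one("modified"))
}

# Usage (paths are placeholders):
# extract_core_dates(read_core_xml("path/to/test.docx"))
# extract_core_dates(read_core_xml("path/to/test1.docx"))
```

Running it on both files should show the embedded timestamps side by side, independent of whatever the filesystem reports.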
So, if you only want to compare a few files, I would suggest opening them in Word and running its Compare feature (comparing two versions of a document), which should tell you "no difference". If you need to automate it, you might use some bash magic to unzip the files and compute the md5 of everything except core.xml.
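The unzip-and-hash idea does not have to be done in bash. Here is a rough sketch of it in base R instead, using utils::unzip and tools::md5sum. The function names docx_part_hashes and differing_parts are invented for this example, and the paths are placeholders.

```r
# MD5 of every file inside a .docx, keyed by its path within the archive.
docx_part_hashes <- function(docx_path) {
  dir <- tempfile("docx_")
  on.exit(unlink(dir, recursive = TRUE))
  utils::unzip(docx_path, exdir = dir)
  # all.files = TRUE so that dot-files such as _rels/.rels are included
  parts <- sort(list.files(dir, recursive = TRUE, all.files = TRUE))
  hashes <- tools::md5sum(file.path(dir, parts))
  stats::setNames(unname(hashes), parts)
}

# Names of the parts whose hashes differ, or that exist in only one file.
# docProps/core.xml is skipped by default because it holds the timestamps.
differing_parts <- function(h1, h2, ignore = "docProps/core.xml") {
  parts <- setdiff(union(names(h1), names(h2)), ignore)
  same <- !is.na(h1[parts]) & !is.na(h2[parts]) & h1[parts] == h2[parts]
  parts[!same]
}

# Usage (paths are placeholders):
# differing_parts(docx_part_hashes("path/to/test.docx"),
#                 docx_part_hashes("path/to/test1.docx"))
```

An empty result means every part other than core.xml matched. Passing ignore = character(0) compares everything, which is a quick way to confirm that core.xml is the only part that changes.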
There should also be a way to do that comparison programmatically with COM or .NET. Or you can do it in R with officer and waldo:
file1 <- officer::read_docx("path/to/test.docx")
file2 <- officer::read_docx("path/to/test1.docx")
waldo::compare(file1, file2)
#> old$package_dir vs new$package_dir
#> - "PATH\\AppData\\Local\\Temp\\Rtmp6ZSKdx\\file5c74177821d4"
#> + "PATH\\AppData\\Local\\Temp\\Rtmp6ZSKdx\\file5c74714776c4"
#>
#> old$doc_properties$data | new$doc_properties$data
#> [16] "Test Reproducibility" | "Test Reproducibility" [16]
#> [17] "My Name" | "My Name" [17]
#> [18] "" | "" [18]
#> [19] "2020-12-18T02:03:14Z" - "2020-12-18T02:02:54Z" [19]
#> [20] "2020-12-18T02:03:14Z" - "2020-12-18T02:02:54Z" [20]
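# Drop rows 4 and 5 of the document properties (the created and modified dates)
file1$doc_properties$data <- file1$doc_properties$data[-c(4, 5), ]
file2$doc_properties$data <- file2$doc_properties$data[-c(4, 5), ]
# Drop the temporary extraction directory, which is different on every read
file1$package_dir <- NULL
file2$package_dir <- NULL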
waldo::compare(file1, file2)
#> v No differences
Created on 2020-12-17 by the reprex package (v0.3.0)