I was curious whether I had the same problem, so I finally found some time to exhume an old Perl script of mine called double-doc, which finds files that have the same md5sum. I ran it on 4 DW instances:
- one animal of a farm (internal, i.e. not visible from the internet)
- one regular (internal)
- one regular on my Raspberry Pi (open on the internet, main site, closed wiki, very few edits, will probably be replaced by the 4th one at some point)
- one on my laptop with the PHP built-in web server
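double-doc itself is not shown here. For anyone who wants to run a similar check, a rough equivalent can be sketched in shell. This assumes GNU md5sum and file names without newlines or backslashes; the function name list_dups is made up for illustration. It prints each shared digest followed by the paths that have it, without the permission, size and date columns that double-doc shows:

```shell
#!/bin/sh
# list_dups DIR: print groups of files under DIR that share an MD5 digest.
# Hypothetical helper, not the author's double-doc script.
list_dups() {
    find "$1" -type f -exec md5sum {} + |
    sort |
    awk '
        {
            sum = $1
            # md5sum prints "<digest>  <path>"; skip the digest and the 2-char separator
            path = substr($0, length(sum) + 3)
            count[sum]++
            files[sum] = files[sum] "  " path "\n"
        }
        END {
            # report only digests seen more than once
            for (s in count)
                if (count[s] > 1)
                    printf "%s\n%s", s, files[s]
        }
    '
}

# Example call (the directory name is an assumption):
# list_dups cache
```

The order of the groups is unspecified, because awk does not guarantee any iteration order over an array.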
Each of them has duplicate files in the cache directory: sometimes just groups of 2, sometimes groups of tens of files (captcha plugin?). They all use quite a lot of plugins, more or less the same ones, but not always.
My wikis are smaller (the biggest has 913 pages) and all have very few visits (being internal docs or my own notepad), so the duplicate file problem does not reach such proportions and is not an issue for me.
If you want to investigate, I'll be able to answer your 4 questions precisely, but probably not before next Tuesday or Wednesday. For now I'm just providing a few examples.
43623edb144015b47fabbf0b9e5e3b66
0644 1 33 33 10 Sun Apr 30 07:07:08 2023 cache/0/089d36cd5b707f9796a62bbdd2f5f517.metadata
0644 1 33 33 10 Sun Apr 30 07:07:08 2023 cache/3/34cd6680a52bee576c70a32764e3b674.metadata
0644 1 33 33 10 Sun Apr 30 07:07:08 2023 cache/1/163d5ffdca723c9e49ac16f07f164949.metadata
0644 1 33 33 10 Sun Apr 30 07:07:08 2023 cache/7/73eee45a4c85198a8a1ec57c71a7a360.metadata
0644 1 33 33 10 Sun Apr 30 07:07:08 2023 cache/9/9c3aac4da088d1ce063b9c44b9822777.metadata
0644 1 33 33 10 Sun Apr 30 07:07:08 2023 cache/8/8ebb48779373545c9bdac73134d15e4d.metadata
0644 1 33 33 10 Sun Apr 30 07:07:08 2023 cache/6/677c25fbf2ad81ba1f1340aa92572389.metadata
Each of these files contains nothing but a timestamp, in this case 1682831228, and GNU date shows that this is exactly the modification time of the .metadata files:
date --date=@1682831228
dim. 30 avril 2023 07:07:08 CEST
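The same comparison can be scripted. A minimal sketch, assuming GNU stat and a file that holds nothing but the epoch value; check_stamp is a hypothetical helper name:

```shell
#!/bin/sh
# check_stamp FILE: compare the epoch value stored in FILE with its mtime.
# Assumes GNU stat (-c %Y prints the mtime as seconds since the epoch).
check_stamp() {
    stored=$(cat "$1")          # file content, e.g. 1682831228
    mtime=$(stat -c %Y "$1")    # modification time of the file itself
    if [ "$stored" = "$mtime" ]; then
        echo "match: $stored"
    else
        echo "differ: stored=$stored mtime=$mtime"
    fi
}

# Example call (path taken from the listing above):
# check_stamp cache/0/089d36cd5b707f9796a62bbdd2f5f517.metadata
```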
Some .js files:
3d6579bd80954b7c5284d2f1a4a6270c
0644 1 33 33 2048357 Sat May 6 20:51:53 2023 cache/b/bac012b4df145d11cabcdcb98db90f4c.js
0644 1 33 33 2048357 Thu Apr 27 15:51:25 2023 cache/4/497960298518bc0c9e19f53efc493120.js
0644 1 33 33 2048357 Sat May 6 20:51:21 2023 cache/6/641d4e4af3271a9446e125ffed32cfc4.js
Also some .feed, some .i, some .xhtml, and some .captchaip for the one facing the internet:
6dbe7af851d3a50aafcc3ed1b8f123bf
0644 1 33 33 36 Thu Feb 16 20:44:03 2023 ./8/83ccb05fa4ec9a2fde716196c824d13c.captchaip
0644 1 33 33 36 Wed Mar 8 20:59:37 2023 ./e/eb9219def69db464d365bee1eff25b5e.captchaip
0644 1 33 33 36 Tue Feb 28 21:07:31 2023 ./e/eb9f3590d08abd24db1228fd46cf0d07.captchaip
0644 1 33 33 36 Wed Feb 15 20:52:47 2023 ./2/209dd854c4d44307d9d34c210b204ad4.captchaip
0644 1 33 33 36 Wed Feb 1 06:37:15 2023 ./2/2a7e40663db062b25415dc3731dc25d2.captchaip
0644 1 33 33 36 Thu Feb 16 20:43:53 2023 ./4/4e9fdcb098cab8b39862773d769adea3.captchaip
95dcb1ae8aad1d0353de795aac6fb1db
0644 1 33 33 249350 Mon Sep 14 22:20:11 2020 ./7/7d08683edd1bbf776af23177e2412ef9.js.gz
0644 1 33 33 249350 Fri May 22 11:11:46 2020 ./8/8814619c8806b38a8ace218de8bfd086.js.gz
0644 1 33 33 249350 Sun Jun 14 07:43:51 2020 ./8/8ad7246e5869e6f7e527268181f55e57.js.gz
0644 1 33 33 249350 Fri May 22 11:13:52 2020 ./f/f550e36b81ffd2cf6f0449bfbd34bcf1.js.gz
b59c67bf196a4758191e42f76670ceba
0644 1 33 33 4 Tue Mar 22 23:47:08 2022 ./3/39010e8dab4aa3d05f7465bcc1962848.captchaip
0644 1 33 33 4 Sun Jun 12 03:14:14 2022 ./3/312ef59792ad3d927c8e81c3ec4bf579.captchaip
0644 1 33 33 4 Fri Apr 22 03:35:25 2022 ./3/34144732cc93c28c643a4fc3c2ba7d46.captchaip
0644 1 33 33 4 Mon Aug 29 14:00:31 2022 ./3/3ebfd785b54ffa6b51b01157a4510a97.captchaip
0644 1 33 33 4 Sat Apr 9 21:03:24 2022 ./3/3657e8ae1d5ba2dfee26023a9cb79274.captchaip
0644 1 33 33 4 Wed Dec 7 21:56:59 2022 ./3/3187df273069fdde58e2aa4acda5d97d.captchaip
0644 1 33 33 4 Tue Jun 7 21:42:00 2022 ./3/34cdf6cfd7fc231623695eb64f187c8d.captchaip
0644 1 33 33 4 Fri Jun 3 00:01:13 2022 ./3/3e4796fa27f4d5d3b668405507265d10.captchaip
0644 1 33 33 4 Thu Jun 9 21:35:43 2022 ./3/3b47c4d12fa3c6f13e965b9f40645e27.captchaip
0644 1 33 33 4 Mon Jun 20 06:01:44 2022 ./3/3ceeea1f729cf4285973e0ae87cdae95.captchaip
0644 1 33 33 4 Tue Jun 21 20:36:18 2022 ./3/3dfa4cc08427288d20544e2fa4141ce6.captchaip
and so on...
In this case I ran
double-doc -L cache |
awk 'NF > 1 {
    # keep what follows the first dot of the path (last field)
    x = $NF; sub(/[^.]*\./, "", x)
    freq[x]++
}
END {
    for (ind in freq) { printf "%25s %s\n", ind, freq[ind] }
}
' | sort -b -k 2,2nr -k 1,1
and the result was
captchaip 2219
i 260
xhtml 219
js 32
js.gz 32
css 13
css.gz 13
repo 9
media.90x71.png 6
media.110x94.crop.png 4
media.120x50.png 4
media.120x55.png 4
media.120x58.png 4
media.120x73.png 4
media.120x95.png 4
media.90x37.png 4
media.90x55.png 4
media.90x77.png 4
media.1032x435.crop.png 2
media.1032x438.crop.png 2
media.1053x503.crop.png 2
media.1053x508.crop.png 2
media.108x120.png 2
media.119x37.png 2
media.119x53.png 2
media.119x95.png 2
media.120x120.png 2
media.120x15.png 2
media.120x16.png 2
media.120x21.png 2
media.120x23.png 2
media.120x31.png 2
media.120x47.png 2
media.120x49.png 2
media.120x51.png 2
media.120x52.png 2
media.120x61.png 2
media.120x77.png 2
media.120x85.png 2
media.120x89.png 2
media.1858x578.crop.png 2
media.1868x767.crop.png 2
media.1868x778.crop.png 2
media.1872x561.crop.png 2
media.1874x853.crop.png 2
media.1874x858.crop.png 2
media.1878x354.crop.png 2
media.1878x359.crop.png 2
media.1888x230.crop.png 2
media.1888x236.crop.png 2
media.1888x330.crop.png 2
media.1888x335.crop.png 2
media.1892x1340.crop.png 2
media.1898x1516.crop.png 2
media.1900x253.crop.png 2
media.2098x839.crop.png 2
media.2198x1392.crop.png 2
media.381x520.crop.png 2
media.609x446.crop.png 2
media.659x302.crop.png 2
media.659x303.crop.png 2
media.664x287.crop.png 2
media.66x90.png 2
media.714x277.crop.png 2
media.714x279.crop.png 2
media.81x90.png 2
media.829x414.crop.png 2
media.895x995.crop.png 2
media.89x120.png 2
media.89x41.png 2
media.90x11.png 2
media.90x12.png 2
media.90x16.png 2
media.90x17.png 2
media.90x23.png 2
media.90x27.png 2
media.90x35.png 2
media.90x36.png 2
media.90x38.png 2
media.90x39.png 2
media.90x40.png 2
media.90x41.png 2
media.90x43.png 2
media.90x44.png 2
media.90x45.png 2
media.90x57.png 2
media.90x64.png 2
media.90x66.png 2
media.90x90.png 2
media.935x238.crop.png 2
media.935x241.crop.png 2
media.940x417.crop.png 2
media.940x418.crop.png 2
media.948x576.crop.png 2
media.960x463.crop.png 2
media.969x398.crop.png 2
media.974x592.crop.png 2
media.974x595.crop.png 2
metadata 2
cache/batchedit/_prune 1
captcha 1