Tell HN: Deduplicating a 10.4 TiB game preservation archive (WIP)

30 points by DrFrugal 7 days ago | 53 comments

Hi folks,

I am working on a game preservation project, where the data set holds 10.4 TiB. It contains 1044 earlier versions of a single game in a multitude of different languages, architectures and stages of development. As you can guess, that means extreme redundancy.

The goals are: - bring the size down - retain good read speed (for further processing/reversing) - easy sharable format - lower end machines can use it

My choice fell on the BTRFS filesystem, since it provides advanced features for deduplication, which is not as resource hungry as ZFS. Once the data is processed, it no longer requires a lot of system resources.

In the first round of deduplication, I used "jdupes -rQL" (yes, I know what -Q does) to replace exact copies of files in different directories via hardlinks to minimize data and metadata. This got it down to roughly 874 GiB already, out of which 866 GiB are MPQ files. That's 99,08%... everything besides is a drop in the bucket.

For those uninitiated: this is an archive format. Representing it as a pseudo-code struct it looks something like this { header, files[], hash_table[], block_table[] } Compression exists, but it is applied to each file individually. This means the same file is compressed the same way in different MPQ archives, no matter the offset it happens to be in.

What is throwing a wrench into my plans of further data deduplication are the following points: - the order of files seems not to be deterministic when MPQ files were created (at least I picked that up somewhere) - altered order of elements (files added or removed at the start) causes shifts in file offsets

I thought for quite some time about this, and I think the smartest way forward is, that I manually hack apart the file into multiple extents at specific offsets. Thus the file would contain of an extent for: - the header - each file individually - the hash table - the block table It will increase the size for each file of course, because of wasted space at the end of the last block in each extent. But it allows for sharing whole extents between different archives (and extracted files of it), as long as the file within is content-wise the same, no matter the exact offset. The second round of deduplication will then be whole extents via duperemove, which should cut down the size dramatically once more.

This is where I am hanging right now: I don't know how to pull it off on a technical level. I already was crawling through documentation, googling, asking ChatGPT and fighting it's hallucinations, but so far I wasn't very successful in finding leads (probably need to perform some ioctl calls).

From what I imagine, there are probably two ways to do this: - rewrite the file with a new name in the intended extent layout, delete the original and rename the new one to take it's place - rewrite the extent layout of an already existing file, without bending over backwards like described above

I need is a reliable way to, without chances of the filesystem optimizing away my intended layout, while I write it. The best case scenario for a solution would be a call, which takes a file/inode and a list of offsets, and then reorganizes it into that extents. If something like this does not exist, neither through btrfs-progs, nor other third party applications, I would be up for writing a generic utility like described above. It would enable me to solve my problem, and others to write their own custom dedicated deduplicaton software for their specific scenario.

If YOU - can guide me into the right direction - give me hints how to solve this - tell me about the right btrfs communities where I can talk about it - brainstorm ideas I would be eternally grateful :) This is not a call for YOU to solve my problem, but for some guidance, so I can do it on my own.

I think that BTRFS is superb for deduplicated archives, and it can really shine, if you can give it a helping hand.

mappu 7 days ago | next |

It looks like your solutions so far expose the result as a plain, accessible filesystem. If you can relax this constraint you might have an easier time -

"Precomp" is a program that losslessly expands a zip/gzip archive in a way that can be 1:1 restored, but in its expanded format, is more amenable to solid compression.

Could you write a custom archiver for the MPQ files that does something similar - expands them out, in a reversible way, to make the files appear more similar? Then long-range solid compression (e.g. zpaq) can take care of the rest,

Or, your custom archiver could extract the blobs to individual files, and then you'd only need to store each inner stream once, while still being able to reconstruct the original MPQ on demand.

karteum 7 days ago | root | parent | next |

I was also thinking about extracting the MPQ on-disk e.g. "foo.mpq" -> ".foo.mpq.extracted/..." (which - to my understanding - would lead to a lot of deduplication even just with jdupes). N.B. depending on the requirements with regards to read performance, you could design a FUSE filesystem that would transparently reconstruct the MPQ on-the-fly.

Of course, I guess all of this would be much slower than being able to use inherent btrfs features as the author of the post wishes, but at least it would be independent from btrfs :)

DrFrugal 6 days ago | root | parent |

prying apart the MPQ file into it's parts and writing them down to disk is at the moment probably the strongest contender for my solution. it will cause the parts to be block aligned, be able to be hardlinked and cut down on metadata and improve performance (same inode when hardlinked). only thing i need in this case is to have a script to reassamble those extracted ones into the original MPQ archives, which have to match byte-by-byte to the original content ofc. extracting them into it's distinct parts also allows to access the contents directly, if so desired, without needing to extract them on demand (some people wanna look up assets in specific versions). these distinct parts can then additionally be deduplicated on block level as well.

eb0la 6 days ago | root | parent |

I was curious about MPQ format and found this tool: https://github.com/eagleflo/mpyq

Maybe you can use it to decompress some files and assess how much disk space you can save using deduplication at filesystem level.

If it's worth the effort (1 mean, going from 1TiB to 100-200 Gib) I would consider coding the reassembly part. It can be done by a "script" fist, then can be "promoted" to a FUSE filesystem if needed.

DrFrugal 7 days ago | root | parent | prev |

you are absolutely on point - i would prefer having a real filesystem with deduplication (not compression), which offers data in a compact form, with good read speed for further processing.

i was already brainstorming of writing a custom purpose-built archive format, which would allow me to have more fine grained control over how i can lay out data and reference it. the thing is that this archive is most likely not absolutely final (additional versions being added) - having a plain filesystem allows for easier adding of new entries. an archive file might have to be rewritten.

if i go the route of custom archive, i can in theory write a virtual filesystem for it to access it read only like it would be a real filesystem... and if i design it properly, maybe even write it.

still would prefer to use a btrfs filesystem tbh ^^ will brainstorm a bit more over the next days - thanks for your input!

Intralexical 6 days ago | root | parent |

This is good thinking, but I think you are basically describing a Restic/Borg respository :)

- Deduplication? Check.

- Compact format? Check.

- Good read speed? Yep. (Proportional to backing store.)

- Custom purpose-built? Yeah, that's what backup programs are for.

- Custom data layout? Check. (Rabin/BuzHash content-defined chunking, SHA256 dedupe.)

- Adding additional versions? Yes. ("Incremental backups"— See above.)

- Virtual filesystem access? `restic mount`/`borg mount`.

DrFrugal 6 days ago | root | parent | next |

thanks for mentioning those 2 projects, will check them out over the holidays and do some experimenting ^^

mappu 6 days ago | root | parent | prev |

The same storage system is used by some non-backup software too, Perkeep and Seafile

Intralexical 6 days ago | prev | next |

Options:

- `zstd --train` a dictionary over all the files, then compress it all using the dictionary. Decompress on demand. Make sure you don't lose the dictionary, or you'll lose all of it.

- OR stick it into a Restic/Borg/Kopia/Deduplicacy repository. These use a rolling hash to deduplicate any subsections of files that are the same, before applying compression. You can try it with `--dry-run` first to see how much space you would save. Use their `mount` command to access the files as a directory.

I would not be surprised if it gets the total size down close-ish to the size of only a single copy of the game.

See the other comment about `precomp` if compression gives you issues.

Personally I would just go with the `restic` route, as that's what it's meant for.

DrFrugal 6 days ago | root | parent |

will probably write the MPQ blobs down to disk and deduplicate via hardlinks and additionally on block level. i don't know about restic (or borg, which was also recommended), but i will read up on it and doe some tests with it, regardless, since this seems to be a very nice tool for a lot of problem scenarios. thanks for the input!

canucker2016 2 days ago | prev | next |

I think you should tighten up your goals - state specific benchmarks as goals:

- bring the size down - you're more than 90% smaller than original, where do you stop? the low-hanging branches have been pruned via jdupes, looks like trying to slim down MPQ redundancy will require more compute/runtime. How much are you willing to allow to get an extra few percent?

- easy sharable format - BTRFS - doesn't sound like an easy sharable format, especially the hardlinks. How will you share it? Even several tens\hundreds of GBs can be expensive if people are downloading it often.

- what's your definition of a lower end machine? SSD? or spinning rust hard disk allowed? How much RAM? What minimum CPU? Linux\Unix only?

canucker2016 2 days ago | root | parent |

Wouldn't a WoW/Blizzard-programming-specific forum be a more relevant community to get help?

Looking at the docs for the MPQ file format, it looks like recreating exact copies of every MPQ file may be more work than its worth - space-savings-wise.

As you suggest, carving out the game-specific art assets (probably the cut-scene/cinematics take up much of the space), replace with zeros or some other easily compressible data, and compressing the remaining MPQ husk for archiving will save you much with relatively little work/time.

Uncompress the archived MPQ husk and fill in with the assets to get back to square one.

Then you'd just dedupe the art assets.

So, some tool/script which can tell you the byte range within the MPQ file for each of those assets will get you much of the way.

sillystuff 7 days ago | prev | next |

Instead of dupremove, bees may be more appropriate for your goals.

https://github.com/Zygo/bees

Bees does things at a lower level than just reflinking full files like dupremove (it dedups at the level of extents). Be sure to read the links in the "Recommended Reading" section, of the above link, to hopefully avoid issues.

DrFrugal 7 days ago | root | parent |

duperemove has "--dedupe-options=partial" which also enables this, not just full extents. the issue still is, that the data within the archive is not block aligned, thus preventing me from deduplicating them properly

rlupi 6 days ago | prev | next |

If the MPQ files are not too big and they match across languages, architectures and stages of development (e.g. same file names), the simplest solution is just to compress related files together. Let the compression algorithms figure out the symbols of your alphabet (eventually they will match your compressed files within MPQ archives, or their components). Restoring a specific version of the game then becomes "extract this list of files from archives and place them at these locations."

Random idea (I haven't tested it): If you need to figure out how to cluster together related MPQ files, you could compute merkle trees (or multiple merkle trees shifted by prime number offsets to identify common subsequences at arbitrary offsets), and use selected hashes for similarity search.

DrFrugal 6 days ago | root | parent |

even though I do not like it too much, i think i will have to pry apart the MPQ files into it's distinct parts, and write them down to the filesystem individually (and then deleting the original file) - basically what i wanted to do with the extents, but instead have them as distinct files. this then can be reversed via script to assemble the original archive file on demand to get a byte-by-byte equal file again. writing the parts down to the filesystem will cause the parts to be properly block aligned, and be able to be hardlinked, if they exist multiple times on the filesystem - this cuts down on metadata even more and also boosts performance when doing block/extent deduplication, since a single inode is only processed once in most proper deduplication programs.

the MPQ files range from a few MiB to around 2.5 GiB. since the access should be rather fast, pairing them as an archive file is not an option for me.

thanks about the hint of the merkle trees, i will read up on what that is... always good to know about different approaches to a problem :)

zzlk 6 days ago | root | parent |

I'm not sure about how it works in WoW but I can give you a warning about what happens in StarCraft. To prevent other people from editing StarCraft maps (which are MPQs), users would intentionally mangle the MPQ format in just the right way so that the game could still play it but other tools could not open it for editing. So, if there is anything like that going on in WoW world then it might be very hard to reassemble the original MPQs and get a byte for byte match.

DrFrugal 6 days ago | root | parent |

shouldn't be the case, and i would implement a proper verify when exploding the MPQ file, by running a reassamble and hash comparison right afterwards.

tricked 4 days ago | prev | next |

This seems like the perfect use-case for DwarFS : "DwarFS is a read-only file system with a focus on achieving very high compression ratios in particular for very redundant data."

https://github.com/mhx/dwarfs

yjftsjthsd-h 2 days ago | root | parent |

I thought of that too, especially since the use case is almost the same:

> I started working on DwarFS in 2013 and my main use case and major motivation was that I had several hundred different versions of Perl that were taking up something around 30 gigabytes of disk space, and I was unwilling to spend more than 10% of my hard drive keeping them around for when I happened to need them.

julik 3 days ago | prev | next |

It looks like you are basically outlining the solution with the description of the file format.

If it is { header, files[], hash_table[], block_table[] } then it would mean that you likely can de-duplicate based on the block_table or the hash_table (not familiar with the format itself). It may be prudent to check how many MPQ files have the same blocks inside them (checksumming or hashing, comparing on multiple checksums). If there is some overlap and some blocks are present in multiple files, you can extract those blocks and allow any version of the file be reconstituted back into an MPQ file - as long as you know which blocks to combine and in which order.

pavel_lishin 3 days ago | prev | next |

> jdupes

This might solve a problem I have! Whenever I use Google Takeout, it generates a new copy of a photo in Google Photos for every album that it's in, which means it takes up significantly more space than it should!

I've used fdupes to calculate just how much duplication is happening, and apparently out of 650+ gigabytes of photos, fully 250 gigabytes are duplicates.

DrFrugal 2 days ago | root | parent |

jdupes is quite powerful and allows you to create different types of links and even more. i used it for creating hardlinks, since this was the most efficient for my use case. from what i read, there is currently also a rewrite of jdupes in progress, which will introduce significant performance improvements... you can see a post about it here:

https://www.jdupes.com/ Working toward jdupes 2.0

btdmaster 2 days ago | prev | next |

Interesting, though a bit outside what you want (not random access): https://gwern.net/sort

Teever 3 days ago | prev | next |

It sounds like you're trying to make a more efficient back end for something like eXoDos[0]

You could try reaching out to the developers there and see if they have any solutions.

[0] https://www.retro-exo.com/exodos.html

DrFrugal 2 days ago | root | parent |

thanks for the link, but i think i found my favored solution right now, which is extracting the archive into raw files (and header, hashtable, blocktable) and then reassembling the files on demand into a byte-for-byte equal archive on demand (or via virtual file system). this will block align everything, give people access to raw assets and is flexible and performant on the filesystem because of hardlinking. appreciate your help though :)

GranPC 2 days ago | root | parent |

A virtual filesystem to "simulate" the MPQ files on-demand based off the raw deduplicated assets was the exact thing that came to mind when I read the OP. Happy to help on this. Email in profile.

rgovostes 7 days ago | prev | next |

There are open source libraries for interacting with MPQ, e.g., https://github.com/mbroemme/mpq-tools, if you’re not already using one.

DrFrugal 6 days ago | root | parent |

thanks for the link - i will write my extraction code though, since the format is very simplistic, and it gives me fine grained control over how things are done. appreciate your help though!

rgovostes 6 days ago | prev | next |

It seems that the filesystem-level deduplication approach with BTRFS conflicts with your goal of having an easily-shareable format. A filesystem is not an ideal archive format, especially one that is (~)only compatible with Linux systems. Also, over a long enough time horizon, filesystems become obsolete and difficult to mount.

Secondly, it seems like it is going to take contortions to utilize BTRFS deduplication with MPQ archives, hacking archives apart into separate extents, fighting with wasted space, etc. You're talking about low level filesystem manipulations to optimize it for this peculiar use case. And then you would need custom tooling anyway to be able to re-constitute the MPQ files from this filesystem of exploded fragments.

As a simpler, filesystem-agnostic approach, I would move all of the file data into a large content-addressable storage blob, which is by construction deduplicated. Then, strip the file data out of the MPQ archives and leave pointers into this blob. The process to reconstitute an archive is straightforward.

This approach does not rely on filesystem-specific deduplication, is portable to other platforms, and largely avoids the problem of wasted space due to filesystem structures. However, it would be a non-standard archive format, and you would have to make sure your scripts are well-documented enough that they will be usable in the future.

In more detail, to prepare your collection, you would write a script to do these steps:

1. Enumerate each file in each MPQ archive and compute the SHA-256 hash of its data.

2. If this hash has never been seen before, concatenate the file to your large storage blob. Record the hash => (offset, length) mapping to an index file.

3. Strip the file data out of the MPQ archive, leaving behind the SHA-256 hash as pointer you will use later to look up the (offset, length) in the index and recover the file.

4. Record the SHA-256 hash of the entire original MPQ archive, so you can later verify you've recreated it without loss of data integrity.

To reverse the process, you just need to pull pieces out of the storage blob and reconstitute the original MPQ archive. Check the hash of the recreated file against the original.

The stripped archives should be of negligible size and no further attempt to compress or deduplicate them is necessary. That said, you should try to compress your storage blob with something like the Zstandard Seekable Format. This will work best if you decompress the file data before hashing/storing it, and then re-compress it later when reconstituting the MPQ archive.

DrFrugal 6 days ago | root | parent | next |

a BTRFS subvolume can be exported as a snapshot, and this can be passed around rather easily. i know the concern mounting BTRFS, but there are windows drivers for that as well, and you can also mount it via WSL nowadays, to have proper linux tooling.

the approach of having a custom storage blob with pointer references which is something i consider as well, will play around with that during the holidays and do some experimenting. thanks for your input.

canucker2016 2 days ago | root | parent |

Definitely document\run-through what your intended audience will have to do to get your data into a state usable for them.

It also lets the user know how long this will take.

Why a content-addressable blob rather than just a directory with entries named after the SHA (which you can later tar up if desired)?

rgovostes 3 days ago | root | parent |

That would work as well. I'm not sure if extraction of a file by filename from a compressed tarball is particularly efficient or not.

hamandcheese 2 days ago | root | parent |

Ahh I see. If fast random access is a requirement that makes things a bit more tricky, yes.

nolist_policy 5 days ago | root | parent | prev |

You just described git-annex.

rgovostes 5 days ago | root | parent |

Basically I described how any content-addressable storage system works (including git), with the complication that our content is spread across various archive files, MPQ here but just as applicable to tarballs or anything else.

The difference between my suggestion and how BTRFS dedupes is that the FS does it on a block basis, which might work better for some kinds of files, but for game assets I think doing it on a file basis is good enough if not better.

zzlk 7 days ago | prev | next |

I'm surprised to see MPQ come up. What game is it?

DrFrugal 6 days ago | root | parent |

World of Warcraft - from the earliest (publicly leaked) Alpha up and including the very last Wrath of the Lich King version. Other people are already working on getting backups of the next 2 expansions working again to add those client versions to the archive as well.

moonshadow565 7 days ago | prev | next |

Lookup FICLONERANGE ioctl

DrFrugal 7 days ago | root | parent |

this is either a very big coincidence, or you are in the datamining discord as well. the original archive i base my project on uses RMAN to store everything :D --- thanks for the hint about the FICLONERANGE ioctl... it seems to be fine grained enough to allow me deduplicate on arbitrary offsets, not just whole blocks. will give it a go.

moonshadow565 7 days ago | root | parent |

Btrfs is something i originally wanted to use but other people were not fans of linux so custom ad-hoc tooling (RMAN) it was.

DrFrugal 7 days ago | root | parent |

i tried FICLONERANGE via a python wrapper btw - it turns out, that i can only clone ranges aligned to block boundaries :(

BTRFS is very neat per se, but documentation and help (most of all in very niche cases like this one here lol) is not that easy to come by. my plan would be to properly process the data set, and then make it available as a BTRFS snapshot... you can export btrfs send as a file as well for storage etc.

if all my tries to use BTRFS fail, i might to write my own tooling and virtual filesytem as well, but optimized for my use case (MPQ files and such). thanks for your input so far.

brudgers 7 days ago | prev |

Deduplication is a poor archival strategy. Storage is cheap. 10TiB is a couple of thousand dollars with reasonable redundancy.

Write Once is how to manage an archive.

Displaying a curated subset is a good interface. Good luck.

DrFrugal 7 days ago | root | parent |

this was not helpful at all, and i think you also did not read the goals of this project

brudgers 7 days ago | root | parent | next |

I assumed the goal was archiving games for preservation.

If the goal is algorithmic erasure of data rather than preservation of games, then “archive” might again create confusion like the type I probably have.

If there is a strong business case for deduplication, I recommend hiring a consultant with expertise and experience in the problem.

To be clear search is the only way to identify redundancy. If you have search, then redundancy is not a problem.

rcxdude 7 days ago | root | parent |

Often these archive projects have the goal of propagating the archive across many different systems, for which reducing size is very valuable. This is basically a compression exercise in the case where there are many large duplicated blocks.

DrFrugal 6 days ago | root | parent | next |

This is correct - the main goal is to have this rather compact, but still having good read times. This will allow me to store it on my new 4 TiB NVMe drive. A lot of iterative scanning will happen, because I search for interesting information, which helps reverse engineering. Also it allows me to share this with other people over the internet before I kick the bucket in a few decades... transferring 10.4 TiB would be rather boring :D

brudgers 7 days ago | root | parent | prev |

Thanks, that makes sense.

The question began, I am working on a game preservation project.

canucker2016 2 days ago | root | parent |

The line you referenced is a statement - not a question.

The third line of the post is "The goals are: - bring the size down - retain good read speed (for further processing/reversing) - easy sharable format - lower end machines can use it"

jrm4 7 days ago | root | parent | prev |

I think the confusion here (for me too) is what precisely is the purpose/meaning of "bring the size down?"

aka, is the is a "make it easier to search" question or "we need to take up fewer bytes" question (which, perhaps, used to be the same question, but aren't necessarily now?)

rcxdude 7 days ago | root | parent | next |

I think it's "I want people to make copies of this"

DrFrugal 6 days ago | root | parent | prev |

it's a question of both "i can store this on my 4 TiB NVMe disk" and "i can share this with other people over the internet in a timely manner"