About a strange data compression method

7 min readJun 7, 2019

Another user tries to write a new data piece to his hard drive, but there is no space left. He refuses to delete anything since ‘he will need it later’. What should we do in this situation?

Such a problem is not a unique one. We have terabytes of data laying somewhere on our hard drives and it is not going to shrink any time soon. But how unique is it? All in all, every file is just a bit sequence, and it is probable that every new one is not all that different from the already stored ones.

Sure, you should not explicitly search for existing pieces of data on a hard drive — even if it is not a totally predictable failure, at least not a practical one. On the other hand, if the difference is not that great, maybe we can somehow fit one into the other…

TL;DR — about a very strange data optimization technique.

About bits and a difference

If we look at two totally random pieces of data, we’ll find out about 50% of their bits on the same places are equal. Indeed, for any possible combinations — (‘0–0, 0–1, 1–0, 1–1’) — exactly a half has equal bits in it, simple as that.

But of course, if we would just take two files and try fitting one into the other, one of them would be lost. Tying to save changes will lead to us reinventing delta-encoding which is already existed for a long time without us. We can try to fit smaller bit sequence into the larger one but we would risk losing some critical segments.

Between what could we shift the difference? I mean, every new file is just a bit sequence, and we can do nothing with it (think of it as of compressed binary data). That’s why we should find such bits on our hard drive that could be changed without a need to store anything about the change and live through a data loss without serious outcomes. And there should be no need to change the file representation on a filesystem, only less sensitive information inside. But how would we fit one data into the other?

Fitting methods

To help us out there are lossy-compressed files. All that JPEGs, MP3s and others, although have already lost part of their data on initial encoding, still have many bits we can safely alter. It’s also possible to use advanced techniques to modify their parts in unnoticeable ways on different encoding steps. Wait. Advanced techniques… unnoticeable modifications… one data into the other…. are we talking about the steganography?

Really, embedding one information altering the other looks as similar to such methods as physically possible. Invisibility of the changes for a man’s senses is also nice-to-have in our task. But there is no need in secrecy: we only need to store more on the same media, anything else would only hurt the user.

That’s why, although they’re usable, we need to adapt any algorithm beforehand. How? I will answer that below with one of the most well-known existing methods and a common file format.

Compress it a bit more and I’ll kill you

If compress, then the most compressed thing in the wholeworld. Of course, I’m talking about JPEG. Not only there are a metric ton of tools and already existing embedding methods, but it also the most popular graphics format on the planet.

Still, we need to limit our field of action. No one likes these squares from too much of compression, that’s why we also should avoid re-encoding the data as much as we can and alter only already encoded data. To be more specific — integer coefficients left by the lossy operations: DCT and quantization; as one can see from the full encoding scheme below:

There are a lot of existing JPEG optimization techniques. For example, lossless (jpegtran) and “[kinda lossless](https://tinyjpg.com)"™©® (yeah, it is not) ones, but we should not care about any them. After all, if the user is ready to embed one information into the other to increase available space on his hard drive, he has either used them beforehand it or is not willing to lose a bit of quality

F5

We can find a whole family of algorithms with such criteria. One of the most advanced ones is F5 by Andreas Westfeld. It operates by processing the coefficients from the luminance component (the human eye is the least sensitive to it). Moreover, F5 uses advanced encoding scheme based on matrix encoding, allowing to make the fewer changes embedding the same information amount, the larger the container file.

The processing itself is reduced to decreasing coefficients absolute values on a condition (not even 50% of them are changed at all). It is the thing that allows F5 to optimize stored on your drive data. The fact is, the coefficient after such change would probably use less bits in thefile after Huffman encoding due to values statistical distribution in JPEG and zeroes encoding with RLE.

The only thing we need to change is a secrecy part (password-based permutation) to save computation resources, time and the ability to embed one piece of data into the set of multiple files. These modifications are probably not that interesting for the reader, so I will skip their description (if you want, read the source code).

High tech

To proof the concept, I implemented everything in pure C and optimized the result by both execution speed and memory consumption (you can’t even imagine how big decompressed jpegs are). Made it cross-platform by using libjpeg, pcre and tinydir. Thanks, guys. Everything could be built with asimple `make`. Windows guys should have something like MSYS, Cygwin or WSL.

The implementation is available in a command-line tool and a library forms. More on how to use the latter you can read in the GitHub repository linked at the end of the post.

So, how can one use it?

Carefully. Used for the compression images chosen by traversing the specified root directory and matching regular expression to their names. After the process, you can move, rename and copy in it any files you want, switch file- and operating systems, etc. But you should be careful and do not change the actual content. Loosing even one bit could lead to loss of all packed data. Or not, but who’s ready to risk it?

The utility will produce a specific archive file carrying all of the data required to unpack compressed data, including information about used images. It occupies only a few kilobytes of space on the hard drive at most and therefore has no real impact on the free space. But you should not modify, move or touch it at all as his path determines the root search folder.

You can analyze available capacity with the `-a` flag:

./f5ar -a [root directory] [Perl-compatible regex]

Pack your data with:

./f5ar -p [root directory] [Perl-compatible regex] [file to compress] [archvive file name]`

And unpack it with:

./f5ar -u [archive file] [unpacked file name]

Usage demo

To show you the effectiveness of this method I downloaded 255 totally free dogs photos from Unsplash and found big (45 meters long) pdf of the second volume Knuth’s The Art of Computer Programming.

The demonstration itself is pretty simple:

Transparent terminal emulatoris transparent

Hashes are equal and you can (and should) still read the book:

No, really, go ahead and start reading right now

As you can see, we came from the initial 633 + 36 == 669 megabytes of data on my hard drive to more enjoyable 551. Such a difference could be explained with decreased coefficients affecting following lossless data compression: we can lose a couple bytes in the resulting size by decreasing only a single value. However, it is still a loss of data, even if a small one, we should put up with.

Luckily, a human eye has no way of noticing them. You can try it out with your own or check out computed luminance difference by these links: original, with some data in it, difference (dimmer the color, less the difference in the block).

Instead of conclusion

Looking at all of those difficulties you are probably thinking that buying a hard drive or using cloud storage is way more simpler solutions. If so — you are 100% correct. But there is no guarantee that you’ll be able to buy one or even will have things like Internet access at every moment in life. Not even any country has it today. On the other hand, you can always count on data storage you already have at your home.