0

I have a bitset of binary data that I wish to encode compactly as an ASCII string. I intend to initially compress the data using run-length encoding to give a sequence of integers; e.g.

111110001000000000000111

becomes:

5o3z1o12z3o

(e.g. 5 ones, 3 zeros, 1 one, 12 zeros, 3 ones).

However, I wish to then compress this further into a compact ASCII string (i.e. a string using the full range of ASCII characters rather than the digits plus 'o' and 'z'). Can anyone recommend a suitable approach and / or 3rd party library to do this in Java?

1 Answer 1

3

If your goal is compression, just gzip the stream. It's going to do better than your run-length encoding.

Then if you need it to be text for some reason, like to safely pass through old mail gateways, I'd also turn to a standard encoding like Base64, rather than make up your own.

But if you want to roll your own: first I'd note that you don't need the 'o' and 'z'. You already know those values since they alternate. Assume it starts on 0 (and if it doesn't, encode an initial 0 to show that there are 0 0s).

Encoding the numbers textually is possible but probably inefficient. Look into a variable-length encoding for integer values, then encode those bytes. Then 'escape' them into ASCII somehow.

But then we're back to Base64-like encoding, and the first suggestion to gzip + Base64 is probably easier than all of this.

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks - I'll check it out. I notice there are non-public base64 encoders shipped in sun.misc. BTW I included the 'o' and 'z' delimiters assuming that the sequence of 0s or 1s was of arbitrary length rather than a fixed size integer.
Commons Codec (commons.apache.org/codec) has code to encode Base64. You can still have arbitrary-length integers without needing a full byte of delimiter between integers.
@Sean, if you were to encode arbitrary length integers without delimiters how would you know where one integer ended and the next one began? Only thing I can think of is that you'd need to use a variable length encoding scheme (e.g. using the top bit of each byte to denote "has more bytes").
That's exactly how it's usually done. Google protobufs do this for instance, all the way back to the MIDI protocol.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.