Did you ever saw a char and thought: “Damn, 1 byte for a single char is pretty darn inefficient”? No? Well I did. So what I decided to do instead is to pack 5 chars, convert each char to a 2 digit integer and then concat those 5 2 digit ints together into one big unsigned int and boom, I saved 5 chars using only 4 instead of 5 bytes. The reason this works is, because one unsigned int is a ten digit long number and so I can save one char using 2 digits. In theory you could save 32 different chars using this technique (the first two digits of an unsigned int are 42 and if you dont want to account for a possible 0 in the beginning you end up with 32 chars). If you would decide to use all 10 digits you could save exactly 3 chars. Why should anyone do that? Idk. Is it way to much work to be useful? Yes. Was it funny? Yes.
Anyone whos interested in the code: Heres how I did it in C: https://pastebin.com/hDeHijX6
Yes I know, the code is probably bad, but I do not care. It was just a funny useless idea I had.
Interesting idea but type conversion and parsing is much more slower than wasting 1 byte. Nowadays memory is “free” and the main issue is the execution speed.
Fuck it. *uses
ulong
to store a boolean*I know. This whole thing was never meant to be very useful, and more like a proof of concept
Alignment wastes much more anyways
You would have done well with this kind of thinking in the mid-80s when you needed to fit code and data into maybe 16k!
As long as you were happy to rewrite it in Z80 or 6502.
Another alternative is arithmetic encoding. For instance, if you only needed to store A-Z and space, you code those as 0-26, then multiply each char by 1, 27, 27^2, 26^3 etc, the add them.
To unpack them, divide by 27 repeatedly, the remainder each time is each character. It’s simply covering numbers to base-27.
It wouldn’t make much difference from using 5 bits per char for a short run, though, but could be efficient for longer strings, or if encoding a smaller set of characters.
It sucks for me because my brain innately enjoys this kind of thinking
Play the Zachtronics programming games (Exapunks, Shenzen I/O, TIS-100) if you haven’t yet.
Bro reinventing sixbits
Did not knew that this existed, but yeah its kinda like that. Except that I only allow 5 characters.
Given you are only a-z with space and dot; you could do 5bit and pack 6 characters per 32bit word with space to spare
Have you heard of Proquints?
https://www.ietf.org/archive/id/draft-rayner-proquint-03.html
Funny how they have a typo in test vectors:
0x0000 -> babab
0xFFFF -> zvzuz
0x1234 -> damuh
0xF00D -> zabat
0xBEEF -> ruroz
This is hilarious. I’m not sure how often anyone would actually need to verbalize arbitrary binary data, but I do see an advantage over base64 since the English letter names are so often phonetically similar.
I did something like this once, in the course of a project whose purpose I don’t remember. Realising that 8-bit ASCII was wasted on the constrained alphabet of some kind of identifiers, I packed them into 6 bits, 1⅓ to a byte. I recall naming the code to do this “ShortASC”, pronounced “short-ass”
Back in the day those tricks were common. Some PDP-11 OS’s supported a “Radix-50” encoding (50 octal = 40 decimal) that packed 3 characters into a 16 bit word (40 codepoints=26 letters, 10 digits, and a few punctuation). So you could have a 6.3 filename in 3 words.
Does the efficiency of storage actually matter? Are you working on a constrained system like a microcontroller? Because if you’re working on regular software, supporting Unicode is waaaaaaaaaaay more valuable than 20% smaller text storage.
Unicode? Sir this is C, if the character doesn’t fit into a uint8 it’s scope creep and too hard
I do sometimes work with microcontrollers, but so far I have not encountered a condition where these minimal savings could ever be useful.
Oh god that switch statement. Here, let me come up with something better:
if (pChar >= 'a' && pChar <= 'z') { return pChar - 'a' + 10; } else if (pChar == ' ') { return 36; } else if (pChar == '.'){ return 37; } return 0;
Ah, thats cool. Did not knew you could that… Thanks.
The original switch statement is great as it is. Easy to understand at a quick glance.
Yes, it’s easy to understand, but this ifelse is much safer because it “handles” uppercase and digits by just returning 0. If you’d call OP’s function with something like ‘A’, you’ll get nonsense data because it doesn’t have a default. (I think it will return whatever value currently resides in eax)
Can easily be dealt with by adding a default case. You also figured out this undefined behavior because you easily understood the switch statement.
Fair.
A single line comment would make it as easy to understand, and much more flexible if you wanted to add handling upper case letters or digits. Or even remap the values to a more standard 6 bit encoding.
You could save 0.64 bit per char more if you actually treated you output as a binary number (using 6 bits per char) and didn’t go through the intermediary string (implicitly using base 100 at 6.64 bits per char).
This would also make your life easier by allowing bit manipulation to slice/move parts and reducing work for the processor because base 100 means integer divisions, and base 64 means bit shifts. If you want to go down the road of a “complicated” base use base 38 and get similar drawbacks as now, except only 5.25 bits per char.I was so triggered by the conversion from char-to-int-to-string-to-packedint that I had to write a bitwise version that just does char-to-packedint (and back again), with bitwise operators.
As others have pointed out, there are probably better options for doing this today in most real-life situations, but it might make sense on old low-spec systems if not for all the intermediate conversion steps, which is why I wrote this.
It’s all fun and games until the requirement changes and you need to support uppercase letters and digits as well.
I am constantly on how I can allow Uppercase, without significantly reducing the posiible amounts of chars
If you’re ever doing optimizations like this, always think in binary. You’re causing yourself more trouble by thinking in decimal. With n bits you can represent 2^n different results. Using this you can figure out how many bits you need to store however many different potential characters. 26 letters can be stored in 5 bits, with 6 extra possibilities. 52 letters can be stored in 6 bits, with 12 extra possibilities. Assuming you want an end of string character, you have 11 more options.
If you want optimal packing, you could pack this into 48 bits, or 6 bytes/chars, for 8 characters.
Well it’s certainly possible to fit both uppercase and lowercase + 11 additional characters inside an int (26 + 26 + 11 = 63). The you need a null terminating char, which adds up to 64, which is 6 bits.
So all you need is 6 bits per char. 6 * 5 = 30, which is less than 32.
It’s easier to do this by thinking in binary rather than decimals. Look into bit shifting and other bitwise operations.
Depending on the use-case you might also want to add special case value like @Redkey@programming.dev did in their example, and get kind of UTF-8 pages. Then you can pack lowercase to 5 bits, and uppercase and some special symbols to 10 bits, and it will be smaller if uppercase are rare
I mean… you’d get better results for large data sets by just using a known compression algorithm. This is only viable for situations where you only have a small amount of data, enough computation to run this conditional, but not enough computation to run compression/decompression.
We have a binary file that has to maintain compatibility with a 16 bit Power Basic app that hasn’t been recompiled since '99 or '00. We have storage for 8 character strings in two ints , and 12 character string in two ints and two shorts.
Damn, that are setups where you have to get creative.
I’m a simple man. Here’s my list of variable names:
Var1
Var2
Var3
Var4
Var5
…
C lets you do this by putting text in single quotes:
int foo = ‘Abcd’;
works
But that’s only four chars in four bytes. This absolute madlad has put five chars in four bytes.
Yes I know, the code is probably bad, but I do not care
That’s why we love it.
In typical C fashion, there’s undefined behavior in
turn_char_to_int
. xDdid not want to always scroll past that behemoth of a switch case xD