Yes, I did spend time on this

da_cow (she/her)@feddit.org · edit-2 2 months ago

Yes, I did spend time on this

bacon_pdp@lemmy.world · 2 months ago

deleted by creator

da_cow (she/her)@feddit.org · 2 months ago

Did not knew that this existed, but yeah its kinda like that. Except that I only allow 5 characters.

bacon_pdp@lemmy.world · 2 months ago

deleted by creator

SolidGrue@lemmy.world · 2 months ago

Have you heard of Proquints?

https://www.ietf.org/archive/id/draft-rayner-proquint-03.html

BatmanAoD@programming.dev · 2 months ago

This is hilarious. I’m not sure how often anyone would actually need to verbalize arbitrary binary data, but I do see an advantage over base64 since the English letter names are so often phonetically similar.

lad@programming.dev · edit-2 2 months ago

Funny how they have a typo in test vectors:

0x0000 -> babab

0xFFFF -> zvzuz

0x1234 -> damuh

0xF00D -> zabat

0xBEEF -> ruroz

AllNewTypeFace@leminal.space · 2 months ago

I did something like this once, in the course of a project whose purpose I don’t remember. Realising that 8-bit ASCII was wasted on the constrained alphabet of some kind of identifiers, I packed them into 6 bits, 1⅓ to a byte. I recall naming the code to do this “ShortASC”, pronounced “short-ass”

bandwidthcrisis@lemmy.world · 2 months ago

You would have done well with this kind of thinking in the mid-80s when you needed to fit code and data into maybe 16k!

As long as you were happy to rewrite it in Z80 or 6502.

Another alternative is arithmetic encoding. For instance, if you only needed to store A-Z and space, you code those as 0-26, then multiply each char by 1, 27, 27^2, 26^3 etc, the add them.

To unpack them, divide by 27 repeatedly, the remainder each time is each character. It’s simply covering numbers to base-27.

It wouldn’t make much difference from using 5 bits per char for a short run, though, but could be efficient for longer strings, or if encoding a smaller set of characters.

SubArcticTundra@lemmy.ml · 2 months ago

It sucks for me because my brain innately enjoys this kind of thinking

Cethin@lemmy.zip · 2 months ago

Play the Zachtronics programming games (Exapunks, Shenzen I/O, TIS-100) if you haven’t yet.

solrize@lemmy.ml · 2 months ago

Back in the day those tricks were common. Some PDP-11 OS’s supported a “Radix-50” encoding (50 octal = 40 decimal) that packed 3 characters into a 16 bit word (40 codepoints=26 letters, 10 digits, and a few punctuation). So you could have a 6.3 filename in 3 words.

BartyDeCanter@lemmy.sdf.org · 2 months ago

Oh god that switch statement. Here, let me come up with something better:

if (pChar >= 'a' && pChar <= 'z') {
  return pChar - 'a' + 10;
} else if (pChar == ' ') {
  return 36;
} else if (pChar == '.'){
  return 37;
}
return 0;

da_cow (she/her)@feddit.org · 2 months ago

Ah, thats cool. Did not knew you could that… Thanks.

PastafARRian@lemmy.dbzer0.com · 2 months ago

First rule of code review, do not sound judgemental.

magic_lobster_party@fedia.io · 2 months ago

The original switch statement is great as it is. Easy to understand at a quick glance.

Karyoplasma@discuss.tchncs.de · edit-2 2 months ago

Yes, it’s easy to understand, but this ifelse is much safer because it “handles” uppercase and digits by just returning 0. If you’d call OP’s function with something like ‘A’, you’ll get nonsense data because it doesn’t have a default. (I think it will return whatever value currently resides in eax)

magic_lobster_party@fedia.io · 2 months ago

Can easily be dealt with by adding a default case. You also figured out this undefined behavior because you easily understood the switch statement.

Karyoplasma@discuss.tchncs.de · 2 months ago

Fair.

BartyDeCanter@lemmy.sdf.org · edit-2 2 months ago

A single line comment would make it as easy to understand, and much more flexible if you wanted to add handling upper case letters or digits. Or even remap the values to a more standard 6 bit encoding.

RiQuY@lemmy.zip · 2 months ago

Interesting idea but type conversion and parsing is much more slower than wasting 1 byte. Nowadays memory is “free” and the main issue is the execution speed.

da_cow (she/her)@feddit.org · 2 months ago

I know. This whole thing was never meant to be very useful, and more like a proof of concept

rtxn@lemmy.world · 2 months ago

Fuck it. *uses ulong to store a boolean*

Tja@programming.dev · 2 months ago

So, python?

jenesaisquoi@feddit.org · 2 months ago

Alignment wastes much more anyways

anton@lemmy.blahaj.zone · 2 months ago

You could save 0.64 bit per char more if you actually treated you output as a binary number (using 6 bits per char) and didn’t go through the intermediary string (implicitly using base 100 at 6.64 bits per char).
This would also make your life easier by allowing bit manipulation to slice/move parts and reducing work for the processor because base 100 means integer divisions, and base 64 means bit shifts. If you want to go down the road of a “complicated” base use base 38 and get similar drawbacks as now, except only 5.25 bits per char.

Redkey@programming.dev · 2 months ago

I was so triggered by the conversion from char-to-int-to-string-to-packedint that I had to write a bitwise version that just does char-to-packedint (and back again), with bitwise operators.

https://pastebin.com/V2An9Xva

As others have pointed out, there are probably better options for doing this today in most real-life situations, but it might make sense on old low-spec systems if not for all the intermediate conversion steps, which is why I wrote this.

Ethan@programming.dev · 2 months ago

Does the efficiency of storage actually matter? Are you working on a constrained system like a microcontroller? Because if you’re working on regular software, supporting Unicode is waaaaaaaaaaay more valuable than 20% smaller text storage.

ryannathans@aussie.zone · 2 months ago

Unicode? Sir this is C, if the character doesn’t fit into a uint8 it’s scope creep and too hard

da_cow (she/her)@feddit.org · 2 months ago

I do sometimes work with microcontrollers, but so far I have not encountered a condition where these minimal savings could ever be useful.

magic_lobster_party@fedia.io · 2 months ago

It’s all fun and games until the requirement changes and you need to support uppercase letters and digits as well.

da_cow (she/her)@feddit.org · 2 months ago

I am constantly on how I can allow Uppercase, without significantly reducing the posiible amounts of chars

magic_lobster_party@fedia.io · 2 months ago

Well it’s certainly possible to fit both uppercase and lowercase + 11 additional characters inside an int (26 + 26 + 11 = 63). The you need a null terminating char, which adds up to 64, which is 6 bits.

So all you need is 6 bits per char. 6 * 5 = 30, which is less than 32.

It’s easier to do this by thinking in binary rather than decimals. Look into bit shifting and other bitwise operations.

lad@programming.dev · 2 months ago

Depending on the use-case you might also want to add special case value like @Redkey@programming.dev did in their example, and get kind of UTF-8 pages. Then you can pack lowercase to 5 bits, and uppercase and some special symbols to 10 bits, and it will be smaller if uppercase are rare

Cethin@lemmy.zip · edit-2 2 months ago

If you’re ever doing optimizations like this, always think in binary. You’re causing yourself more trouble by thinking in decimal. With n bits you can represent 2^n different results. Using this you can figure out how many bits you need to store however many different potential characters. 26 letters can be stored in 5 bits, with 6 extra possibilities. 52 letters can be stored in 6 bits, with 12 extra possibilities. Assuming you want an end of string character, you have 11 more options.

If you want optimal packing, you could pack this into 48 bits, or 6 bytes/chars, for 8 characters.

hdsrob@lemmy.world · 2 months ago

We have a binary file that has to maintain compatibility with a 16 bit Power Basic app that hasn’t been recompiled since '99 or '00. We have storage for 8 character strings in two ints , and 12 character string in two ints and two shorts.

da_cow (she/her)@feddit.org · 2 months ago

Damn, that are setups where you have to get creative.

sunbeam60@lemmy.ml · edit-2 2 months ago

After all… Why not?

Why shouldn’t I ignore the 100+ cultures whose character set couldn’t fit into this encoding?

MonkeMischief@lemmy.today · edit-2 2 months ago

Not 100% relevant but it was in my collection and I thought it was close enough to be funny. :D

JohnEdwa@sopuli.xyz · 2 months ago

ŚĆŻRŹĘĄMŚ

SubArcticTundra@lemmy.ml · 2 months ago

So did ASCII

SpaceCowboy@lemmy.ca · 2 months ago

They left one bit for the other cultures use.

sunbeam60@lemmy.ml · 2 months ago

Yes, which is why we’ve broadly accepted that ASCII isn’t sufficient any more.

Valmond@lemmy.world · 2 months ago

Åååååå!

carrylex@lemmy.world · 2 months ago

Valmond@lemmy.world · 2 months ago

CPU still pulls a 32kb block from RAM…

enumerator4829@sh.itjust.works · 2 months ago

Lol, using RAM like last century. We have enough L3 cache for a full linux desktop in cache. Git gud and don’t miss it (/s).

(As an aside, now I want to see a version of puppylinux running entirely in L3 cache)

BartyDeCanter@lemmy.sdf.org · 2 months ago

I decided to take a look and my current CPU has the same L1 as my high school computer had total RAM. And the L3 is the same as the total for the machine I built in college. It should be possible to run a great desktop environment entirely in L3.

BartyDeCanter@lemmy.sdf.org · 2 months ago

Look at this guy with their fancy RAM caches.

DacoTaco@lemmy.world · edit-2 2 months ago

Cache man, its a fun thing. ~~32k~~ 32 (derp, 32 not 32k) is a common cache line size. Some compilers realise that your data might be hit often and aligns it to a cache line start to make its access fast and easy. So yes, it might allocate more memory than it should need, but then its to align the data to something like a cache line.
There is also a hardware reasons that might also be the case. I know the wii’s main processor communicates with the co processor over memory locations that should be 32k aligned because of access speed, not only because of cache. Sometimes, more is less :')

Hell, might even be a cause of instruction speed that loading and handling 32k of data might be faster than a single byte :').

Then there is also the minimum heap allocation size that might factor in. Though a 32k minimum memory block seems… Excessive xD

Victor@lemmy.world · 2 months ago

Cache Man, I would watch that movie.

ChaoticNeutralCzech@feddit.org · 2 months ago

unsigned int turn_char_to_int(char pChar)
{
    switch(pChar)
    {
    case 'a':
        return 10;
    case 'b':
        return 11;
    case 'c':
        return 12;
    case 'd':
        return 13;
    case 'e':
        return 14;
    case 'f':
        return 15;
    case 'g':
        return 16;
    case 'h':
        return 17;
    case 'i':
        return 18;
    case 'j':
        return 19;
    case 'k':
        return 20;
    case 'l':
        return 21;
    case 'm':
        return 22;
    case 'n':
        return 23;
    case 'o':
        return 24;
    case 'p':
        return 25;
    case 'q':
        return 26;
    case 'r':
        return 27;
    case 's':
        return 28;
    case 't':
        return 29;
    case 'u':
        return 30;
    case 'v':
        return 31;
    case 'w':
        return 32;
    case 'x':
        return 33;
    case 'y':
        return 34;
    case 'z':
        return 35;
    case ' ':
        return 36;
    case '.':
        return 37;

    }
}

Are you a monster or just stupid?

da_cow (she/her)@feddit.org · 2 months ago

Just stupid

ChaoticNeutralCzech@feddit.org · edit-2 2 months ago

If you couldn’t write

if(pChar >= 'a' && pChar <= 'z') return pChar - ('a' - 10);

I suppose you typed this “all the size of a lookup table with none of the speed” abomination manually too.

Zacryon@feddit.org · 2 months ago

switch case structures are very efficient in c and c++. They work similarly like an offset into memory. Compute the offset once (any expression in the ‘case’ lines), then jump. Using primitives directly, like here with chars, is directly the offset. Contrary to if-else branches, where each case must be evaluated first and the CPU has basically no-op cycles in the pipeline until the result of the branch is known. If it fails, it proceeds with the next one, waits again etc… (Some CPU architectures might have stuff like speculative branch execution, which can speed this up.)

However, code-style wise this is really not elegant and something like your proposal or similar would be much better.

ChaoticNeutralCzech@feddit.org · edit-2 2 months ago

Oh, I didn’t know that they were a LUT of jump addresses. Stil, a LUT of values would be more space-efficient and likely faster. Also, what if the values are big and sparse, e.g.

switch (banknoteValue) {
    case 5000:
        check_uv();
        check_holograph();
    case 2000:
        check_stripe();
    case 1000:
        check_watermark();
}

…does the compiler make it into an if-else-like machine code instead?

InternetCitizen2@lemmy.world · 2 months ago

Yes I know, the code is probably bad, but I do not care

That’s why we love it.

Garbagio@lemmy.zip · 2 months ago

deleted by creator

aarch64@programming.dev · 2 months ago

That’s where std::vector<bool> or bitfields come in handy!

Jankatarch@lemmy.world · edit-2 2 months ago

An enum and a logical “or” operator.