[Rumor] Shipping Listing Suggests 24GB+ Intel Arc B580

brucethemoose@lemmy.world · edit-2 18 hours ago

What are bitnet models and what does that change in a nutshell?

Read the pitch here: https://github.com/ridgerchu/matmulfreellm

Basically, using ternary weights, all inference-time matrix multiplication can be replaced with much simpler addition and subtraction. This is theoretically more efficient on GPUs, and astronomically more efficient on dedicated hardware (as adders take up a fraction of the silicon area that multipliers do). This would be particularly fantastic for, say, local inference on smartphones or laptop ASICs.
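To make the "no multiplies" point concrete, here is a toy sketch in plain Python, not the matmulfreellm implementation. It assumes weights restricted to -1, 0, and 1; the function name `ternary_matvec` and the sample values are made up for illustration. Real ternary models also involve scaling factors, quantized activations, and fused kernels, all of which are left out here.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, 1}) by a vector
    using only additions and subtractions. Hypothetical helper."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the input
            elif w == -1:
                acc -= xi      # -1 weight: subtract the input
            # 0 weight: the input is skipped entirely
        out.append(acc)
    return out


# Made-up example values.
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]

result = ternary_matvec(W, x)

# Ordinary multiply-accumulate for comparison.
reference = [sum(w * xi for w, xi in zip(row, x)) for row in W]
```

Both `result` and `reference` hold the same values, but the first path never executes a multiply, which is the property dedicated hardware could exploit.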

The catch is no one has (publicly) risked a couple of million dollars to test it with a large model, as (so far) training it isn’t more efficient than “regular” LLMs.

Doesn’t Open AI just have the same efficiency issue as computing in general due to hardware from older nodes?

No one really knows, because they’re so closed and opaque!

But it appears that their models perform relatively poorly for their “size.” Qwen is nearly matching GPT-4 in some metrics, yet is probably an order of magnitude smaller, while Google/Claude and some Chinese models are also pulling ahead.

brucethemoose@lemmy.world · edit-2 19 hours ago

The environmental cost of training is a bit of a meme. The details are spread around, but basically, Alibaba trained a GPT-4 level-ish model on a relatively small number of GPUs… probably on par with a steel mill running for a long time, a drop in the bucket compared to industrial processes. OpenAI is extremely inefficient, probably because they don’t have much pressure to optimize GPU usage.

Inference cost is more of a concern with crazy stuff like o3, but this could dramatically change if (hopefully when) bitnet models come to fruition.

Still, I 100% agree with this. Closed LLM weights should be public domain, as many good models already are.

brucethemoose@lemmy.world · edit-2 3 days ago

To be fair, BG3 is like bottled lightning, and I think it’s unreasonable to expect many (if any) other studios to produce something like that.

Even the Divinity games were way above par, with a much more lukewarm (but not unsuccessful, I guess?) reception.

brucethemoose@lemmy.world · 4 days ago

Almost certainly not. The A770 is built like an “upper midrange” GPU while the B580 is a smaller die.

If there’s ever a B770 or whatever, maybe consider it.

If you’re using them for running like coder llms though, that’s a different story.

brucethemoose@lemmy.world · edit-2 5 days ago

It uses embedded LPDDR5X, so it will not be upgradeable unless the mobo/laptop maker uses LPCAMMs.

And… that’s kinda how it has to be. Laptop SO-DIMMs are super slow due to the design of the DIMMs, and they need crazy voltages to even hit the speeds/timings they run at now.

brucethemoose@lemmy.world · edit-2 5 days ago

We effectively can if we threaten to pull all support and harass Ukraine instead…

Not that I want that, or have any say in that as a US citizen…

brucethemoose@lemmy.world · 5 days ago

You could list it locally depending on where you are, through FB marketplace or Craigslist.

Otherwise, yeah, eBay.

brucethemoose@lemmy.world · edit-2 6 days ago

They’re kinda already there :(. Maybe even worse than Raspberry Pis.

Intel has all but said they’re exiting the training/research market.

AMD has great hardware, but the MI300X is not gaining traction due to a lack of “grassroots” software support, and they were too stupid to undercut Nvidia and sell high vram 7900s to devs, or to even prioritize its support in rocm. Same with their APUs. For all the marketing, they just didn’t prioritize getting them runnable with LLM software.

brucethemoose@lemmy.world · edit-2 6 days ago

Your OS uses it efficiently, but fundamentally it also limits what app developers can do. They have to make apps with 2-6GB in mind.

Not everything needs a lot of RAM, but LLMs are absolutely an edge case where “more is better, and there’s no way around it,” and they aren’t the only one.

brucethemoose@lemmy.world · edit-2 6 days ago

It’s just smarter with the same number of parameters. Try Qwen QwQ or Qwen coder 32B, see for yourself… it stacks up well against huge models like the 123B Mistral Large, or even GPT-4.

Why? Alibaba trained it well, presumably with better data than OpenAI or whoever else, though specifics are up for debate. Some suggest that bilingual training on English/Chinese (aka the two largest text corpuses in existence) significantly helps the model over mostly English. Some say the government just gave them better data. There are also suggestions that having so few GPUs compared to American AI companies made the Chinese “thrifty,” and gave them far more incentive to be innovative rather than brute forcing models (which has diminishing returns).

brucethemoose@lemmy.world · 6 days ago

My old Razer Phone 2 (circa 2019) shipped with 8GB RAM, and that (and the 120hz display) made it feel lightning fast until I replaced it last week, and only because the microphone got gunked up with dust.

Your iPhone 14 Pro has 6GB of RAM. It’s a great phone (I just got a 16 plus on a deal), but that will significantly shorten its longevity.

brucethemoose@lemmy.world · edit-2 6 days ago

B580 24GB and B770 32GB

They would be incredible, as long as they’re cheap. Intel would be utterly stupid for not doing this.

With OpenAPI being backed by so many big names, do you think they will be able to upset CUDA in the future or has Nvidia just become too entrenched?

OpenAI does not make hardware. Also, their model progress has stagnated, already matched or surpassed by Google, Claude, and even some open source Chinese models trained on far fewer GPUs… OpenAI is circling the drain, forget them.

The only “real” competitor to Nvidia, IMO, is Cerebras, which has a decent shot due to a silicon strategy Nvidia simply does not have.

The AMD MI300X is actually already “better” than Nvidia’s best… but they just can’t stop shooting themselves in the foot, as AMD does. Google TPUs are good, but google only, like Amazon’s hardware. I am not impressed with Groq or Tenstorrent.

brucethemoose@lemmy.world · edit-2 6 days ago

Yeah, but they’re not worth it.

The 4090 is basically just as good as the 3090 because it has the same amount of vram, but at twice the price… so you might as well get 2x 3090s.

The 5090 will be hilariously expensive, and 24GB -> 32GB is not that great, as you still can’t run 70B class models in that pool… again, might as well get 2x 3090s. I would not even bother trading my single 3090 for a 5090.
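A rough back-of-the-envelope sketch of why 32GB still falls short for 70B class models. It assumes roughly 4 bits per weight (a common quantization level, but an assumption here) and counts only the weights; KV cache, activations, and runtime overhead would push real usage higher. The helper names `weight_footprint_gb` and `fits` are made up for illustration.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Weights-only memory in GB (1 GB = 1e9 bytes).
    Ignores KV cache, activations, and runtime overhead."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb):
    """True if the weights alone fit in the given VRAM pool."""
    return weight_footprint_gb(params_billion, bits_per_weight) < vram_gb


# Assumed quantization level: about 4 bits per weight.
BITS = 4

size_70b = weight_footprint_gb(70, BITS)   # 35.0 GB of weights
size_32b = weight_footprint_gb(32, BITS)   # 16.0 GB of weights

fits_70b_on_32gb = fits(70, BITS, 32)      # single 32GB card
fits_70b_on_48gb = fits(70, BITS, 48)      # e.g. 2x 24GB cards pooled
fits_32b_on_24gb = fits(32, BITS, 24)      # single 24GB card
```

Under these assumptions the 70B weights alone already exceed 32GB but fit in a 48GB pool, while a 32B model leaves headroom on a 24GB card for context.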

If AMD sold a 48GB consumer card, you would see them dominate the open source LLM space in a month, because every single backend dev would buy one and get their projects working on them. Same with Intel. VRAM is basically the only thing that matters, and 24GB is kinda pitiful at a 4090’s price.

brucethemoose@lemmy.world · edit-2 6 days ago

It’s complicated.

So there’s Intel’s own project/library, which is the fastest way to run LLMs on their IGPs and GPUs. But also the hardest to set up, and the least feature packed.

There’s more than one Intel compatible llama.cpp ‘backend,’ including the Intel-contributed SYCL one, another PR for the AMX support on CPUs, I think another one branded as ipex-llm, and the vulkan backend that the main llama.cpp devs seem to be focusing on now. The problem is each of these backends has its own bugs, incomplete features, installation quirks, and things it doesn’t support, while AMD’s rocm kinda “just works” because it inherits almost everything from the CUDA backend.

It’s a hot mess.

Hardcore LLM enthusiasts largely can’t keep up, much less the average person just trying to self-host a model.

OneAPI is basically a nothingburger so far. You can run many popular CUDA libraries on AMD through rocm, right now, but that is not the case with Intel, and no devs are interested in changing that because Intel isn’t selling any “3090 class” GPU hardware worth buying.

brucethemoose@lemmy.world · edit-2 6 days ago

Yeah… and it kinda sucks because it’s small.

If Apple shipped with 16GB/24GB like some Android phones did well before the iPhone 16, it would be far more useful. 16-24GB (aka 14B-32B class models) is the current threshold where quantized LLMs really start to feel ‘smart,’ and they could’ve continued training a great Apache 2.0 model instead of training a tiny, meager one from scratch.

brucethemoose@lemmy.world · 6 days ago

It would be a really stupid business decision for them not to.

brucethemoose@lemmy.world · edit-2 6 days ago

In practice, almost no one with A770s uses ipex-llm simply because it’s not as vram efficient as llama.cpp, isn’t as feature rich, and the PyTorch setup is nightmarish.

Intel is indeed making many contributions to the open source LLM space, but it feels… shotgunish? Not unified at all. AMD, on the other hand, is more focused but woefully understaffed, and Nvidia is laser focused on the enterprise space.

brucethemoose@lemmy.world · 6 days ago

Not if they’re anti vax too. I know a nurse, a family friend, who was anti vax during covid…

brucethemoose@lemmy.world · edit-2 6 days ago

[Rumor] Shipping Listing Suggests 24GB+ Intel Arc B580

brucethemoose@lemmy.world · edit-2 6 days ago

Qwen 2.5 32B is where it’s at now. 24GB is affordable, and it fits perfectly.

Otherwise, stay on the lookout for AMD Strix Halo, which can reportedly allocate up to 96GB on its IGP, and you can run faster backends like vllm or exllama.

brucethemoose@lemmy.world · 6 days ago

You should think about selling it TBH. 3090 prices are shooting up like crazy, and may be at a peak, because they are the last affordable card to self host LLMs.