32B is a good choice of size, as it can run on a 24GB consumer card (RTX 3090/4090) at ~4 bpw while using most of the VRAM. Contrast Llama 3.1, which shipped 8B, 70B (much too big to fit), and 405B.
It would appear to have been a U.S.-only game until now. As Eric Schmidt said in the YouTube lecture (that keeps getting pulled down), LLMs have been a rich-companies game.
We are lucky that Alibaba, Meta, and Mistral each see some strategic value in public releases. If it were just one of them, it would be a fragile situation for downstream startups. And they're even situated in three different countries.
Good at what? It's great at breaking down complex problems into small, logical steps. Claude Sonnet 3.5 is still the best for coding. They can be leveraged together using Aider's architect mode: it sends your request to the "architect" model first, which returns a list of steps to implement your idea but doesn't write any code at that point. You approve the plan, and it's then handed to the coding model to actually write the code. This technique produces better code than any single model by itself. In Aider you can assign any model you want as the architect and any other model as the coder. It's really great, and I'm looking forward to the AI coding extensions for VSCode doing the same thing, since I prefer working in VSCode to the command line, which Aider requires.
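For reference, this split can be set up directly from Aider's command line. The flags below reflect recent Aider releases (check `aider --help` for your version), and the model names are just examples of the pairing described above:

```shell
# Use o1 as the "architect" (plans the change) and a Sonnet model as
# the "editor" that actually writes the code.  Model names are
# examples; substitute whatever your API provider exposes.
aider --architect \
      --model o1-preview \
      --editor-model claude-3-5-sonnet-20241022
```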
My only real problem with o1 is that it's ridiculously expensive, to the point that it makes no sense to use it for actual code. In architect mode, however, you can keep the costs under control as there are far fewer input/output tokens.
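The cost argument is easy to see with a back-of-envelope calculation. The token counts and per-million-token prices below are made up for illustration (check current API pricing), but they show why planning with the expensive model and editing with a cheaper one comes out ahead:

```python
# Back-of-envelope: o1 writing all the code directly, versus o1 only
# producing a short architect plan while a cheaper model does the edits.
# All token counts and prices are hypothetical.

def cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """API cost in dollars, given per-million-token prices."""
    return (input_tokens / 1e6 * in_price_per_m
            + output_tokens / 1e6 * out_price_per_m)

# Hypothetical: o1 handles the whole edit (large context, large diff).
direct = cost_usd(60_000, 8_000, in_price_per_m=15.0, out_price_per_m=60.0)

# Hypothetical: o1 only emits a step list; a cheaper coder model
# consumes the big context and writes the diff.
plan  = cost_usd(6_000, 1_000, in_price_per_m=15.0, out_price_per_m=60.0)
edits = cost_usd(60_000, 8_000, in_price_per_m=3.0, out_price_per_m=15.0)

print(f"direct o1:         ${direct:.2f}")        # $1.38
print(f"architect + coder: ${plan + edits:.2f}")  # $0.45
```

The expensive model only ever sees (and emits) the small planning exchange, which is where the savings come from.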
I haven’t been super impressed with it, and haven’t encountered any practical tasks I wanted to solve with an LLM where o1 worked any better than prompting 4o or Sonnet to use more extensive CoT.
There might be some narrow band of practical problems that other LLMs can't solve but o1 can, but I don't think that really matters for most use cases, especially given how much slower it is.
Day to day, you just don’t really want to prompt a model near the limits of its capabilities, because success quickly becomes a coin flip. So if a model needs five times as long to work, it needs to dramatically expand the range of problems that can be solved reliably.
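The coin-flip point can be made concrete. If you retry until success, expected attempts follow a geometric distribution, so expected wall-clock time is t / p. The numbers below are illustrative, not measurements:

```python
# Expected total time when retrying a model until it succeeds.
# Attempts ~ Geometric(p), so E[attempts] = 1/p and E[time] = t / p.

def expected_time(t_per_attempt, p_success):
    """Expected wall-clock time retrying until the first success."""
    return t_per_attempt / p_success

fast = expected_time(1.0, 0.5)   # quick model at coin-flip reliability
slow = expected_time(5.0, 0.95)  # 5x slower, much more reliable

print(fast)  # 2.0
print(slow)  # ~5.26
```

Even at 95% reliability the 5x-slower model loses on expected time, so its value has to come from reliably solving problems the fast model can't solve at all.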
I think the true edge of CoT models will come from layman usability. While I can easily prompt Claude for examples and then manually modify the code to fill in the gaps, general domain knowledge and technical understanding is absolutely required from the human sitting in front of the screen. With o1, a layman can sit in front of the computer, and ask 'I want a website for tracking deliveries for my webshop and make it pretty', and the model will do it.
So it's not so much about increased capability, but removing the expert human in the loop.
My understanding was that the metric for LMArena is that one answer is “better” than another, for a deliberately 100% subjective definition of better.
My experience has been that typical LLMs will have more "preamble" to what they say, easing the reader (and priming themselves autoregressively) into answers with some relevant introduction of the subject, sometimes justifying the rationale and implications behind things. But for o1, that transient period and the underlying reasoning behind things are part of OpenAI's special sauce, and they deliberately and aggressively take steps to hide them from users.
o1 will get correct answers to hard problems more often than other models (look at the math/coding/hard subsections on the leaderboard, where anecdotal experiences aside, it is #1), and there’s a strong correlation between correctness and a high score in those domains because getting code or math “right” matters more than the justification or explanation. But in more general domains where there isn’t necessarily an objective right or wrong, I know the vibe matters a lot more to me, and that’s something o1 struggles with.