32B is a good choice of size, as it can run on a 24GB consumer card (RTX 3090/4090) at ~4 bpw while using most of the VRAM. Contrast Llama 3.1, which shipped 8B, 70B (much too big to fit), and 405B.
It would appear to have been a U.S.-only game until now. As Eric Schmidt said in the YouTube lecture (that keeps getting pulled down), LLMs have been a rich-companies game.
We are lucky that Alibaba, Meta, and Mistral each see some strategic value in public releases. If it were just one of them, it would be a fragile situation for downstream startups. And they're even situated in three different countries.
Good at what? It's great at breaking down complex problems into small, logical steps. Claude Sonnet 3.5 is still the best for coding. They can be leveraged together using Aider's architect mode: it sends your request to the "architect" model first, which returns a list of steps to implement your idea but doesn't write any code at that point. You approve the plan, and it's then handed to the coding model to actually write the code. This technique produces better code than any single model by itself. In Aider you can assign any model you want as the architect and any other model as the coder. It's really great, and I'm looking forward to the AI coding extensions for VSCode doing the same thing, since I prefer working in VSCode to the command line, which Aider requires.
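For reference, this split can be set up directly from Aider's command line. The flags below reflect recent Aider releases (check `aider --help` for your version), and the model names are just examples of the pairing described above:

```shell
# Use o1 as the "architect" (plans the change) and a Sonnet model as
# the "editor" that actually writes the code.  Model names are
# examples; substitute whatever your API provider exposes.
aider --architect \
      --model o1-preview \
      --editor-model claude-3-5-sonnet-20241022
```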
My only real problem with o1 is that it's ridiculously expensive, to the point that it makes no sense to use it for actual code. In architect mode, however, you can keep the costs under control as there are far fewer input/output tokens.
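The cost argument is easy to see with a back-of-envelope calculation. The token counts and per-million-token prices below are made up for illustration (check current API pricing), but they show why planning with the expensive model and editing with a cheaper one comes out ahead:

```python
# Back-of-envelope: o1 writing all the code directly, versus o1 only
# producing a short architect plan while a cheaper model does the edits.
# All token counts and prices are hypothetical.

def cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """API cost in dollars, given per-million-token prices."""
    return (input_tokens / 1e6 * in_price_per_m
            + output_tokens / 1e6 * out_price_per_m)

# Hypothetical: o1 handles the whole edit (large context, large diff).
direct = cost_usd(60_000, 8_000, in_price_per_m=15.0, out_price_per_m=60.0)

# Hypothetical: o1 only emits a step list; a cheaper coder model
# consumes the big context and writes the diff.
plan  = cost_usd(6_000, 1_000, in_price_per_m=15.0, out_price_per_m=60.0)
edits = cost_usd(60_000, 8_000, in_price_per_m=3.0, out_price_per_m=15.0)

print(f"direct o1:         ${direct:.2f}")        # $1.38
print(f"architect + coder: ${plan + edits:.2f}")  # $0.45
```

The expensive model only ever sees (and emits) the small planning exchange, which is where the savings come from.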
I haven’t been super impressed with it, and haven’t encountered any practical tasks I wanted to solve with an LLM where o1 worked any better than prompting 4o or Sonnet to use more extensive CoT.
There might be some narrow band of practical problems that other LLMs can't solve but o1 can, but I don't think that really matters for most use cases, especially given how much slower it is.
Day to day, you just don’t really want to prompt a model near the limits of its capabilities, because success quickly becomes a coin flip. So if a model needs five times as long to work, it needs to dramatically expand the range of problems that can be solved reliably.
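The coin-flip point can be made concrete. If you retry until success, expected attempts follow a geometric distribution, so expected wall-clock time is t / p. The numbers below are illustrative, not measurements:

```python
# Expected total time when retrying a model until it succeeds.
# Attempts ~ Geometric(p), so E[attempts] = 1/p and E[time] = t / p.

def expected_time(t_per_attempt, p_success):
    """Expected wall-clock time retrying until the first success."""
    return t_per_attempt / p_success

fast = expected_time(1.0, 0.5)   # quick model at coin-flip reliability
slow = expected_time(5.0, 0.95)  # 5x slower, much more reliable

print(fast)  # 2.0
print(slow)  # ~5.26
```

Even at 95% reliability the 5x-slower model loses on expected time, so its value has to come from reliably solving problems the fast model can't solve at all.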
I think the true edge of CoT models will come from layman usability. While I can easily prompt Claude for examples and then manually modify the code to fill in the gaps, general domain knowledge and technical understanding is absolutely required from the human sitting in front of the screen. With o1, a layman can sit in front of the computer, and ask 'I want a website for tracking deliveries for my webshop and make it pretty', and the model will do it.
So it's not so much about increased capability, but removing the expert human in the loop.
My understanding was that the metric for LMArena is that one answer is “better” than another, for a deliberately 100% subjective definition of better.
My experience has been that typical LLMs will have more "preamble" to what they say, easing the reader (and priming themselves autoregressively) into answers with some relevant introduction of the subject, sometimes justifying the rationale and implications behind things. But for o1, that transient period and the underlying reasoning behind things are part of OpenAI's special sauce, and they deliberately and aggressively take steps to hide them from users.
o1 will get correct answers to hard problems more often than other models (look at the math/coding/hard subsections on the leaderboard, where anecdotal experiences aside, it is #1), and there’s a strong correlation between correctness and a high score in those domains because getting code or math “right” matters more than the justification or explanation. But in more general domains where there isn’t necessarily an objective right or wrong, I know the vibe matters a lot more to me, and that’s something o1 struggles with.