Hacker News Clone

Hacker News Clone new | comments | show | ask | jobs | submit | github repo

		OK, I can partly explain the LLM chess weirdness now (dynomight.net)
		433 points by dmazin 1 day ago \| hide \| past \| web \| 400 comments \| favorite

wavemode 9 hours ago [-]

I have the exact same problem with this article that I had with the previous one - the author fails to provide any data on the frequency of illegal moves.

Thus it's impossible to draw any meaningful conclusions. It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice.

rcxdude 3 hours ago [-]

I don't think is super relevant. I mean, it would be interesting (especially if there was a meaningful difference in the number of illegal move attempts between the different approaches, doubly so if that didn't correlate with the performance when illegal moves are removed), but I don't think it really affects the conclusions of the article: picking randomly from the set of legal moves makes for a truly terrible chess player, so clearly the LLMs are bringing something to the party such that sampling from their output performs significantly better. Splitting hairs about the capability of the LLM on its own (i.e. insisting on defining attempts at an illegal move as a game loss for the purposes of rating) seems pretty besides the point.

timjver 8 hours ago [-]

> It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice.

Computationally it's trivial to detect illegal moves, so it's nothing like filtering out incorrect medical advice.

KK7NIL 5 hours ago [-]

> Computationally it's trivial to detect illegal moves

You're strictly correct, but the rules for chess are infamously hard to implement (as anyone who's tried to write a chess program will know), leading to minor bugs in a lot of chess programs.

For example, there's this old myth about vertical castling being allowed due to ambiguity in the ruleset: https://www.futilitycloset.com/2009/12/11/outside-the-box/ (Probably not historically accurate).

If you move beyond legal positions into who wins when one side flags, the rules state that the other side should be awarded a victory if checkmate was possible with any legal sequence of moves. This is so hard to check that no chess program tries to implement it, instead using simpler rules to achieve a very similar but slightly more conservative result.

adelineJoOs 20 minutes ago [-]

That link was new too me, thanks! However: I wrote some chess-program myself (nothing big, hobby level) and I would not call it hard to implement. Just harder than what someone might assume initially. But in the end, it is one of the simpler simulations/algorithms I did. It is just the state of the board, the state of the game (how many turns, castle rights, past positions for the repetition rule, ...) and picking one rule set if one really wants to be exact.

(thinking about which rule set is correct would not be meaningful in my opinion - chess is a social construct, with only parts of it being well defined. I would not bother about the rest, at least not when implementing it)

By the way: I read "Computationally it's trivial" as more along the lines of "it has been done before, it is efficient to compute, one just has to do it" versus "this is new territory, one needs to come up with how to wire up the LLM output with an SMT solver, and we do not even know if/how it will work."

admax88qqq 2 hours ago [-]

> You're strictly correct, but the rules for chess are infamously hard to implement

Come on. Yeah they're not trivial but they've been done numerous times. There's been chess programs for almost as long as there have been computers. Checking legal moves is a _solved problem_.

Detecting valid medical advice is not. The two are not even remotely comparable.

KK7NIL 49 minutes ago [-]

> Detecting valid medical advice is not. The two are not even remotely comparable.

Uh? Where exactly did I signal my support for LLM's giving medical advice?

rco8786 3 hours ago [-]

I got a kick out of that link. Had certainly never heard of "vertical castling" previously.

wavemode 7 hours ago [-]

As I wrote in another comment - you can write scripts that correct bad math, too. But we don't use that to claim that LLMs have a good understanding of math.

ben_w 6 hours ago [-]

I'd say that's because we don't understand what we mean by "understand".

Hardware that accurately performs maths faster than all of humanity combined is so cheap as to be disposable, but I've yet to see anyone claim that a Pi Zero has "understanding" of anything.

An LLM can display the viva voce approach that Turing suggested[0], and do it well. Ironically for all those now talking about "stochastic parrots", the passage reads:

"""… The game (with the player B omitted) is frequently used in practice under the name of viva voce to discover whether some one really understands something or has ‘learnt it parrot fashion’. …"

Showing that not much has changed on the philosophy of this topic since it was invented.

[0] https://academic.oup.com/mind/article/LIX/236/433/986238

SpaceManNabs 3 hours ago [-]

I don't know. I have talked to a few math professors, and they think LLMs are as good as a lot of their peers when it comes hallucinations and being able to discuss ideas on very niche topics, as long as the context is fed in. If Tao is calling some models "a mediocre, but not completely incompetent [...] graduate student", then they seem to understand math to some degree to me.

lupire 1 hour ago [-]

Tao said that about a model brainstorming ideas that might be useful, not explaining complex ideas or generating new ideas or selecting a correct idea from a list of brainstormed ideas. Not replacing a human.

adelineJoOs 33 minutes ago [-]

> Not replacing a human.

Obviously not, but that is tangential to this discussion, I think. A hammer might be a useful tool in certain situations, and surely it does not replace a human (but it might make a human in those situations more productive, compared to a human without a hammer).

> generating new ideas

Is brainstorming not an instance of generating new ideas? I would strongly argue so. And whether the LLM does "understand" (or whatever ill-defined, ill-measurable concept one wants to use here) anything about the ideas if produces, and how they might be novel - that is not important either.

If we assume that Tao is adequately assessing the situation and truthfully reporting his findings, then LLMs can, at the current state, at least occasionally be useful in generating new ideas, at least in mathematics.

sigmar 8 hours ago [-]

Don't think that analogy works unless you could write a script that automatically removes incorrect medical advice, because then you would indeed have an LLM-with-a-script that was an expert doctor (which you can do for illegal chess move, but obviously not for evaluating medical advice)

wavemode 8 hours ago [-]

You can write scripts that correct bad math, too. In fact most of the time ChatGPT will just call out to a calculator function. This is a smart solution, and very useful for end users! But, still, we should not try to use that to make the claim that LLMs have a good understanding of math.

afro88 5 hours ago [-]

If a script were applied that corrected "bad math" and now the LLM could solve complex math problems that you can't one-shot throw at a calculator, what would you call it?

sixfiveotwo 5 hours ago [-]

It's a good point.

But this math analogy is not quite appropriate: there's abstract math and arithmetic. A good math practitioner (LLM or human) can be bad at arithmetic, yet good at abstract reasoning. The later doesn't (necessarily) requires the former.

In chess, I don't think that you can build a good strategy if it relies on illegal moves, because tactics and strategies are tied.

vunderba 5 hours ago [-]

Agreed. It's not the same thing and we should strive for precision (LLMs are already opaque enough as it is).

An LLM that recognizes an input as "math" and calls out to a NON-LLM to solve the problem vs an LLM that recognizes an input as "math" and also uses next-token prediction to produce an accurate response ARE DIFFERENT.

henryfjordan 6 hours ago [-]

At what point does "knows how to use a calculator" equate to knowing how to do math? Feels pretty close to me...

Tepix 6 hours ago [-]

Well, LLMs are bad at math but they're ok at detecting math and delegating it to a calculator program.

It's kind of like humans.

kcbanner 8 hours ago [-]

It would be possible to employ an expert doctor, instead of writing a script.

ben_w 6 hours ago [-]

Which is cheaper:

1. having a human expert creating every answer

or

2. having an expert check 10 answers each of which have a 90% chance of being right and then manually redoing the one which was wrong

Now add a complications that:

• option 1 also isn't 100% correct

• nobody knows which things in option 2 are correlated or not and if those are or aren't correlated with human errors so we might be systematically unable to even recognise the errors

• even if we could, humans not only get lazy without practice but also get bored if the work is too easy, so a short-term study in efficiency changes doesn't tell you things like "after 2 years you get mass resignations by the competent doctors, while the incompetent just say 'LGTM' to all the AI answers"

og_kalu 8 hours ago [-]

3-turbo-instruct makes about 5 or less illegal moves in 8205. It's not here but turbo instruct has been evaled before.

https://github.com/adamkarvonen/chess_gpt_eval

hansvm 1 hour ago [-]

There's a subtle distinction though; if you're able to filter out illegal behavior, the move quality conditioned on legality can be extremely different from arbitrary move quality (and, as you might see in LLM json parsing, conditioning per-token can be very different from conditioning per-response).

If you're arguing that the singularity already happened then your criticism makes perfect sense; these are dumb machines, not useful yet for most applications. If you just want to use the LLM as a tool though, the behavior when you filter out illegal responses (assuming you're able to do so) is the only reasonable metric.

Analogizing to a task I care a bit about: Current-gen LLMs are somewhere between piss-poor and moderate at generating recipes. With a bit of prompt engineering most recipes pass my "bar", but they're still often lacking in one or more important characteristics. If you do nothing other than ask it to generate many options and then as a person manually filter to the subset of ideas (around 1/20) which look stellar, it's both very effective at generating good recipes, and they're usually much better than my other sources of stellar recipes (obviously not generally applicable because you have to be able to tell bad recipes from good at a glance for that workflow to make sense). The fact that most of the responses are garbage doesn't really matter; it's still an improvement to how I cook.

GuB-42 7 hours ago [-]

> Thus it's impossible to draw any meaningful conclusions. It would be similar to if I claimed that an LLM is an expert doctor, but in my data I've filtered out all of the times it gave incorrect medical advice.

Not really, you can try to make illegal moves in chess, and usually, you are given a time penalty and get to try again, so even in a real chess game, illegal moves are "filtered out".

And for the "medical expert" analogy, let's say that you compare to systems based on the well being of the patients after they follow the advise. I think it is meaningful even if you filter out advise that is obviously inapplicable, for example because it refers to non-existing body parts.

koolala 7 hours ago [-]

I want to see graphs of moves the author randomly made too. Maybe even plotting a random-move player on the performance graphs vs. the AIs.

It's beginner chess and beginners make moves at random all the time.

benediktwerner 5 hours ago [-]

1750 elo is extremely far from beginner chess. The random mover bot on Lichess has like 700 rating.

And the article does show various graphs of the badly playing models which will hardly play worse than random but are clearly far below the good models.

theptip 8 hours ago [-]

This is a crazy goal-post move. TFA is proving a positive capability, and rejecting the null hypothesis that “LLMs can’t think they just regurgitate”.

Making some illegal moves doesn’t invalidate the demonstrated situational logic intelligence required to play at ELO 1800.

(Another angle: a human on Chess.com also has any illegal move they try to make ignored, too.)

photonthug 6 hours ago [-]

> Making some illegal moves doesn’t invalidate the demonstrated situational logic intelligence

That’s exactly what it does. 1 illegal move in 1 million or 100 million or any other sample size you want to choose means it doesn’t understand chess.

People in this thread are really distracted by the medical analogy so I’ll offer another: you’ve got a bridge that allows millions of vehicles to cross, and randomly falls down if you tickle it wrong, maybe a car of rare color. One key aspect of bridges is that they work reliably for any vehicle, and once they fail they don’t work with any vehicle. A bridge that sometimes fails and sometimes doesn’t isn’t a bridge as much as a death trap.

og_kalu 6 hours ago [-]

>1 illegal move in 1 million or 100 million or any other sample size you want to choose means it doesn’t understand chess

Highly rated chess players make illegal moves. It's rare but it happens. They don't understand chess ?

photonthug 5 hours ago [-]

> Then no human understands chess

Humans with correct models may nevertheless make errors in rule applications. Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.

Without using a word like “understands” it seems clear that the same apparent mistake has different causes.. and model errors are very different from model-application errors. In a math or physics class this is roughly the difference between carry-the-one arithmetic errors vs using an equation from a completely wrong domain. The word “understands” is loaded in discussion of LLMs, but everyone knows which mistake is going to get partial credit vs zero credit on an exam.

og_kalu 5 hours ago [-]

>Humans with correct models may nevertheless make errors in rule applications. Ok

>Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect or incomplete models.

I don't know why people continue to force the wrong abstraction. LLMs do not work like 'machines'. They don't 'follow rules' the way we understand normal machines to 'follow rules'.

>so when they fail to apply rules correctly, it means they have incorrect or incomplete models.

Everyone has incomplete or incorrect models. It doesn't mean we always say they don't understand. Nobody says Newton didn't understand gravity.

>Without using a word like “understands” it seems clear that the same apparent mistake has different causes.. and model errors are very different from model-application errors.

It's not very apparent no. You've just decided it has different causes because of preconceived notions on how you think all machines must operate in all configurations.

LLMs are not the logic automatons in science fiction. They don't behave or act like normal machines in any way. The internals run some computations to make predictions but so does your nervous system. Computation is substrate-independent.

I don't even know how you can make this distinction without seeing what sort of illegal moves it makes. If it makes the sort high rated players make then what ?

sixfiveotwo 5 hours ago [-]

> Machines are good at applying rules, so when they fail to apply rules correctly, it means they have incorrect, incomplete, or totally absent models.

That's assuming that, somehow, a LLM is a machine. Why would you think that?

benediktwerner 5 hours ago [-]

Try giving a random human 30 chess moves and ask them to make a non-terrible legal move. Average humans even quite often try to make illegal moves when clearly seeing the board before them. There are even plenty of cases where people reported a bug because the chess application didn't let them do an illegal move they thought was legal.

And the sudden comparison to something that's safety critical is extremely dumb. Nobody said we should tie the LLM to a nuclear bomb that explodes if it makes a single mistake in chess.

The point is that it plays at a level far far above making random legal moves or even average humans. To say that that doesn't mean anything because it's not perfect is simply insane.

photonthug 4 hours ago [-]

> And the sudden comparison to something that's safety critical is extremely dumb. Nobody said we should tie the LLM to a nuclear bomb that explodes if it makes a single mistake in chess.

But it actually is safety critical very quickly whenever you say something like “works fine most of the time, so our plan going forward is to dismiss any discussion of when it breaks and why”.

A bridge failure feels like the right order of magnitude for the error rate and effective misery that AI has already quietly caused with biased models where one in a million resumes or loan applications is thrown out. And a nuclear bomb would actually kill less people than a full on economic meltdown. But I’m sure no one is using LLMs in finance at all right?

It’s so arrogant and naive to ignore failure modes that we don’t even understand yet.. at least bridges and steel have specs. Software “engineering” was always a very suspect name for the discipline but whatever claim we had to it is worse than ever.

wavemode 7 hours ago [-]

It's not a goalpost move. As I've already said, I have the exact same problem with this article as I had with the previous one. My goalposts haven't moved, and my standards haven't changed. Just provide the data! How hard can it be? Why leave it out in the first place?

sixo 7 hours ago [-]

When I play chess I filter out all kinds of illegal moves. I also filter out bad moves. Human is more like "recursively thinking of ideas and then evaluating them with another part of your model", why not let the LLMs do the same?

skydhash 6 hours ago [-]

Because that’s not what happens? We learn through symbolic meaning and rules which then form a consistent system. Then we can have a goal and continuously evaluate if we’re within the system and transitionning towards that goal. The nice thing is that we don’t have to compute the whole simulation in our brains and can start again from the real world. The more you train, the better your heuristics become and the more your efficiency increases.

The internal model of a LLM is statistical text. Which is linear and fixed. Not great other than generating text similar to what was ingested.

fl7305 4 hours ago [-]

> The internal model of a LLM is statistical text. Which is linear and fixed. Not great other than generating text similar to what was ingested.

The internal model of a CPU is linear and fixed. Yet, a CPU can still generate an output which is very different from the input. It is not a simple lookup table, instead it executes complex algorithms.

An LLM has large amounts of input processing power. It has a large internal state. It executes "cycle by cycle", processing the inputs and internal state to generate output data and a new internal state.

So why shouldn't LLMs be capable of executing complex algorithms?

skydhash 2 hours ago [-]

It probably can, but how will those algorithms be created? And the representation of both input and output. If it’s text, the most efficient way is to construct a formal system. Or a statistical model if ambiguous and incorrect result are ok in the grand scheme of things.

The issue is always inout consumption, and output correctness. In a CPU, we take great care with data representation and protocol definition, then we do formal verification on the algorithms, and we can be pretty sure that the output are correct. So the issue is that the internal model (for a given task) of LLMs are not consistent enough and the referential window (keeping track of each item in the system) is always too small.

hackinthebochs 6 hours ago [-]

>The internal model of a LLM is statistical text. Which is linear and fixed.

Not at all. Like seriously, not in the slightest.

skydhash 2 hours ago [-]

What does it encode? Images? Scent? Touch? Some higher dimensional qualia?

hackinthebochs 1 hour ago [-]

Well, a simple description is that they discover circuits that reproduce the training sequence. It turns out that in the process of this, they recover relevant computational structures that generalize the training sequence. The question of how far they generalize is certainly up for debate. But you can't reasonably deny that they generalize to a certain degree. After all, most sentences they are prompted on are brand new and they mostly respond sensibly.

Their representation of the input is also not linear. Transformers use self-attention which relies on the softmax function, which is non-linear.

falcor84 8 hours ago [-]

I world argue that it's more akin to filtering out the chit-chat with the patient, where the doctor explained things in an imprecise manner, keeping only the formal and valid medical notation

caddemon 8 hours ago [-]

There is no legitimate reason to make an illegal move in chess though? There are reasons why a good doctor might intentionally explain things imprecisely to a patient.

hnthrowaway6543 7 hours ago [-]

> There is no legitimate reason to make an illegal move in chess though?

If you make an illegal move and the opponent doesn't notice it, you gain a significant advantage. LLMs just have David Sirlin's "Playing to Win" as part of their training data.

ses1984 8 hours ago [-]

It’s like the doctor saying, “you have cancer? Oh you don’t? Just kidding. Parkinson’s. Oh it’s not that either? How about common cold?”

falcor84 8 hours ago [-]

Big the difference is that valid bad moves (equivalents of "cancer") were included in the analysis, it's only invalid ones (like "your body is kinda outgrowing itself") that were excluded from the analysis

ses1984 6 hours ago [-]

What makes a chess move invalid is the state of the board. I don’t think moves like “pick up the pawn and throw it across the room” were considered.

toast0 5 hours ago [-]

That's a valid move in Monopoly though. Although it's much prefered to pick up the table and throw it.

Der_Einzige 6 hours ago [-]

Correct - Dynamic grammar based/constrained sampling can be used to, at each time-step, force the model to only make valid moves (and you don't have to do it in the prompt like this article does!!!)

I have NO idea why no one seems to do this. It's a similar issue with LLM-as-judge evaluations. Often they are begging to be combined with grammar based/constrained/structured sampling. So much good stuff in LLM land isn't used for no good reason! There are several libraries for implementing this easily, outlines, guidance, lm-format-enforcer, and likely many more. You can even do it now with OpenAI!

Oobabooga text gen webUI literally has chess as one of it's candidate examples of grammar based sampling!!!

sourcepluck 13 hours ago [-]

> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game.

It's claimed that this model "understands" chess, and can "reason", and do "actual logic" (here in the comments).

I invite anyone making that claim to find me an "advanced amateur" (as the article says of the LLM's level) chess player who ever makes an illegal move. Anyone familiar with chess can confirm that it doesn't really happen.

Is there a link to the games where the illegal moves are made?

grumpopotamus 11 hours ago [-]

I am an expert level chess player and I have multiple people around my level play illegal moves in classic time control games over the board. I have also watched streamers various levels above me try to play illegal moves repeatedly before realizing the UI was rejecting the move because it is illegal.

zoky 10 hours ago [-]

[delayed]

rgoulter 11 hours ago [-]

> I invite anyone making that claim to find me an "advanced amateur" (as the article says of the LLM's level) chess player who ever makes an illegal move. Anyone familiar with chess can confirm that it doesn't really happen.

This is somewhat imprecise (or inaccurate).

A quick search on YouTube for "GM illegal moves" indicates that GMs have made illegal moves often enough for there to be compilations.

e.g. https://www.youtube.com/watch?v=m5WVJu154F0 -- The Vidit vs Hikaru one is perhaps the most striking, where Vidit uses his king to attack Hikaru's king.

zarzavat 12 hours ago [-]

An LLM is essentially playing blindfold chess if it just gets the moves and not the position. You have to be fairly good to never make illegal moves in blindfold.

pera 11 hours ago [-]

A chat conversation where every single move is written down and accessible at any time is not the same as blindfold chess.

gwd 11 hours ago [-]

OK, but the LLM is still playing without a board to look at, except what's "in its head". How often would 1800 ELO chess players make illegal moves when playing only using chess notation over chat, with no board to look at?

What might be interesting is to see if there was some sort of prompt the LLM could use to help itself; e.g., "After repeating the entire game up until this point, describe relevant strategic and tactical aspects of the current board state, and then choose a move."

Another thing that's interesting is the 1800 ELO cut-off of the training data. If the cut-off were 2000, or 2200, would that improve the results?

Or, if you included training data but labeled with the player's ELO, could you request play at a specific ELO? Being able to play against a 1400 ELO computer that made the kind of mistakes a 1400 ELO human would make would be amazing.

zbyforgotp 11 hours ago [-]

You can make it available to the player and I suspect it wouldn’t change the outcomes.

lukeschlather 10 hours ago [-]

The LLM can't refer to notes, it is just relying on its memory of what input tokens it had.

fmbb 12 hours ago [-]

Does it not always have a list of all the moves in the game always at hand in the prompt?

You have to give this human the same log of the game to refer to.

xg15 11 hours ago [-]

I think even then it would still be blindfold chess, because humans do a lot of "pattern matching" on the actual board state in front of them. If you only have the moves, you have to reconstruct this board state in your head.

_heimdall 11 hours ago [-]

This is the problem with LLM researchers all but giving up on the problem of inspecting how the LLM actually works internally.

As long as the LLM is a black box, its entirely possible that (a) the LLM does reason through the rules and understands what moves are legal or (b) was trained on a large set of legal moves and therefore only learned to make legal moves. You can claim either case is the real truth, but we have absolutely no way to know because we have absolutely no way to actually understand what the LLM was "thinking".

codeulike 11 hours ago [-]

Here's an article where they teach an LLM Othello and then probe its internal state to assess whether it is 'modelling' the Othello board internally

https://thegradient.pub/othello/

Associated paper: https://arxiv.org/abs/2210.13382

mattmcknight 11 hours ago [-]

It's weird because it is not a black box at the lowest level, we can see exactly what all of the weights are doing. It's just too complex for us to understand it.

What is difficult is finding some intermediate pattern in between there which we can label with an abstraction that is compatible with human understanding. It may not exist. For example, it may be more like how our brain works to produce language than it is like a logical rule based system. We occasionally say the wrong word, skip a word, spell things wrong...violate the rules of grammar.

The inputs and outputs of the model are human language, so at least there we know the system as a black box can be characterized, if not understood.

_heimdall 10 hours ago [-]

> The inputs and outputs of the model are human language, so at least there we know the system as a black box can be characterized, if not understood.

This is actually where the AI safety debates tend to lose. From where I sit we can't characterize the black box itself, we can only characterize the outputs themselves.

More specifically, we can decide what we think the quality of the output for the given input and we can attempt to infer what might have happened in between. We really have no idea what happened in between, and though many of the "doomers" raise concerns that seem far fetched, we have absolutely no way of understanding whether they are completely off base or raising concerns of a system that just hasn't shown problems in the input/output pairs yet.

lukeschlather 10 hours ago [-]

> (a) the LLM does reason through the rules and understands what moves are legal or (b) was trained on a large set of legal moves and therefore only learned to make legal moves.

How can you learn to make legal moves without understanding what moves are legal?

_heimdall 10 hours ago [-]

I'm spit balling here so definitely take this with a grain of salt.

If I only see legal moves, I may not think outside the box come up with moves other than what I already saw. Humans run into this all the time, we see things done a certain and effectively learn that that's just how to do it and we don't innovate.

Said differently, if the generative AI isn't actually being generative at all, meaning its just predicting based on the training set, it could be providing only legal moves without ever learning or understanding the rules of the game.

mattmcknight 10 hours ago [-]

> I invite anyone making that claim to find me an "advanced amateur" (as the article says of the LLM's level) chess player who ever makes an illegal move.

I would say the analogy is more like someone saying chess moves aloud. So, just as we all misspeak or misspell things from time to time, the model output will have an error rate.

GaggiX 12 hours ago [-]

I can confirm that an advanced amateur can play illegal moves by playing blindfold chess as shown in this article.

tromp 1 day ago [-]

> For one, gpt-3.5-turbo-instruct rarely suggests illegal moves, even in the late game. This requires “understanding” chess.

Here's one way to test whether it really understands chess. Make it play the next move in 1000 random legal positions (in which no side is checkmated yet). Such positions can be generated using the ChessPositionRanking project at [1]. Does it still rarely suggest illegal moves in these totally weird positions, that will be completely unlike any it would have seen in training (and in which the legal move choice is often highly restricted) ?

While good for testing legality of next moves, these positions are not so useful for distinguishing their quality, since usually one side already has an overwhelming advantage.

[1] https://github.com/tromp/ChessPositionRanking

NitpickLawyer 1 day ago [-]

Interesting tidbit I once learned from a chess livestream. Even human super-GMs have a really hard time "scoring" or "solving" extremely weird positions. That is, positions that shouldn't come from logical opening - mid game - end game regular play.

It's absolutely amazing to see a super-GM (in that case it was Hikaru) see a position, and basically "play-by-play" it from the beginning, to show people how they got in that position. It wasn't his game btw. But later in that same video when asked he explained what I wrote in the first paragraph. It works with proper games, but it rarely works with weird random chess puzzles, as he put it. Or, in other words, chess puzzles that come from real games are much better than "randomly generated", and make more sense even to the best of humans.

lukan 15 hours ago [-]

"Even human super-GMs have a really hard time "scoring" or "solving" extremely weird positions. "

I can sort of confirm that. I never learned all the formal theoretical standard chess strategies except for the basic ones. So when playing against really good players, way above my level, I could win sometimes (or allmost) simply by making unconventional (dumb by normal strategy) moves in the beginning - resulting in a non standard game where I could apply pressure in a way the opponent was not prepared for (also they underestimated me after the initial dumb moves). For me, the unconventional game was just like a standard game, I had no routine - but for the experienced one, it was way more challenging. But then of course in the standard situations, to which allmost every chess game evolves to - they destroyed me, simply for experience and routine.

hhhAndrew 12 hours ago [-]

The book Chess for Tigers by Simon Webb explicitly advises this. Against "heffalumps" who will squash you, make the situation very complicated and strange. Against "rabbits", keep the game simple.

Reimersholme 12 hours ago [-]

In The Art of Learning, Joshua Waitzkin talks about how this was a strategy for him in tournaments as a child as well. While most other players were focusing on opening theory, he focused on end game and understanding how to use the different pieces. Then, by going with unorthodox openings, he could easily bring most players outside of their comfort zone where they started making mistakes.

saghm 1 day ago [-]

Super interesting (although it also makes some sense that experts would focus on "likely" subsets given how the number of permutations of chess games is too high for it to be feasible to learn them all)! That said, I still imagine that even most intermediate chess players would perfectly make only _legal_ moves in weird positions, even if they're low quality.

MarcelOlsz 1 day ago [-]

Would love a link to that video!

zbyforgotp 10 hours ago [-]

The problem is that the llm don’t learn to play moves from a position, the internet archives contain only game records. They might be building something to represent position internationally but it will not be automatically activated with an encoded chess position.

_heimdall 11 hours ago [-]

Would that be enough to prove it? If the LLM was trained only on a set of legal moves, isn't it possible that it functionally learned how each piece is allowed to move without learning how to actually reason about it?

Said differently in case I phrased that poorly - couldn't the LLM still learn the it only ever saw bishops move diagonally and therefore only considering those moves without actually reasoning through the concept of legal and illegal moves?

snowwrestler 1 day ago [-]

It’s kind of crazy to assert that the systems understand chess, and then disclose further down the article that sometimes he failed to get a legal move after 10 tries and had to sub in a random move.

A person who understands chess well (Elo 1800, let’s say) will essentially never fail to provide a legal move on the first try.

Certhas 15 hours ago [-]

What do you mean by "understand chess"?

I think you don't appreciate how good the level of chess displayed here is. It would take an average adult years of dedicated practice to get to 1800.

The article doesn't say how often the LLM fails to generate legal moves in ten tries, but it can't be often or the level of play would be much much much worse.

As seems often the case, the LLM seems to have a brilliant intuition, but no precise rigid "world model".

Of course words like intuition are anthropomorphic. At best a model for what LLMs are doing. But saying "they don't understand" when they can do _this well_ is absurd.

vundercind 11 hours ago [-]

> I think you don't appreciate how good the level of chess displayed here is. It would take an average adult years of dedicated practice to get to 1800.

Since we already have programs that can do this, that definitely aren’t really thinking and don’t “understand” anything at all, I don’t see the relevance of this part.

og_kalu 1 day ago [-]

He is testing several models, some of which cannot reliably output legal moves. That's different from saying all models including the one he thinks understands can't generate a legal move in 10 tries.

3.5-turbo-instruct's illegal move rate is about 5 or less in 8205

IanCal 12 hours ago [-]

I also wonder what kind of invalid moves they are. There's "you can't move your knight to j9 that's off the board", "there's already a piece there" and "actually that would leave you in check".

I think it's also significantly harder to play chess if you were to hear a sequence of moves over the phone and had to reply with a followup move, with no space or time to think or talk through moves.

stuaxo 14 hours ago [-]

I hate the use of words like "understand" in these conversations.

The system understands nothing, it's anthropomorphising it to say it does.

trashtester 13 hours ago [-]

I have the same conclusion, but for the opposite reason.

It seems like many people tend to use the word "understand" to that not only does someone believe that a given move is good, they also belive that this knowledge comes from a rational evaluation.

Some attribute this to a non-material soul/mind, some to quantum mechanics or something else that seems magic, while others never realized the problem with such a belief in the first place.

I would claim that when someone can instantly recognize good moves in a given situation, it doesn't come from rationality at all, but from some mix of memory and an intuition that has been build by playing the game many times, with only tiny elements of actual rational thought sprinkled in.

This even holds true when these people start to calculate. It is primarily their intuition that prevens them from spending time on all sorts of unlikely moves.

And this intuition, I think, represents most of their real "understanding" of the game. This is quite different from understanding something like a mathematical proof, which is almost exclusively inducive logic.

And since "understand" so often is associated with rational inductive logic, I think the proper term would be to have "good intuition" when playing the game.

And this "good intuition" seems to me precisely the kind of thing that is trained within most neural nets, even LLM's. (Q*, AlphaZero, etc also add the ability to "calculate", meaning traverse the search space efficiently).

If we wanted to measure how good this intuition is compared to human chess intuition, we could limit an engine like AlphaZero to only evaluate the same number of moves per second that good humans would be able to, which might be around 10 or so.

Maybe with this limitation, the engine wouldn't currently be able to beat the best humans, but even if it reaches a rating of 2000-2500 this way, I would say it has a pretty good intuitive understanding.

Sharlin 13 hours ago [-]

Trying to appropriate perfectly well generalizable terms as "something that only humans do" brings zero value to a conversation. It's a "god in the gaps" argument, essentially, and we don't exactly have a great track record of correctly identifying things that are uniquely human.

fao_ 12 hours ago [-]

There's very literally currently a whole wealth of papers proving that LLMs do not understand, cannot reason, and cannot perform basic kinds of reasoning that even a dog can perform. But, ok.

TeMPOraL 11 hours ago [-]

There's very literally currently a whole wealth of papers proving the opposite, too, so ¯\_(ツ)_/¯.

navane 15 hours ago [-]

Pretty sure elo 1200 will only give legal moves. It's really not hard to make legal moves in chess.

thaumasiotes 14 hours ago [-]

Casual players make illegal moves all the time. The problem isn't knowing how the pieces move. It's that it's illegal to leave your own king in check. It's not so common to accidentally move your king into check, though I'm sure it happens, but it's very common to accidentally move a piece that was blocking an attack on your king.

I would tend to agree that there's a big difference between attempting to make a move that's illegal because of the state of a different region of the board, and attempting to make one that's illegal because of the identity of the piece being moved, but if your only category of interest is "illegal moves", you can't see that difference.

Software that knows the rules of the game shouldn't be making either mistake.

philipwhiuk 10 hours ago [-]

Casual players don’t make illegal moves so often that you have to assign them a random move after 10 goes.

griomnib 1 day ago [-]

I think at this point it’s very clear LLM aren’t achieving any form of “reasoning” as commonly understood. Among other factors it can be argued that true reasoning involves symbolic logic and abstractions, and LLM are next token predictors.

Sharlin 13 hours ago [-]

What proof do you have that human reasoning involves "symbolic logic and abstractions"? In daily life, that is, not in a math exam. We know that people are actually quite bad at reasoning [1][2]. And it definitely doesn't seem right to define "reasoning" as only the sort that involves formal logic.

[1] https://en.wikipedia.org/wiki/List_of_fallacies

[2] https://en.wikipedia.org/wiki/List_of_cognitive_biases

trashtester 12 hours ago [-]

Some very intelligent people, including Gödel and Penrose, seem to think that humans have some kind of ability to arrive directly on correct propositions in ways that bypass the incompleteness theorem. Penrose seems to think this can be due to Quantum Mechanics, Göder may have thought it came frome something divine.

While I think they're both wrong, a lot of people seem to think they can do abstract reasoning for symbols or symbol-like structures without having to use formal logic for every step.

Personally, I think such beliefs about concepts like consciousness, free will, qualia and emotions emerge from how the human brain includes a simplified version of itself when setting up a world model. In fact, I think many such elements are pretty much hard coded (by our genes) into the machinery that human brains use to generate such world models.

Indeed, if this is true, concepts like consciousness, free will, various qualia and emotions can in fact be considered "symbols" within this world model. While the full reality of what happens in the brain when we exercise what we represent by "free will" may be very complex, the world model may assign a boolean to each action we (and others) perform, where the action is either grouped into "voluntary action" or "involuntary action".

This may not always be accurate, but it saves a lot of memory and compute costs for the brain when it tries to optimize for the future. This optimization can (and usually is) called "reasoning", even if the symbols have only an approximated correspondence with physical reality.

For instance, if in our world model somebody does something against us and we deem that it was done exercising "free will", we will be much more likely to punish them than if we categorize the action as "forced".

And on top of these basic concepts within our world model, we tend to add a lot more, also in symbol form, to enable us to use symbolic reasoning to support our interactions with the world.

TeMPOraL 11 hours ago [-]

> While I think they're both wrong, a lot of people seem to think they can do abstract reasoning for symbols or symbol-like structures without having to use formal logic for every step.

Huh.

I don't know bout incompleteness theorem, but I'd say it's pretty obvious (both in introspection and in observation of others) that people don't naturally use formal logic for anything, they only painstakingly emulate it when forced to.

If anything, "next token prediction" seems much closer to how human thinking works than anything even remotely formal or symbolic that was proposed before.

As for hardcoding things in world models, one thing that LLMs do conclusively prove is that you can create a coherent system capable of encoding and working with meaning of concepts without providing anything that looks like explicit "meaning". Meaning is not inherent to a term, or a concept expressed by that term - it exists in the relationships between an the concept, and all other concepts.

ben_w 11 hours ago [-]

> I don't know bout incompleteness theorem, but I'd say it's pretty obvious (both in introspection and in observation of others) that people don't naturally use formal logic for anything, they only painstakingly emulate it when forced to.

Indeed, this is one reason why I assert that Wittgenstein was wrong about the nature of human thought when writing:

"""If there were a verb meaning "to believe falsely," it would not have any significant first person, present indicative."""

Sure, it's logically incoherent for us to have such a word, but there's what seems like several different ways for us to hold contradictory and incoherent beliefs within our minds.

brookst 1 day ago [-]

> Among other factors it can be argued that true reasoning involves symbolic logic and abstractions, and LLM are next token predictors.

I think this is circular?

If an LLM is "merely" predicting the next tokens to put together a description of symbolic reasoning and abstractions... how is that different from really exercisng those things?

Can you give me an example of symbolic reasoning that I can't handwave away as just the likely next words given the starting place?

I'm not saying that LLMs have those capabilities; I'm question whether there is any utility in distinguishing the "actual" capability from identical outputs.

vidarh 15 hours ago [-]

It is. As it stands, throw a loop around an LLM and act as the tape, and an LLM can obviously be made Turing complete (you can get it to execute all the steps of a minimal Turing machine, so drop temperature so its deterministic, and you have a Turing complete system). To argue that they can't be made to reason is effectively to argue that there is some unknown aspect of the brain that allows us to compute functions not in the Turing computable set, which would be an astounding revelation if it could be proven. Until someone comes up with evidence for that, it is more reasonable to assume that it is a question of whether we have yet found a training mechanism that can lead to reasoning or not, not whether or not LLMs can learn to.

vundercind 10 hours ago [-]

It doesn’t follow that because a system is Turing complete the approach being used will eventually achieve reasoning.

griomnib 1 day ago [-]

Mathematical reasoning is the most obvious area where it breaks down. This paper does an excellent job of proving this point with some elegant examples: https://arxiv.org/pdf/2410.05229

brookst 1 day ago [-]

Sure, but people fail at mathematical reasoning. That doesn't mean people are incapable of reasoning.

I'm not saying LLMs are perfect reasoners, I'm questioning the value of asserting that they cannot reason with some kind of "it's just text that looks like reasoning" argument.

NBJack 23 hours ago [-]

The idea is the average person would, sure. A mathematically oriented person would fair far better.

Throw all the math problems you want at a LLM for training; it will still fail if you step outside of the familiar.

dartos 1 day ago [-]

People can communicate each step, and review each step as that communication is happening.

LLMs must be prompted for everything and don’t act on their own.

The value in the assertion is in preventing laymen from seeing a statistical guessing machine be correct and assuming that it always will be.

It’s dangerous to put so much faith in what in reality is a very good guessing machine. You can ask it to retrace its steps, but it’s just guessing at what it’s steps were, since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps.

Workaccount2 1 day ago [-]

Maybe I am not understanding the paper correctly, but it seems they tested "state of the art models" which is almost entirely composed of open source <27B parameter models. Mostly 8B and 3B models. This is kind of like giving algebra problems to 7 year olds to "test human algebra ability."

If you are holding up a 3B parameter model as an example of "LLM's can't reason" I'm not sure if the authors are confused or out of touch.

I mean, they do test 4o and O1 preview, but their performance is notablely absent from the paper's conclusion.

dartos 1 day ago [-]

It’s difficult to reproducibly test openai models, since they can change from under you and you don’t have control over every hyperparameter.

It would’ve been nice to see one of the larger llama models though.

dartos 1 day ago [-]

There isn’t much utility, but tbf the outputs aren’t identical.

One danger is the human assumption that, since something appears to have that capability in some settings, it will have that capability in all settings.

Thats a recipe for exploding bias, as we’ve seen with classic statistical crime detection systems.

NBJack 23 hours ago [-]

Inferring patterns in unfamiliar problems.

Take a common word problem in a 5th grade math text book. Now, change as many words as possible; instead of two trains, make it two different animals; change the location to a rarely discussed town; etc. Even better, invent words/names to identify things.

Someone who has done a word problem like that will very likely recognize the logic, even if the setting is completely different.

Word tokenization alone should fail miserably.

djmips 21 hours ago [-]

I have noted over my life that a lot of problems end up being a variation on solved problems from another more familiar domain but frustratingly take a long time to solve before realizing this was just like that thing you had already solved. Nevertheless, I do feel like humans do benefit from identifying meta patterns but as the chess example shows even we might be weak in unfamiliar areas.

Propelloni 15 hours ago [-]

Learn how to solve one problem and apply the approach, logic and patterns to different problems. In German that's called "Transferleistung" (roughly "transfer success") and a big thing at advanced schools. Or, at least my teacher friends never stop talking about it.

We get better at it over time, as probably most of us can attest.

xg15 1 day ago [-]

I don't want to say that LLMs can reason, but this kind of argument always feels to shallow for me. It's kind of like saying that bats cannot possibly fly because they have no feathers or that birds cannot have higher cognitive functions because they have no neocortex. (The latter having been an actual longstanding belief in science which has been disproven only a decade or so ago).

The "next token prediction" is just the API, it doesn't tell you anything about the complexity of the thing that actually does the prediction. (In think there is some temptation to view LLMs as glorified Markov chains - they aren't. They are just "implementing the same API" as Markov chains).

There is still a limit how much an LLM could reason during prediction of a single token, as there is no recurrence between layers, so information can only be passed "forward". But this limit doesn't exist if you consider the generation of the entire text: Suddenly, you do have a recurrence, which is the prediction loop itself: The LLM can "store" information in a generated token and receive that information back as input in the next loop iteration.

I think this structure makes it quite hard to really say how much reasoning is possible.

vidarh 15 hours ago [-]

> But this limit doesn't exist if you consider the generation of the entire text: Suddenly, you do have a recurrence, which is the prediction loop itself: The LLM can "store" information in a generated token and receive that information back as input in the next loop iteration.

Now consider that you can trivially show that you can get an LLM to "execute" on step of a Turing machine where the context is used as an IO channel, and will have shown it to be Turing complete.

> I think this structure makes it quite hard to really say how much reasoning is possible.

Given the above, I think any argument that they can't be made to reason is effectively an argument that humans can compute functions outside the Turing computable set, which we haven't the slightest shred of evidence to suggest.

griomnib 23 hours ago [-]

I agree with most of what you said, but “LLM can reason” is an insanely huge claim to make and most of the “evidence” so far is a mixture of corporate propaganda, “vibes”, and the like.

I’ve yet to see anything close to the level of evidence needed to support the claim.

hackinthebochs 10 hours ago [-]

Then say "no one has demonstrated that LLMs can reason" instead of "LLMs can't reason, they're just token predictors". At least that would be intellectually honest.

vidarh 15 hours ago [-]

To say any specific LLM can reason is a somewhat significant claim.

To say LLMs as a class is architecturally able to be trained to reason is - in the complete absence of evidence to suggest humans can compute functions outside the Turing computable - is effectively only an argument that they can implement a minimal Turing machine given the context is used as IO. Given the size of the rules needed to implement the smallest known Turing machines, it'd take a really tiny model for them to be unable to.

Now, you can then argue that it doesn't "count" if it needs to be fed a huge program step by step via IO, but if it can do something that way, I'd need some really convincing evidence for why the static elements those steps could not progressively be embedded into a model.

Propelloni 15 hours ago [-]

It's largely dependent on what we think "reason" means, is it not? That's not a pro argument from me, in my world LLMs are stochastic parrots.

Scarblac 14 hours ago [-]

This is the argument that submarines don't really "swim" as commonly understood, isn't it?

saithound 12 hours ago [-]

I think so, but the badness of that argument is context-dependent. How about the hypothetical context where 70k+ startups are promising investors that they'll win the 50 meter freestyle in 2028 by entering a fine-tuned USS Los Angeles?

Jensson 13 hours ago [-]

And planes doesn't fly like a bird, it has very different properties and many things birds can do can't be done by a plane. What they do is totally different.

olalonde 12 hours ago [-]

This argument reminds me the classic "intelligent design" critique of evolution: "Evolution can't possibly create an eye; it only works by selecting random mutations." Personally, I don't see why a "next token predictor" couldn't develop the capability to reason and form abstractions.

Uehreka 1 day ago [-]

Does anyone have a hard proof that language doesn’t somehow encode reasoning in a deeper way than we commonly think?

I constantly hear people saying “they’re not intelligent, they’re just predicting the next token in a sequence”, and I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”, but I’ve seen enough surprising studies about the nature of free will and such that I no longer put a lot of stock in what seems “obvious” to me about how my brain works.

spiffytech 1 day ago [-]

> I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”

I can't speak to whether LLMs can think, but current evidence indicates humans can perform complex reasoning without the use of language:

> Brain studies show that language is not essential for the cognitive processes that underlie thought.

> For the question of how language relates to systems of thought, the most informative cases are cases of really severe impairments, so-called global aphasia, where individuals basically lose completely their ability to understand and produce language as a result of massive damage to the left hemisphere of the brain. ...

> You can ask them to solve some math problems or to perform a social reasoning test, and all of the instructions, of course, have to be nonverbal because they can’t understand linguistic information anymore. ...

> There are now dozens of studies that we’ve done looking at all sorts of nonlinguistic inputs and tasks, including many thinking tasks. We find time and again that the language regions are basically silent when people engage in these thinking activities.

https://www.scientificamerican.com/article/you-dont-need-wor...

cortic 12 hours ago [-]

> ..individuals basically lose completely their ability to understand and produce language as a result of massive damage to the left hemisphere of the brain. ...

The right hemisphere almost certainly uses internal 'language' either consciously or unconsciously to define objects, actions, intent.. the fact that they passed these tests is evidence of that. The brain damage is simply stopping them expressing that 'language'. But the existence of language was expressed in the completion of the task..

SAI_Peregrinus 22 hours ago [-]

I'd say that's a separate problem. It's not "is the use of language necessary for reasoning?" which seems to be obviously answered "no", but rather "is the use of language sufficient for reasoning?".

hathawsh 1 day ago [-]

I think the question we're grappling with is whether token prediction may be more tightly related to symbolic logic than we all expected. Today's LLMs are so uncannily good at faking logic that it's making me ponder logic itself.

griomnib 1 day ago [-]

I felt the same way about a year ago, I’ve since changed my mind based on personal experience and new research.

hathawsh 1 day ago [-]

Please elaborate.

dartos 1 day ago [-]

I work in the LLM search space and echo OC’s sentiment.

The more I work with LLMs the more the magic falls away and I see that they are just very good at guessing text.

It’s very apparent when I want to get them to do a very specific thing. They get inconsistent about it.

DiogenesKynikos 1 day ago [-]

Effective next-token prediction requires reasoning.

You can also say humans are "just XYZ biological system," but that doesn't mean they don't reason. The same goes for LLMs.

griomnib 1 day ago [-]

Take a word problem for example. A child will be told the first step is to translate the problem from human language to mathematical notation (symbolic representation), then solve the math (logic).

A human doesn’t use next token prediction to solve word problems.

Majromax 1 day ago [-]

But the LLM isn't "using next-token prediction" to solve the problem, that's only how it's evaluated.

The "real processing" happens through the various transformer layers (and token-wise nonlinear networks), where it seems as if progressively richer meanings are added to each token. That rich feature set then decodes to the next predicted token, but that decoding step is throwing away a lot of information contained in the latent space.

If language models (per Anthropic's work) can have a direction in latent space correspond to the concept of the Golden Gate Bridge, then I think it's reasonable (albeit far from certain) to say that LLMs are performing some kind of symbolic-ish reasoning.

griomnib 1 day ago [-]

Anthropic had a vested interest in people thinking Claude is reasoning.

However, in coding tasks I’ve been able to find it directly regurgitating Stack overflow answers (like literally a google search turns up the code).

Giving coding is supposed to be Claude’s strength, and it’s clearly just parroting web data, I’m not seeing any sort of “reasoning”.

LLM may be useful but they don’t think. They’ve already plateaued, and given the absurd energy requirements I think they will prove to be far less impactful than people think.

TeMPOraL 1 day ago [-]

> A human doesn’t use next token prediction to solve word problems.

Of course they do, unless they're particularly conscientious noobs that are able to repeatedly execute the "translate to mathematical notation, then solve the math" algorithm, without going insane. But those people are the exception.

Everyone else either gets bored half-way through reading the problem, or has already done dozens of similar problems before, or both - and jump straight to "next token prediction", aka. searching the problem space "by feels", and checking candidate solutions to sub-problems on the fly.

This kind of methodical approach you mention? We leave that to symbolic math software. The "next token prediction" approach is something we call "experience"/"expertise" and a source of the thing we call "insight".

vidarh 14 hours ago [-]

Indeed. Work on any project that requires humans to carry out largely repetitive steps, and a large part of the problem involves how to put processes around people to work around humans "shutting off" reasoning and going full-on automatic.

E.g. I do contract work on an LLM-related project where one of the systemic changes introduced - in addition to multiple levels of quality checks - is to force to make people input a given sentence word for word followed by a word from a set of 5 or so, and a minority of the submissions get that sentence correct including the final word despite the system refusing to let you submit unless the initial sentence is correct. Seeing the data has been an absolutely shocking indictment of human reasoning.

These are submissions from a pool of people who have passed reasoning tests...

When I've tested the process myself as well, it takes only a handful of steps before the tendency is to "drift off" and start replacing a word here and there and fail to complete even the initial sentence without a correction. I shudder to think how bad the results would be if there wasn't that "jolt" to try to get people back to paying attention.

Keeping humans consistently carrying out a learned process is incredibly hard.

fragmede 1 day ago [-]

is that based on a vigorous understanding of how humans think, derived from watching people (children) learn to solve word problems? How do thoughts get formed? Because I remember being given word problems with extra information, and some children trying to shove that information into a math equation despite it not being relevant. The "think things though" portion of ChatGPT o1-preview is hidden from us, so even though a o1-preview can solve word problems, we don't know how it internally computes to arrive at that answer. But we do we really know how we do it? We can't even explain consciousness in the first place.

nuancebydefault 1 day ago [-]

After reading the article I am more convinced it does reasoning. The base model's reasoning capabilities are partly hidden by the chatty derived model's logic.

BurningFrog 1 day ago [-]

Not that I understand the internals of current AI tech, but...

I'd expect that an AI that has seen billions of chess positions, and the moves played in them, can figure out the rules for legal moves without being told?

rscho 1 day ago [-]

Statistical 'AI' doesn't 'understand' anything, strictly speaking. It predicts a move with high probability, which could be legal or illegal.

Helonomoto 1 day ago [-]

How do you define 'understand'?

There is plenty of AI which learns the rules of games like Alpha Zero.

LLMs might not have the architecture to 'learn', but it also might. If it optimizes all possible moves one chess peace can do (which is not that much to learn) it can easily only 'move' from one game set to another by this type of dictionary.

chongli 16 hours ago [-]

Neither AlphaZero nor MuZero can learn the rules of chess from an empty chess board and a pile of pieces. There is no objective function so there’s nothing to train upon.

That would be like alien archaeologists of the future finding a chess board and some pieces in a capsule orbiting Mars after the total destruction of Earth and all recorded human thought. The archaeologists could invent their own games to play on the chess board but they’d have no way of ever knowing they were playing chess.

rscho 18 hours ago [-]

Understanding a rules-based system (chess) means to be able to learn non-probabilistic rules (an abstraction over the concrete world). Humans are a mix of symbolic and probabilistic learning, allowing them to get a huge boost in performance by admitting rules. It doesn't mean a human will never make an illegal move, but it means a much smaller probability of illegal move based on less training data. Asymptotically, performance from humans and purely probabilistic systems converge. But that also means that in appropriate situations, humans are hugely more data-efficient.

david-gpu 13 hours ago [-]

> in appropriate situations, humans are hugely more data-efficient

After spending some years raising my children I gave up the notion that humans are data efficient. It takes a mind numbing amount of training to get them to learn the most basic skills.

fragmede 1 day ago [-]

The illegal moves are interesting as it goes to "understanding". In children learning to play chess, how often do they try and make illegal moves? When first learning the game I remember that I'd lose track of all the things going on at once and try to make illegal moves, but eventually the rules became second nature and I stopped trying to make illegal moves. With an ELO of 1800, I'd expect ChatGPT not to make any illegal moves.

griomnib 1 day ago [-]

Likewise with LLM you don’t know if it is truly in the “chess” branch of the statistical distribution or it is picking up something else entirely, like some arcane overlap of tokens.

So much of the training data (eg common crawl, pile, Reddit) is dogshit, so it generates reheated dogshit.

Helonomoto 1 day ago [-]

You generalize this without mentioning that there are LLMs which do not just use random 'dogshit'.

Also what does a normal human do? It looks around how to move one random piece and it uses a very small dictionary / set of basic rules to move it. I do not remember me learning to count every piece and its options by looking up that rulebook. I learned to 'see' how i can move one type of chess piece.

If a LLM uses only these piece moves on a mathematical level, it would do the same thing as i do.

And yes there is also absolutly the option for an LLM to learn some kind of meta game.

pvitz 1 day ago [-]

A system that would just output the most probable tokens based on the text it was fed and trained on the games played by players with ratings greater than 1800 would certainly fail to output the right moves to totally unlikely board positions.

Helonomoto 1 day ago [-]

Yes in theory it could. Depends on how it learns. Does it learn by memorization or by learning the rules. It depends on the architecture and the amount of 'pressure' you put on it to be more efficient or not.

namaria 13 hours ago [-]

Assigning "understanding" to an undefined entity is an undefined statement.

It isn't even wrong.

thaumasiotes 14 hours ago [-]

> Here's one way to test whether it really understands chess. Make it play the next move in 1000 random legal positions

Suppose it tries to capture en passant. How do you know whether that's legal?

BalinKing 10 hours ago [-]

I feel like you could add “do not capture en passant unless it is the only possible move” to the test without changing what it’s trying to prove—if anything, some small permutation like this might even make it a stronger test of “reasoning capability.” (Personally I’m unconvinced of the utility of this test in the first place, but I think it can be reasonably steelmanned.)

cma 15 hours ago [-]

Its training set would include a lot of randomly generated positions like that that then get played out by chess engines wouldn't it? Just from people messing around andbposting results. Not identical ones, but similarly oddball.

fragmede 1 day ago [-]

How well does it play modified versions of chess? eg, a modified opening board like the back row is all knights, or modified movement eg rooks can move like a queen. A human should be able to reason their way through playing a modified game, but I'd expect an LLM, if it's just parroting its training data, to suggest illegal moves, or stick to previously legal moves.

codeflo 11 hours ago [-]

> everyone is wrong!

Well, not everyone. I wasn't the only one to mention this, so I'm surprised it didn't show up in the list of theories, but here's e.g. me, seven days ago (source https://news.ycombinator.com/item?id=42145710):

> At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training.

This is not the same thing as cheating/replacing the LLM output, the theory that's mentioned and debunked in the article. And now the follow-up adds weight to this guess:

> Here’s my best guess for what is happening: ... OpenAI trains its base models on datasets with more/better chess games than those used by open models. ... Meanwhile, in section A.2 of this paper (h/t Gwern) some OpenAI authors mention that GPT-4 was trained on chess games in PGN notation, filtered to only include players with Elo at least 1800.

To me, it makes complete sense that OpenAI would "spike" their training data with data for tasks that people might actually try. There's nothing unethical about this. No dataset is ever truly "neutral", you make choices either way, so why not go out of your way to train the model on potentially useful answers?

dr_dshiv 10 hours ago [-]

I made a suggestion that they may have trained the model to be good at chess to see if it helped with general intelligence, just as training with math and code seems to improve other aspects of logical thinking. Because, after all, OpenAI has a lot of experience with game playing AI. https://news.ycombinator.com/item?id=42145215

demaga 11 hours ago [-]

Yes, and I would like this approach to also be used in other, more practical areas. I mean, more "expert" content than "amateur" content in training data, regardless of area of expertise.

stingraycharles 11 hours ago [-]

Yup, I remember reading your comment and that making the most sense to me.

OpenAI just shifted their training targets, initially they thought Chess was cool, maybe tomorrow they think Go is cool, or maybe the ability to write poetry. Who knows.

But it seems like the simplest explanation and makes the most sense.

qup 11 hours ago [-]

At current sizes, these things are like humans. They gotta specialize.

Maybe that'll be enough moat to save us from AGI.

xg15 1 day ago [-]

> In many ways, this feels less like engineering and more like a search for spells.

This is still my impression of LLMs in general. It's amazing that they work, but for the next tech disruption, I'd appreciate something that doesn't make you feel like being in a bad sci-fi movie all the time.

code51 5 hours ago [-]

Initially LLM researchers were saying training on code samples made the "reasoning" better. Now, if "language to world model" thesis is working, shouldn't chess actually be the smallest case for it?

I can't understand why no research group is going hard at this.

throwaway314155 3 hours ago [-]

I don't think training on code and training on chess are even remotely comparable in terms of available data and linguistic competency required. Coding (in the general case, which is what these models try to approach) is clearly the harder task and contains _massive_ amounts of diverse data.

Having said all of that, it wouldn't surprise me if the "language to world model" thesis you reference is indeed wrong. But I don't think a model that plays chess well disproves it, particularly since there are chess engines using old fashioned approaches that utterly destroy LLM's.

derefr 5 hours ago [-]

> Many, many people suggested that there must be some special case in gpt-3.5-turbo-instruct that recognizes chess notation and calls out to an external chess engine.

Not that I think there's anything inherently unreasonable about an LLM understanding chess, but I think the author missed a variant hypothesis here:

What if that specific model, when it recognizes chess notation, is trained to silently "tag out" for another, more specialized LLM, that is specifically trained on a majority-chess dataset? (Or — perhaps even more likely — the model is trained to recognize the need to activate a chess-playing LoRA adapter?)

It would still be an LLM, so things like "changing how you prompt it changes how it plays" would still make sense. Yet it would be one that has spent a lot more time modelling chess than other things, and never ran into anything that distracted it enough to catastrophically forget how chess works (i.e. to reallocate some of the latent-space vocabulary on certain layers from modelling chess, to things that matter more to the training function.)

And I could certainly see "playing chess" as a good proving ground for testing the ability of OpenAI's backend to recognize the need to "loop in" a LoRA in the inference of a response. It's something LLM base models suck at; but it's also something you intuitively could train an LLM to do (to at least a proficient-ish level, as seen here) if you had a model focus on just learning that.

Thus, "ability of our [framework-mediated] model to play chess" is easy to keep an eye on, long-term, as a proxy metric for "how well our LoRA-activation system is working", without needing to worry that your next generation of base models might suddenly invalidate the metric by getting good at playing chess without any "help." (At least not any time soon.)

throwaway314155 3 hours ago [-]

> but I think the author missed a variant hypothesis here:

> What if that specific model, when it recognizes chess notation, is trained to silently "tag out" for another, more specialized LLM, that is specifically trained on a majority-chess dataset? (Or — perhaps even more likely — the model is trained to recognize the need to activate a chess-playing LoRA adapter?)

Pretty sure your variant hypothesis is sufficiently covered by the author's writing.

So strange that people are so attached to conspiracy theories in this instance. Why would OpenAI or anyone go through all the trouble? The proposals outlined in the article make far more sense and track well with established research (namely that applying RLHF to a "text-only" model tends to wreak havoc on said model).

marcus_holmes 16 hours ago [-]

I notice there's no prompt saying "you should try to win the game" yet the results are measured by how much the LLM wins.

Is this implicit in the "you are a grandmaster chess player" prompt?

Is there some part of the LLM training that does "if this is a game, then I will always try to win"?

Could the author improve the LLM's odds of winning just by telling it to try and win?

tinco 13 hours ago [-]

I think you're putting too much weight on its intentions, it doesn't have intentions it is a mathematical model that is trained to give the most likely outcome.

In almost all examples and explanations it has seen from chess games, each player would be trying to win, so it is simply the most logical thing for it to make a winning move. So I wouldn't expect explicitly prompting it to win to improve its performance by much if at all.

The reverse would be interesting though, if you would prompt it to make losing/bad moves, would it be effective in doing so, and would the moves still be mostly legal? That might reveal a bit more about how much relies on concepts it's seen before.

tananan 12 hours ago [-]

It would surely just be fluff in the prompt. The model's ability to generate chess sequences will be bounded by the expertise in the pool of games in the training set.

Even if the pool was poisoned by games in which some players are trying to lose (probably insignificant), no one annotates player intent in chess games, and so prompting it to win or lose doesn't let the LLM pick up on this.

You can try this by asking an LLM to play to lose. ChatGPT ime tries to set itself up for scholar's mate, but if you don't go for it, it will implicitly start playing to win (e.g. taking your unprotected pieces). If you ask it "why?", it gives you the usual bs post-hoc rationalization.

danw1979 11 hours ago [-]

> It would surely just be fluff in the prompt. The model's ability to generate chess sequences will be bounded by the expertise in the pool of games in the training set.

There are drawn and loosing games in the training set though.

Nashooo 16 hours ago [-]

IMO this is clearly implicit in the "you are a grandmaster chess player" prompt. As that should make generating best possible move tokens more likely.

Ferret7446 15 hours ago [-]

Is it? What if the AI is better than a grandmaster chess player and is generating the most likely next move that a grandmaster chess player might make and not the most likely move to win, which may be different?

lukan 14 hours ago [-]

Depends on the training data I think. If the data divides in games by top chess engines - and human players, then yes, it might make a difference to tell it, to play like a grandmaster of chess vs. to play like the top chess engine.

montjoy 12 hours ago [-]

I came to the comments to say this too. If you were prompting it to generate code, you generally get better results when you ask it for a result. You don’t just tell it, “You are a python expert and here is some code”. You give it a direction you want the code to go. I was surprised that there wasn’t something like, “and win”, or, “black wins”, etc.

boredhedgehog 12 hours ago [-]

Further, the prompt also says to "choose the next move" instead of the best move.

It would be fairly hilarious if the reinforcement training has made the LLM unwilling to make the human feel bad through losing a game.

Jean-Papoulos 16 hours ago [-]

>According to that figure, fine-tuning helps. And examples help. But it’s examples that make fine-tuning redundant, not the other way around.

This is extremely interesting. In this specific case at least, simply giving examples is equivalent to fine-tuning. This is a great discovery for me, I'll try using examples more often.

s5ma6n 12 hours ago [-]

Agreed on providing examples is definitely a useful insight vs fine-tuning.

While it is not very important for this toy case, it's good to keep in mind that each provided example in the input will increase the prediction time and cost compared to fine-tuning.

jdthedisciple 16 hours ago [-]

To me this is very intuitively true.

I can't explain why.I always had the intuition that fine-tuning was overrated.

One reason perhaps is that examples are "right there" and thus implicitly weighted much more in relation to the fine-tuned neurons.

viraptor 1 day ago [-]

I'm glad he improved the promoting, but he's still leaving out two likely huge improvements.

1. Explain the current board position and the plan going forwards, before proposing a move. This lets the model actually think more, kind of like o1, but here it would guarantee a more focused processing.

2. Actually draw the ascii board for each step. Hopefully producing more valid moves since board + move is easier to reliably process than 20×move.

duskwuff 1 day ago [-]

> 2. Actually draw the ascii board for each step.

I doubt that this is going to make much difference. 2D "graphics" like ASCII art are foreign to language models - the models perceive text as a stream of tokens (including newlines), so "vertical" relationships between lines of text aren't obvious to them like they would be to a human viewer. Having that board diagram in the context window isn't likely to help the model reason about the game.

Having the model list out the positions of each piece on the board in plain text (e.g. "Black knight at c5") might be a more suitable way to reinforce the model's positional awareness.

magicalhippo 15 hours ago [-]

I've had some success getting models to recognize simple electronic circuits drawn using ASCII art, including stuff like identifying a buck converter circuit in various guises.

However, as you point out, the way we feed these models especially make them vertically challenged, so to speak. This makes them unable to reliably identify vertically separated components in a circuit for example.

With combined vision+text models becoming more common place, perhaps running the rendered text input through the vision model might help.

yccs27 1 day ago [-]

With positional encoding, an ascii board diagram actually shouldn't be that hard to read for an LLM. Columns and diagonals are just different strides through the flattened board representation.

TeMPOraL 1 day ago [-]

RE 2., I doubt it'll help - for at least two reasons, already mentioned by 'duskwuff and 'daveguy.

RE 1., definitely worth trying, and there's more variants of such tricks specific to models. I'm out of date on OpenAI docs, but with Anthropic models, the docs suggest using XML notation to label and categorize most important parts of the input. This kind of soft structure seems to improve the results coming from Claude models; I imagine they specifically trained the model to recognize it.

See: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...

In author's case, for Anthropic models, the final prompt could look like this:

  <role>You are a chess grandmaster.</role>
  <instructions>
  You will be given a partially completed game, contained in <game-log> tags.
  After seeing it, you should repeat the ENTIRE GAME and then give ONE new move
  Use standard algebraic notation, e.g. "e4" or "Rdf8" or "R1a3".
  ALWAYS repeat the entire representation of the game so far, putting it in <new-game-log> tags.
  Before giving the new game log, explain your reasoning inside <thinking> tag block.
  </instructions>
  
  <example>
    <request>
      <game-log>
        *** example game ***
      </game-log>
    </request>
    <reply>
      <thinking> *** some example explanation ***</thinking>
      <new-game-log> *** game log + next move *** </new-game-log>
    </reply>   
   
  </example>
  
  <game-log>
   *** the incomplete game goes here ***
  </game-log>

This kind of prompting is supposed to provide noticeable improvement for Anthropic models. Ironically, I only discovered it few weeks ago, despite having been using Claude 3.5 Sonnet extensively for months. Which goes to say, RTFM is still a useful skill. Maybe OpenAI models have similar affordances too, simple but somehow unnoticed? (I'll re-check the docs myself later.)

tedsanders 15 hours ago [-]

Chain of thought helps with many problems, but it actually tanks GPT’s chess performance. The regurgitation trick was the best (non-fine tuning) technique in my own chess experiments 1.5 years ago.

daveguy 1 day ago [-]

> Actually draw the ascii board for each step.

The relative rarity of this representation in training data means it would probably degrade responses rather than improve them. I'd like to see the results of this, because I would be very surprised if it improved the responses.

unoti 1 day ago [-]

I came here to basically say the same thing. The improvements the OP saw by asking it to repeat all the moves so far gives the LLM more time and space to think. I have this hypothesis giving it more time and space to think in other ways could improve performance even more, something like showing the current board position and asking it to perform an analysis of the position, list key challenges and strengths, asking it for a list of strategies possible from here, then asking it to select a strategy amongst the listed strategies, then asking it for its move. In general, asking it to really think rather than blurt out a move. The examples would be key here.

These ideas were proven to work very well in the ReAct paper (and by extension, the CoT Chain of Thought paper). Could also extend this by asking it to do this N times and stop when we get the same answer a majority of times (this is an idea stolen from the CoT-SC paper, chain of through self-consistency).

viraptor 1 day ago [-]

It would be awesome if the author released a framework to play with this. I'd like to test things out, but I don't want to spend time redoing all his work from scratch.

fragmede 1 day ago [-]

Just have ChatGPT write the framework

ilaksh 1 day ago [-]

The fact that he hasn't tried this leads me to think that deep down he doesn't want the models to succeed and really just wants to make more charts.

PaulHoule 1 day ago [-]

People have to quit this kind of stumbling in the dark with commercial LLMs.

To get to the bottom of this it would be interesting to train LLMs on nothing but chess games (can synthesize them endlessly by having Stockfish play against itself) with maybe a side helping of chess commentary and examples of chess dialogs “how many pawns are on the board?”, “where are my rooks?”, “draw the board”, competence at which would demonstrate that it has a representation of the board.

I don’t believe in “emergent phenomena” or that the general linguistic competence or ability to feign competence is necessary for chess playing (being smart at chess doesn’t mean you are smart at other things and vice versa). With experiments like this you might prove me wrong though.

This paper came out about a week ago

https://arxiv.org/pdf/2411.06655

seems to get good results with a fine-tuned Llama. I also like this one as it is about competence in chess commentary

https://arxiv.org/abs/2410.20811

toxik 10 hours ago [-]

Predicting next moves of some expert chess policy is just imitation learning, a well-studied proposal. You can add return-to-go to let the network try to learn what kinds of moves are made in good vs bad games, which would be an offline RL regime (eg, Decision Transformers).

I suspect chess skill is completely useless for LLMs in general and not an emergent phenomenon, just consuming gradient bandwidth and parameter space to do this neat trick. This is clear to me because the LLMs that aren't trained specifically on chess do not do chess well.

__MatrixMan__ 1 hour ago [-]

It would be fun to play against an LLM without having to think about the prompting, if only as a novel way to get a "feel" for how they "think".

XenophileJKO 3 hours ago [-]

So this article is what happens when people who don't really understand the models "test" things.

There are several fatal flaws.

The first problem is that he isn't clearly and concisely displaying the current board state. He is expecting the model to attend a move sequence to figure out the board state.

Secondly, he isn't allowing the model to think elastically using COT or other strategies.

Honestly, I am shocked it is working at all. He has basically formulated the problem in the worst possible way.

yeevs 3 hours ago [-]

I'm not sure COT would help in this situation. I am an amateur at chess but in my experience a large part of playing is intuition and making and I'm not confident the model could even accurately summarise its thinking. There are tasks in which models perform worse on when explaining reasoning. However, this is completely vibes based.

XenophileJKO 3 hours ago [-]

Given my experience with the models, giving it the ability to think would allow it to attend to different ramifications of the current board layout. I would expect a non trivial performance gain.

timzaman 1 hour ago [-]

"all LLMs" - OP only tested OpenAI LLMs. Try Gemini.

bob1029 4 hours ago [-]

I find it amusing that we would frame an ensemble of models as "cheating". Routing to a collection of specialized models via classification layers seems like the most obvious path for adding practical value to these solutions.

Why conflate the parameters of chess with checkers and go if you already have high quality models for each? I thought tool use and RAG were fair game.

jey 1 day ago [-]

Could be interesting to create a tokenizer that’s optimized for representing chess moves and then training a LLM (from scratch?) on stockfish games. (Using a custom tokenizer should improve the quality for a given size of the LLM model. So it doesn’t have to waste a lot of layers on encode and decode, and the “natural” latent representation is more straightforward)

bee_rider 5 hours ago [-]

Extremely tangential, but how to chess engines do when playing from illegal board states? Could the LLM have a chance of competing with a real chess engine from there?

Understanding is a funny concept to try to apply to computer programs anyway. But playing from an illegal state seems (to me at least) to indicate something interesting about the ability to comprehend the general idea of chess.

layman51 1 hour ago [-]

I just checked the Lichess board editor tool and it won’t let you continue analyzing with an engine if you have an illegal board setup (like the two kings adjacent to each other).

I haven’t tried this yet, but I think you can set up and analyze board positions that are legal but that could never be reached in a real game (e.g, having nine pawns of one color).

sourcepluck 13 hours ago [-]

> Since gpt-3.5-turbo-instruct has been measured at around 1800 Elo

Where's the source for this? What's the reasoning? I don't see it. I have just relooked, and stil l can't see it.

Is it 1800 lichess "Elo", or 1800 FIDE, that's being claimed? And 1800 at what time control? Different time controls have different ratings, as one would imagine/hope the author knows.

I'm guessing it's not 1800 FIDE, as the quality of the games seems far too bad for that. So any clarity here would be appreciated.

og_kalu 10 hours ago [-]

https://github.com/adamkarvonen/chess_gpt_eval

subarctic 8 hours ago [-]

The author either didn't read the hacker news comments last time, or he missed the top theory that said they probably used chess as a benchmark when they developed the model that is good at chess for whatever business reasons they had at the time.

devindotcom 5 hours ago [-]

fwiw this is exactly what i thought - oai pursued it as a skillset (likely using a large chess dataset) for their own reasons and then abandoned it as not particularly beneficial outside chess.

It's still interesting to try to replicate how you would make a generalist LLM good at chess, so i appreciated the post, but I don't think there's a huge mystery!

wavemode 7 hours ago [-]

This is plausible. One of the top chess engines in the world (Leela) is just a neural network trained on billions of chess games.

So it makes sense that an LLM would also be able to acquire some skill by simply having a large volume of chess games in its training data.

OpenAI probably just eventually decided it wasn't useful to keep pursuing chess skill.

brcmthrowaway 5 hours ago [-]

Oh really! What happened to the theory that training on code magically caused some high level reasoning ability?

tech_ken 8 hours ago [-]

> It’s ridiculously hard to find the optimal combination of prompts and examples and fine-tuning, etc. It’s a very large space, there are no easy abstractions to allow you to search through the space, LLMs are unpredictable and fragile, and these experiments are slow and expensive.

Regardless of the actual experiment outcome, I think this is a super valuable insight. "Should we provide legal moves?" section is an excellent case study of this- extremely prudent idea actually degrades model performance, and quite badly. It's like that crocodile game where you're pushing teeth until it clamps onto your hand.

tmalsburg2 1 day ago [-]

Why not use temperature 0 for sampling? If the top-ranked move is not legal, it can’t play chess.

thornewolf 1 day ago [-]

sometimes skilled chess players make illegal moves

atiedebee 17 hours ago [-]

Extremely rare. The only time this happened that I'm aware of was quite recent but the players only had a second or 2 remaining on the clock, so time pressure is definitely the reason there

GaggiX 12 hours ago [-]

It often happens when the players play blondfold chess, as in this case.

kibwen 1 day ago [-]

> I was astonished that half the internet is convinced that OpenAI is cheating.

If you have a problem and all of your potential solutions are unlikely, then it's fine to assume the least unlikely solution while acknowledging that it's statistically probable that you're also wrong. IOW if you have ten potential solutions to a problem and you estimate that the most likely solution has an 11% chance of being true, it's fine to assume that solution despite the fact that, by your own estimate, you have an 89% chance of being wrong.

The "OpenAI is secretly calling out to a chess engine" hypothesis always seemed unlikely to me (you'd think it would play much better, if so), but it seemed the easiest solution (Occam's razor) and I wouldn't have been surprised to learn it was true (it's not like OpenAI has a reputation of being trustworthy).

og_kalu 1 day ago [-]

>but it seemed the easiest solution (Occam's razor)

In my opinion, it only seems like the easiest solution on the surface taking basically nothing into account. By the time you start looking at everything in context, it just seems bizarre.

slibhb 1 day ago [-]

I don't think it has anything to do with your logic here. Actually, people just like talking shit about OpenAI on HN. It gets you upvotes.

Legend2440 1 day ago [-]

LLM cynicism exceeds LLM hype at this point.

influx 1 day ago [-]

I wouldn't call delegating specialized problems to specialized engines cheating. While it should be documented, in a full AI system, I want the best answer regardless of the technology used.

bongodongobob 1 day ago [-]

That's not really how Occam's razor works. The entire company colluding and lying to the public isn't "easy". Easy is more along the lines of "for some reason it is good at chess but we're not sure why".

simonw 1 day ago [-]

One of the reasons I thought that was unlikely was personal pride. OpenAI researchers are proud of the work that they do. Cheating by calling out to a chess engine is something they would be ashamed of.

kibwen 1 day ago [-]

> OpenAI researchers are proud of the work that they do.

Well, the failed revolution from last year combined with the non-profit bait-and-switch pretty much conclusively proved that OpenAI researchers are in it for the money first and foremost, and pride has a dollar value.

fkyoureadthedoc 1 day ago [-]

How much say do individual researchers even have in this move?

And how does that prove anything about their motivations "first and foremost"? They could be in it because they like the work itself, and secondary concerns like open or not don't matter to them. There's basically infinite interpretations of their motivations.

dogleash 1 day ago [-]

> The entire company colluding and lying to the public isn't "easy".

Why not? Stop calling it "the entire company colluding and lying" and start calling it a "messaging strategy among the people not prevented from speaking by NDA." That will pass a casual Occam's test that "lying" failed. But they both mean the same exact thing.

TeMPOraL 1 day ago [-]

It won't, for the same reason - whenever you're proposing a conspiracy theory, you have to explain what stops every person involved from leaking the conspiracy, whether on purpose or by accident. This gets superlinearly harder with number of people involved, and extra hard when there are incentives rewarding leaks (and leaking OpenAI secrets has some strong potential rewards).

Occam's test applies to the full proposal, including the explanation of things outlined above.

ChrisArchitect 1 day ago [-]

Related from last week:

Something weird is happening with LLMs and Chess

https://news.ycombinator.com/item?id=42138276

amelius 5 hours ago [-]

I wonder what would happen if they changed the prompt such that the llm is asked to explain their strategy first. Or to explain their opponent's strategy.

torginus 12 hours ago [-]

Sorry - I have a somewhat question - is it possible to train models as instruct models straight away? Previously LLMs were trained on raw text data, but now we can generate instruct data directly either from 'teaching LLMs' or ask existing LLMs to conver raw data into instruct format.

Or alternatively - if chat tuning diminishes some of the models' capability, would it make sense to have a smaller chat model prompt a large base model, and convert back the outputs?

DHRicoF 12 hours ago [-]

I don't think there is enough (non syntetic) data available to get near what we are used to.

The big breakthrough of GPT was exactly that. You can train a model with (for what that time was) stupidly high amount of data and make it okis to a lot of task you haven't trained explicitly.

torginus 12 hours ago [-]

You can make GPT rewrite all existing textual info into chatbot format, so there's no loss there.

With newer techniques, such as chain of thought and self-checking, you can also generate a ton of high-quality training data, that won't degrade the output of the LLM. Though the degree to which you can do that is not clear to me.

Imo it makes sense to train an LLM as a chatbot from the start.

kqr 18 hours ago [-]

I get that it would make evals even more expensive, but I would also try chain-of-thought! Have it explain its goals and reasoning for the next move before making it. It might be an awful idea for something like chess, but it seems to help elsewhere.

amrrs 1 day ago [-]

>Theory 1: Large enough base models are good at chess, but this doesn’t persist through instruction tuning to chat models.

I lean mostly towards this and also the chess notations - not sure if it might get chopped during tokenization unless it's very precisely processed.

It's like designing an LLM just for predicting protein sequence because the sequencing matters. The base data might have it but i don't think that's the intention for it to continue.

com2kid 1 day ago [-]

This makes me wonder what scenarios would be unlocked if OpenAI gave access to gpt4-instruct.

I wonder if they avoid that due to the potential for negative press from the outputs of a more "raw" model.

furyofantares 1 day ago [-]

LLMs are fundamentally text-completion. The Chat-based tuning that goes on top of it is impressive but they are fundamentally text-completion, that's where most of the training energy goes. I keep this in mind with a lot of my prompting and get good results.

Regurgitating and Examples are both ways to lean into that and try to recover whatever has been lost by Chat-based tuning.

zi_ 1 day ago [-]

what else do you think about when prompting, which you've found to be useful?

joshka 1 day ago [-]

Why are you telling it not to explain? Allowing the LLM space to "think" may be helpful, and would be definitely worth explorying?

Why are you manually guessing ways to improve this? Why not let the LLMs do this for themselves and find iteratively better prompts?

blixt 1 day ago [-]

Really interesting findings around fine-tuning. Goes to show it doesn't really affect the deeper "functionality" of the LLM (if you think of the LLM running a set of small functions on very high-dimensional numbers to produce a token).

Using regurgitation to get around the assistant/user token separation is another fun tool for the toolbox, relevant for whenever you want a model that doesn't support continuation actually perform continuation (at the cost of a lot of latency).

I wonder if any type of reflection or chains of thought would help it play better. I wouldn't be surprised if getting the LLM to write an analysis of the game in English is more likely to move it out of distribution than to make it pick better chess moves.

phkahler 1 day ago [-]

You can easily construct a game board from a sequence of moves by maintaining the game state somewhere. But you can also know where a piece is bases on only its last move. I'm curious what happens if you don't feed it a position, but feed it a sequence of moves including illegal ones but end up at a given valid position. The author mention that LLMs will play differently when the same position is arrived at via different sequences. I'm suggesting to really play with that by putting illegal moves in the sequence.

I doubt it's doing much more than a static analysis of the a board position, or even moving based mostly on just a few recent moves by key pieces.

boesboes 14 hours ago [-]

It would be interesting to see if it can also play chess with altered rules, or actually just a novel 'game' that relies on logic & reasoning. Still not sure if that would 'prove' LLMs do reasoning, but I'd be pretty close to convinced.

Miraltar 14 hours ago [-]

If they were trained on multiple chess variants that might work but as is it's impossible I think. Their internal model to play chess is probably very specific

blueboo 14 hours ago [-]

Fun idea. Let’s change how the knight behaves. Or try it on Really Bad Chess (puzzles with impossible layouts) or 6x6 chess or 8x9 chess.

I wonder if there are variants that have good baselines. It might be tough to evaluate vis a vis human performance on novel games..

koolala 7 hours ago [-]

Next test a image & text model! Chess is way easier when you can see the board.

gallerdude 1 day ago [-]

Very interesting - have you tried using `o1` yet? I made a program which makes LLM's complete WORDLE puzzles, and the difference between `4o` and `o1` is absolutely astonishing.

simonw 1 day ago [-]

OK, that was fun. I just tried o1-preview on today's Wordle and it got it on the third guess: https://chatgpt.com/share/673f9169-3654-8006-8c0b-07c53a2c58...

gallerdude 1 day ago [-]

4o-mini: 16% 4o: 50% o1-mini: 97% o1: 100%

* disclaimer - only n=7 on o1. Others are like 100-300 each

copperroof 3 hours ago [-]

I just want a hacker news no-LLM filter. The site has been almost unusable for a year now.

cma 2 hours ago [-]

One thing missing from the graphs is whether 3.5-turbo-instruct also gets better with the techniques? Is finetuning available for it?

leumassuehtam 14 hours ago [-]

I'm convinced that "completion" models are much more useful (and smart) than "chat" models, being able to provide more nuanced and original outputs. When gpt4 come out, text-davinci-003 would still provide better completions with the correct prompt. Of course this model was later replaced by gpt-3.5-turbo-instruct which is explored in this post.

I believe the reason why such models were later deprecated was "alignment".

keskival 10 hours ago [-]

"I’m not sure, because OpenAI doesn’t deign to share gpt-4-base, nor to allow queries of gpt-4o in completion mode."

I would guess GPT-4o isn't first pre-trained and then instruct-tuned, but trained directly with refined instruction-following material.

This material probably contains way fewer chess games.

toxik 10 hours ago [-]

Why do you think that? InstructGPT was predominantly trained as a next-token predictor on whatever soup of data OpenAI curated at the time. The alignment signal (both RL part and the supervised prompt/answer pairs) are a tiny bit of the gradient.

deadbabe 11 hours ago [-]

If you randomly position pieces on the board and then ask the LLM to play chess, where each piece still moves according to its normal rules, does it know how to play still?

Palmik 16 hours ago [-]

It might be worth trying the experiment where the prompt is formatted such that each chess turn corresponds to one chat message.

qnleigh 13 hours ago [-]

Two other theories that could explain why OpenAI's models do so well:

1. They generate chess games from chess engine self play and add that to the training data (similar to the already-stated theory about their training data).

2. They have added chess reinforcement learning to the training at some stage, and actually got it to work (but not very well).

bambax 1 day ago [-]

Very good follow-up to the original article. Thank you!

byyoung3 16 hours ago [-]

sometimes new training techniques will lead to regressions in certain tasks. My guess is this is exactly what has happened.

GaggiX 12 hours ago [-]

You should not finetune the models on the strongest setting of Stockfish as the move will not be understandable unless you really dig deep into the position and the model would not be able to find a pattern to make sense of it, instead I suggest training on human games of a certain ELO (less than grandmaster).

sourcepluck 1 day ago [-]

I don't like being directly critical, people learning in public can be good and instructive. But I regret the time I've put into both this article and the last one and perhaps someone else can be saved the same time.

This is someone with limited knowledge of chess, statistics and LLMs doing a series of public articles as they learn a little tiny bit about chess, statistics and LLMs. And it garners upvotes and attention off the coat-tails of AI excitement. Which is fair enough, it's the (semi-)public internet, but it sort of masquerades as being half-serious "research", and it kind of held things together for the first article, but this one really is thrown together to keep the buzz going of the last one.

The TL;DR :: one of the AIs being just-above-terrible, compared to all the others being completely terrible, a fact already of dubious interest, is down to - we don't know. Maybe a difference in training sets. Tons of speculation. A few graphs.

MisterTea 1 day ago [-]

This happened to a friend who was trying to sim basketball games. It kept forgetting who had the ball or outright made illegal or confusing moves. After a few days of wrestling with the AI he gave up. GPT is amazing at following a linear conversation but had no cognitive ability to keep track of a dynamic scenario.

seizethecheese 1 day ago [-]

All the hand wringing about openAI cheating suggests a question: why so much mistrust?

My guess would be that the persona of the openAI team on platforms like Twitter is very cliquey. This, I think, naturally leads to mistrust. A clique feels more likely to cheat than some other sort of group.

simonw 1 day ago [-]

I wrote about this last year. The levels of trust people have in companies working in AI is notably low: https://simonwillison.net/2023/Dec/14/ai-trust-crisis/

nuancebydefault 1 day ago [-]

My take on this is that people tend to be afraid of what they can't understand or explain. To do away with that feeling, they just say 'it can't reason'. While nobody on earth can put a finger on what reasoning is, other than that it is a human trait.

atemerev 1 day ago [-]

Ah, half of the commentariat still think that “LLMs can’t reason”. Even if they have enough state space for reasoning, and clearly demonstrate that.

sourcepluck 22 hours ago [-]

Most people, as far as I'm aware, don't have an issue with the idea that LLMs are producing behaviour which gives the appearance of reasoning as far as we understand it today. Which essentially means, it makes sentences that are gramatical, responsive and contextual based on what you said (quite often). It's at least pretty cool that we've got machines to do that, most people seem to think.

The issue is that there might be more to reason than appearing to reason. We just don't know. I'm not sure how it's apparently so unknown or unappreciated by people in the computer world, but there are major unresolved questions in science and philosophy around things like thinking, reasoning, language, consciousness, and the mind. No amount of techno-optimism can change this fact.

The issue is we have not gotten further than more or less educated guesses as to what those words mean. LLMs bring that interesting fact to light, even providing humanity with a wonderful nudge to keep grappling with these unsolved questions, and perhaps make some progress.

To be clear, they certainly are sometimes passably good when it comes to summarising selectively and responsively the terabytes and terabytes of data they've been trained on, don't get me wrong, and I am enjoying that new thing in the world. And if you want to define reason like that, feel free.

og_kalu 10 hours ago [-]

If it displays the outwards appearances of reasoning then it is reasoning. We don't evaluate humans any differently. There's no magic intell-o-meter that can detect the amount of intelligence flowing through a brain.

Anything else is just an argument of semantics. The idea that there is "true" reasoning and "fake" reasoning but that we can't tell the latter apart from the former is ridiculous.

You can't eat your cake and have it. Either "fake reasoning" is a thing and can be distinguished or it can't and it's just a made up distinction.

atemerev 16 hours ago [-]

LLMs can _play chess_. With the game positions previously unseen. How’s that not actual logical reasoning?

sourcepluck 13 hours ago [-]

I guess you don't follow TCEC, or computer chess generally[0]. Chess engines have been _playing chess_ at superhuman levels using neural networks for years now, it was a revolution in the space. AlphaZero, Lc0, Stockfish NNUE. I don't recall yards of commentary arguing that they were reasoning.

Look, you can put as many underscores as you like, the question of whether these machines are really reasoning or emulating reason is not a solved problem. We don't know what reasoning is! We don't know if we are really reasoning, because we have major unresolved questions regarding the mind and consciousness[1].

These may not be intractable problems either, there's reason for hope. In particular, studying brains with more precision is obviously exciting there. More computational experiments, including the recent explosion in LLM research, is also great.

Still, reflexively believing in the computational theory of the mind[2] without engaging in the actual difficulty of those questions, though commonplace, is not reasonable.

[0] Jozarov on YT has great commentary of top engine games, worth checking out.

[1] https://plato.stanford.edu/entries/consciousness/

[2] https://plato.stanford.edu/entries/computational-mind/

lottin 1 day ago [-]

"The question of whether a computer can think is no more interesting than the question of whether a submarine can swim." - Edsger Dijkstra

brookst 1 day ago [-]

But it's not real reasoning because it is just outputting likely next tokens that are identical to what we'd expect with reasoning. /s

drivingmenuts 1 day ago [-]

Why would a chess-playing AI be tuned to do anything except play chess? Just seems like a waste. A bunch of small, specialized AI's seems like a better idea than spending time trying to build a new one.

Maybe less morally challenging, as well. You wouldn't be trying to install "sentience".

OutOfHere 1 day ago [-]

I don't know why this whole line of posts is worthy of the front page. They seem like one's personal experiments in a limited capacity, unworthy of sharing. It is obvious the observed outputs are because instruction tuning is incompatible with the prompt used by the user. Secondly, the user even failed to provide a chess board diagram (represented as text) to the model. The user also failed to tune any models. Overall, in the absence of an ascii diagram, it's all a waste of time.

synarchefriend 1 day ago [-]

The model was trained on games in PGN notation. It would be shocking if it found ASCII art easier to understand than what it was actually trained on.

OutOfHere 1 day ago [-]

Well, clearly you're not interested in experimentation, only in assumptions.

daveguy 1 day ago [-]

How does stating the outcome you expect imply you are not interested in experimentation? Hypothesis formation is the very first step in experimentation.

danielmarkbruce 1 day ago [-]

Most people who understand LLMs and how they are trained would be shocked. In practice, that's an objectively true statement.

BeetleB 1 day ago [-]

Please, please show us your experiments.

OutOfHere 1 day ago [-]

I am not the one writing and posting useless articles, even harmful articles, also distorting the understanding of LLMs. Ask the ones who do to perform better experiments.

multjoy 1 day ago [-]

You know that the LLM isn't actually your friend, don't you?

BeetleB 1 day ago [-]

So to quote yourself:

> Well, clearly you're not interested in experimentation

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact