this post was submitted on 18 Nov 2023

Another test of logical ability for LLMs? (alien.top)

submitted 2 years ago by laca_komputilulo@alien.top to c/localllama@poweruser.forum

Found this in a children's book of riddles:

Six brothers were spending their time together.

The first brother was reading a book.
The second brother was playing chess.
The third brother was solving a crossword.
The fourth brother was watering the lawn.
The fifth brother was drawing a picture.

Question: what was the sixth brother doing?

I can't get ChatGPT to answer correctly with the usual tricks, even after hinting that it should consider one- and two-person activities and emphasizing the word "together".

After a bunch of CoT turns we arrive at the conclusion that this is an open-ended question and not a riddle :)

After trying 3 times with fresh prompts, I got a correct response once, but when prompted to provide supporting reasoning the model backtracked and started apologizing.

Can't test GPT-4 right now...
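
For anyone who wants to repeat the "fresh prompt, N tries" check more systematically, here is a minimal sketch. Everything in it is an assumption for illustration: `ask_model` is a hypothetical callable standing in for whatever chat client you use, the stub replies are made up, and the keyword check is a crude grader (a reply like "not chess" would be counted as a hit).

```python
from typing import Callable, Iterable

RIDDLE = (
    "Six brothers were spending their time together. "
    "The first brother was reading a book. "
    "The second brother was playing chess. "
    "The third brother was solving a crossword. "
    "The fourth brother was watering the lawn. "
    "The fifth brother was drawing a picture. "
    "Question: what was the sixth brother doing?"
)


def pass_rate(ask_model: Callable[[str], str], prompt: str,
              is_correct: Callable[[str], bool], trials: int = 3) -> float:
    """Send the same prompt `trials` times and return the fraction judged correct."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    hits = sum(1 for _ in range(trials) if is_correct(ask_model(prompt)))
    return hits / trials


def mentions_chess(reply: str) -> bool:
    """Crude grader: counts any reply that mentions chess as correct."""
    return "chess" in reply.lower()


def make_stub(replies: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model: returns canned replies in order."""
    canned = iter(replies)
    return lambda prompt: next(canned)


# Made-up replies, only there so the sketch runs without any API access.
stub = make_stub([
    "He was watering the plants.",
    "The sixth brother was playing chess with the second brother.",
    "There is not enough information to say.",
])
rate = pass_rate(stub, RIDDLE, mentions_chess, trials=3)
print(f"{rate:.2f}")  # 0.33
```

Swapping the stub for a real client call (each call in a fresh conversation) gives a rough hit rate instead of a one-off impression.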

[–] beybladextreme@alien.top 1 points 2 years ago

where do i get to change the temperature settings?

[–] cztothehead@alien.top 1 points 2 years ago

dumb riddle

[–] Be-Kind_Always-Learn@alien.top 1 points 2 years ago (2 children)

This seems, to me, a terrible riddle. Not only can you play chess online, not only can you play chess against a computer, but you can literally play chess alone.

GPT is correct: this is an open-ended question and there's not enough information to actually answer it beyond a clever guess.

[–] laca_komputilulo@alien.top 1 points 2 years ago (1 children)

This is a valid critique about the form of the riddle.

Most riddles rely on out of context prior knowledge to be used as a part of a deductive chain of reasoning. This one is not any different from the question about how many sisters one has that folks in this community use all the time.

Try the same question with badminton instead of chess. Then the same with singles tennis (which 3.5 answers with "the sixth brother was playing doubles tennis" :).

I hope this thread won't descend into deliberation on whether it is possible to play the battleship game alone and how much fun it is :)

[–] Be-Kind_Always-Learn@alien.top 1 points 2 years ago (1 children)

> Most riddles rely on out of context prior knowledge to be used as a part of a deductive chain of reasoning. This one is not any different from the question about how many sisters one has that folks in this community use all the time.

Sure, but they do their best to avoid gaps that make the riddle unsolvable. A riddle like "a girl has as many brothers as sisters, but each brother has half as many brothers as sisters, how many sisters does she have?" has exactly one correct answer.

But the gap in this one is just big enough it's a problem. Like you said, replacing chess with a mandatory two-person experience is much better! (Though still open-ended, because there's no implication they are alone.) The other commenter changed the question to "where are they", which is also a good improvement!

> I hope this thread won't descend into deliberation on whether it is possible to play the battleship game alone and how much fun it is :)

Anything to stop the losing streak!
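
The claim that the brothers-and-sisters riddle quoted above has exactly one answer can be checked by brute force. This is a small sketch, assuming the usual reading of the riddle (the girl does not count herself as a sister, and a brother does not count himself as a brother); the function name and the search bound are made up for illustration.

```python
def solve_sisters_riddle(max_children: int = 20) -> list[tuple[int, int]]:
    """Return every (girls, boys) family that satisfies both clues."""
    solutions = []
    for girls in range(1, max_children + 1):
        for boys in range(1, max_children + 1):
            # Clue 1: the girl has as many brothers as sisters.
            girl_ok = boys == girls - 1
            # Clue 2: each brother has half as many brothers as sisters.
            brother_ok = 2 * (boys - 1) == girls
            if girl_ok and brother_ok:
                solutions.append((girls, boys))
    return solutions


solutions = solve_sisters_riddle()
print(solutions)  # [(4, 3)]
```

Only one family fits: 4 girls and 3 boys, so the girl has 3 sisters. The chess riddle has no comparable constraint that narrows the answer to a single case.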

[–] laca_komputilulo@alien.top 1 points 2 years ago

As usual, "the beauty is in the eye of the beholder".
I think part of the point for these tests is to be able to solve these logical puzzles given all of the richness and ambiguity of natural languages. We've had deterministic theorem solvers capable of solving these problems expressed as a closed set for decades.

That said, please see the capstone version of the prompt in the second update, which removes most of the ambiguity per the points you raised. It also removes the 'singles' aspect of tennis, which consistently trips up in-context reasoning, making the weaker LLMs think it's a solo activity (despite an explicit following clarification).

[–] Hugi_R@alien.top 1 points 2 years ago (2 children)

Open-ended questions are the best for evaluating LLMs, because they require common sense/world knowledge/doxa/human-like behavior.

Saying "I don't know" is just a cop out response. At least it should say something like "It could be X but ...", be a little creative.

Another (less?) open-ended question with the same premise would be "Where are they?" and I expect the answer to be "In a garden".

GPT-4 Turbo (with custom instructions) answers very well: https://chat.openai.com/share/c305568e-f89e-4e71-bb97-79f7710c441a

[–] laca_komputilulo@alien.top 1 points 2 years ago (1 children)

Thank you, bud. Mind trying the same prompt on the cheapo 3.5 model? I suspect it will hit the nail on the head with your custom instructions, given that it was hit and miss for me with my weaker prompting jujitsu.

[–] Hugi_R@alien.top 1 points 2 years ago

3.5 never suspects the 6th brother of playing chess:

https://chat.openai.com/share/b7e6b24d-44db-4abf-9a81-5325f836bca5 (the === are artifacts of the custom system prompt, 3.5 sucks at following it)

I asked it for candidate activities, and it mostly offered different ones. It's weird, I would expect an LLM to list activities that were already mentioned in the conversation. Maybe the repetition penalty is set too high?
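
To make the repetition-penalty guess concrete, here is a rough sketch of the multiplicative penalty used by some open-source samplers: a token that has already appeared has its logit divided by the penalty if positive, or multiplied by it if negative. The logits and token names below are made up, and nothing in the thread shows that ChatGPT uses this scheme (the OpenAI API exposes `frequency_penalty` and `presence_penalty` parameters instead), so treat it purely as an illustration of the effect.

```python
import math


def apply_repetition_penalty(logits: dict[str, float], seen: set[str],
                             penalty: float) -> dict[str, float]:
    """Down-weight tokens that already appeared in the context."""
    adjusted = dict(logits)
    for token in seen:
        if token in adjusted:
            score = adjusted[token]
            # Positive logits shrink, negative logits get more negative.
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Turn logits into probabilities (max subtracted for numerical stability)."""
    top = max(logits.values())
    exps = {token: math.exp(value - top) for token, value in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Made-up next-token logits; "chess" was already mentioned in the conversation.
logits = {"chess": 2.0, "guitar": 1.0, "painting": 0.5}
seen = {"chess"}

baseline = softmax(logits)["chess"]
penalized = softmax(apply_repetition_penalty(logits, seen, 3.0))["chess"]
print(f"no penalty: {baseline:.3f}, penalty 3.0: {penalized:.3f}")
```

With a high enough penalty, an activity already named in the prompt becomes less likely than ones that were never mentioned, which would match the behavior described above.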

[–] Be-Kind_Always-Learn@alien.top 1 points 2 years ago

> Another (less?) open-ended question with the same premise would be "Where are they?" and I expect the answer to be "In a garden".

Perhaps there's a language barrier here, but none of those activities hint at a garden? In my locale, a garden is a small patch used to grow veggies, herbs, and/or flowers. So I would answer this with "their back yard."

This is a much better riddle for children IMO, because it's barely open-ended at all. The original has almost infinite answers without any leaps or tricks, but yours has a very limited domain: a yard/garden. Though if someone were extra clever, the problem space does open back to nearly infinity (if brother 4 is playing a video game).

> Open-ended questions are the best for evaluating LLMs, because they require common sense/world knowledge/doxa/human-like behavior.

For personal testing, that's certainly a valid opinion! But it's not very productive from an objective standpoint because it can't be graded and tests a "gotcha" path of thinking, when we're still focusing on fundamentals like uniform context attention, consistency over time, etc.

[–] Ill-Turnip-6611@alien.top 1 points 2 years ago (1 children)

https://preview.redd.it/snkqlth1c21c1.png?width=1838&format=png&auto=webp&s=599ba2c33531a23b2fd74610822a67871c4468cd

[–] laca_komputilulo@alien.top 1 points 2 years ago

The tuning for storytelling does show :) Surprised it was only a guitar and not an erhu

[–] MeMyself_And_Whateva@alien.top 1 points 2 years ago

I got this deep, but wrong reply from Xwin-LM-70b-v0.1.Q5_K_S:

The sixth brother was most likely engaged in a activity that showcased his intellectual prowess or creativity, as were his other brothers. Since you have not provided any further information about their interests or surroundings, I cannot give a specific answer. However, it is reasonable to assume that the sixth brother was involved in an activity that stimulated his mind and demonstrated his unique talents.

Possible activities for the sixth brother could include:

Playing a musical instrument, like the piano or violin.
Writing a story or poem, expressing his creativity through literature.
Solving mathematical puzzles or problems, showcasing his logical and analytical skills.
Engaging in a debate or intellectual discussion with friends or family members.
Conducting an experiment or working on an inventive project, similar to my own passion for innovation.

In summary, without more information, it is difficult to pinpoint the exact activity of the sixth brother. However, based on the context, it can be inferred that he was likely involved in a mentally engaging or creative pursuit.

[–] PopeSalmon@alien.top 1 points 2 years ago

one way i can tell that ai's really made it is that in just the past couple of years the riddles that befuddle language models went from being "What's the capital of France?" to increasingly being riddles that i can't solve at first glance myself either