this post was submitted on 18 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
Open-ended questions are the best for evaluating LLMs, because they require common sense, world knowledge, doxa, and human-like behavior.
Saying "I don't know" is just a cop out response. At least it should say something like "It could be X but ...", be a little creative.
Another (less?) open-ended question with the same premise would be "Where are they?", and I would expect the answer to be "In a garden".
GPT-4 Turbo (with custom instructions) answers it very well: https://chat.openai.com/share/c305568e-f89e-4e71-bb97-79f7710c441a
Thank you, bud. Mind trying the same prompt on the cheapo 3.5 model? I suspect it will hit the nail on the head with your custom instructions, given that it was hit-and-miss for me with my weaker prompting jiu-jitsu.
3.5 never suspects that the 6th is playing chess.
https://chat.openai.com/share/b7e6b24d-44db-4abf-9a81-5325f836bca5 (the === are artifacts of the custom system prompt; 3.5 sucks at following it)
I asked it for candidate activities, and it mostly offered different ones. It's weird; I would expect an LLM to list activities that were already mentioned in the conversation. Maybe the repetition penalty is set too high?
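For context, here's a minimal sketch of how a CTRL-style repetition penalty works; this is an assumption about the mechanism, since OpenAI doesn't expose whether (or how) their endpoint applies one. The idea is that logits for token ids already present in the context get scaled down, so an overly strong penalty would make the model actively avoid activities already named in the conversation:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             context_ids: torch.Tensor,
                             penalty: float = 1.3) -> torch.Tensor:
    """Penalize every token id that already appears in the context.

    logits:      (vocab_size,) next-token scores
    context_ids: (seq_len,) token ids already in the conversation
    penalty:     > 1.0 discourages repeats; higher values make the model
                 actively avoid words it has already used.
    """
    penalized = logits.clone()
    seen = torch.unique(context_ids)
    scores = penalized[seen]
    # Positive logits are divided, negative logits multiplied, so the
    # penalty always pushes the score for already-seen tokens down.
    penalized[seen] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return penalized
```

If the penalty is set high, tokens like "chess" or "reading" that already occurred in the prompt get suppressed at sampling time, which would match the behavior of the model proposing mostly new activities.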
Perhaps there's a language barrier here, but none of those activities hint at a garden? In my locale, a garden is a small patch used to grow veggies, herbs, and/or flowers. So I would answer this with "their back yard."
This is a much better riddle for children IMO, because it's barely open-ended at all. The original has almost infinite answers without any leaps or tricks, but yours has a very limited domain: a yard/garden. Though if someone were extra clever, the problem space opens back up to nearly infinity (if brother 4 is playing a video game).
For personal testing, that's certainly a valid opinion! But it's not very productive from an objective standpoint because it can't be graded and tests a "gotcha" path of thinking, when we're still focusing on fundamentals like uniform context attention, consistency over time, etc.