this post was submitted on 28 Nov 2023
        
      
      1 points (100.0% liked)
      LocalLLaMA
    11 readers
  
      
      4 users here now
      Community to discuss about Llama, the family of large language models created by Meta AI.
        founded 2 years ago
      
      MODERATORS
      
    you are viewing a single comment's thread
view the rest of the comments
    view the rest of the comments
This can kinda be done, but it’s not as simple as just that. You would need to also infer in many cases the prompt templates. Also many/most benchmarks are designed with untuned models in mind, meaning you typically need to add a system prompt/instructions… doing that also adds complexity because the best prompt for one model is likely different from the next. Also chat vs instruct vs base models in the same eval would be… meh. That said I think there is value in this and working on it as part of my cli tool with some warnings that the results might be less then quantitative