ResearchTLDR

joined 11 months ago
ResearchTLDR@alien.top 1 points 9 months ago (1 children)

I would also be interested in this! Especially if we could create new custom evals and load them in.

 

I want to create a fine-tuning dataset that I can use with several models through Axolotl (like Mistral, Llama 2, and Falcon) to improve the model's ability to extract requested information from a paragraph and output that in JSON format. I am using lm-format-enforcer to force JSON output.

Here is an example of the type of prompt I have been trying so far:

<s>[INST] <<SYS>>

You are a helpful, respectful and honest assistant.

<</SYS>>

Please give me information about this call log. If you cannot find the information you need, put N/A for that field. Any apostrophe must be escaped with a \ character. You MUST answer using the following json schema: {"properties":{"company_name":{"title":"Company Name","type":"string"},"country_or_countries":{"title":"Country Or Countries","type":"string"},"total_amount_due":{"title":"Total Amount Due","type":"integer"},"pending_task":{"title":"Pending Task","type":"boolean"}},"required":["company_name","country_or_countries","total_amount_due","pending_task"],"title":"AnswerFormat","type":"object"} Call log 2023-10-01 11:50:30 talked with Jim at Acme Construction. The job in Toronto is held up waiting for our sign off on the contract. We also need to put in a down payment of $10,000 plus the inspector fee of $750, and a license fee of $1500. The payment must be made by the end of November. I told him I'll call him back when it's done. [/INST]

Expected output:

{"company_name":{"title":"Company Name","value":"Acme Construction"},"country_or_countries":{"title":"Country Or Countries","value":"Canada"},"total_amount_due":{"title":"Total Amount Due","value":"12250"},"pending_task":{"title":"Pending Task","value":"TRUE"}}

I'm looking for some tips from people with more experience in prompt engineering. Here are some of my main questions:

  1. Is this format with SYS and INST a reasonable idea for formatting a fine-tuning dataset? Especially since I want to fine-tune different base models with this same training dataset, do I need to strip some of that formatting out of the fine-tuning dataset to keep it more "model format neutral" and then add the formatting back in somehow for each model? What's the best practice for fine-tuning dataset formatting in this regard?

  2. Should I even include a SYS message at all? Instead of the generic "You are a...assistant", should I use the SYS section to specify that I want the results in JSON format? Or should I just remove the SYS section?

  3. The goal with the fine-tuning dataset is to have at least one thousand examples of prompts and outputs in JSON format, but I want to vary the type of text being analyzed and the JSON schema in the examples to help it generalize better. In other words, I won't always ask it to find the same information in the text and I won't always use the same type of text. Should I also vary the wording of the instructions themselves, such as the part that says to write N/A when the information can't be found?
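
To make question 1 concrete, here is a rough sketch of the "model format neutral" idea: each record stores only the system text, the instruction, and the output, and the model-specific markers are added at render time. The record keys and the `render_*` functions are hypothetical names for illustration, the instruction text is shortened, and this does not use Axolotl's own prompt handling. In a real pipeline the tokenizer often adds the `<s>`/`</s>` tokens itself, so those may need to be left out of the string.

```python
# One model-neutral training record: no <s>, [INST], or <<SYS>> markers stored.
record = {
    "system": "You are a helpful, respectful and honest assistant.",
    "instruction": "Please give me information about this call log. (shortened)",
    "output": '{"company_name":"Acme Construction"}',
}


def render_llama2(rec: dict) -> str:
    """Llama 2 chat layout: system text wrapped in <<SYS>> inside the first [INST]."""
    return (
        f"<s>[INST] <<SYS>>\n{rec['system']}\n<</SYS>>\n\n"
        f"{rec['instruction']} [/INST] {rec['output']} </s>"
    )


def render_mistral(rec: dict) -> str:
    """Mistral instruct layout: no separate system slot.

    Folding the system text into the user turn is an assumption,
    not something the format requires.
    """
    user = rec["instruction"]
    if rec.get("system"):
        user = f"{rec['system']}\n\n{user}"
    return f"<s>[INST] {user} [/INST] {rec['output']}</s>"


# Hypothetical registry; other model families would get their own renderer.
RENDERERS = {"llama2": render_llama2, "mistral": render_mistral}


def render(rec: dict, model_family: str) -> str:
    return RENDERERS[model_family](rec)


for family in RENDERERS:
    print(family, "->", repr(render(record, family)))
```

Keeping the stored data free of markers would mean the same thousand examples can be re-rendered for each base model, and dropping or rewriting the system message (question 2) becomes a change to one field.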

Thanks for reading, and if you need me to clarify anything, don't hesitate to ask.

 

I have Portainer running several self-hosted apps, but I am having a hard time getting a recipe and food planner app like Mealie or Tandoor to work. I'm sure some of you have gotten this to work, and I'm hoping you'll share how you do it.

Some context: I have nothing exposed to the Internet. I just have a Wireguard VPN set up on my phone and laptop for when I want to access my self-hosted apps away from home. All the docs and examples I can find for Mealie and Tandoor assume that I am exposing them to the Internet in some way, and that is not my use case.

I access my self-hosted apps via their IP and port number (and I have them organized in a dashboard for ease of use). I know this is not the most common way, but I know others do this, too. That's why I am asking some of you lovely people to share your Docker Compose files for self-hosting Mealie or Tandoor with no domain name, proxy manager, etc. Just a connection via IP address and port number from the local network.

P.S. I am fine with either using SQLite to avoid needing a separate database, or including something like Postgres inside the same Docker Compose file.