LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

Using Mistral Openorca to create a knowledge graph from a text document (towardsdatascience.com)

submitted 2 years ago by WaterdanceAC@alien.top to c/localllama@poweruser.forum


Inkbot_dev@alien.top 1 points 2 years ago

I'll give you some better examples, just didn't have time right then. Give me a few.

It was trained on a whole bunch of prompts for each task, so it doesn't rely on the exact wording of any one of them to work. Set the task in the meta section to "kg", and the model will respond with a knowledge graph if you ask for one (and sometimes if you don't).

Here are a few of them:

Create a Knowledge Graph based on the provided document.

Create a Knowledge Graph based on the details in the conversation.

Your task is to construct a comprehensive Temporal Knowledge Graph

1. Read and understand the Document: Familiarize yourself with the essential elements, including (but not limited to) ideas, events, people, organizations, impacts, and key points, along with any explicitly mentioned or inferred dates or chronology

	- Pretend the date found in 'Date written' is the current date

	- Create an inferred chronology (e.g., "before the car crash" or "shortly after police arrived") when exact dates or times are not available



2. Create Nodes: Designate each of the essential elements identified earlier as a node with a unique ID using random letters from the greek alphabet. Populate each node with relevant details.



3. Establish and Describe Edges: Determine the relationships between nodes, forming the edges of your knowledge graph. For each edge:

	- Specify the nodes it connects

	- Describe the relationship and its direction

	- Assign a confidence level (high, medium, low) indicating the certainty of the connection
    
    
4. Represent All Nodes: Make sure all nodes are included in the edge list
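For anyone post-processing the output, here is a rough sketch of the structure that prompt describes: nodes with unique Greek-letter IDs, directed edges with a confidence level, and a check that every node shows up in the edge list. This is not the model's actual output format, which isn't shown here. The names `Node`, `Edge`, `make_nodes`, and `validate_graph` are made up for illustration, and it assumes one Greek letter per node, so it tops out at 24 nodes.

```python
import random
from dataclasses import dataclass, field

# Assumption: one Greek letter per node ID (limits a graph to 24 nodes).
GREEK = [
    "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
    "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi",
    "rho", "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega",
]
CONFIDENCE_LEVELS = {"high", "medium", "low"}


@dataclass
class Node:
    id: str                     # unique Greek-letter ID (step 2)
    label: str                  # the essential element this node stands for
    details: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str                 # ID of the node the relationship starts from
    target: str                 # ID of the node the relationship points to
    relationship: str           # description of the relationship (step 3)
    confidence: str             # "high", "medium", or "low"


def make_nodes(elements, rng=None):
    """Give each element a distinct, randomly chosen Greek-letter ID."""
    rng = rng or random.Random()
    ids = rng.sample(GREEK, k=len(elements))  # raises ValueError above 24
    return [Node(id=i, label=e) for i, e in zip(ids, elements)]


def validate_graph(nodes, edges):
    """Return a list of problems; an empty list means the graph passes."""
    node_ids = {n.id for n in nodes}
    problems = []
    seen = set()
    for e in edges:
        if e.confidence not in CONFIDENCE_LEVELS:
            problems.append(f"edge {e.source}->{e.target} has invalid confidence {e.confidence!r}")
        for end in (e.source, e.target):
            if end not in node_ids:
                problems.append(f"edge references unknown node {end}")
            seen.add(end)
    # Step 4: every node must be included in the edge list.
    for missing in sorted(node_ids - seen):
        problems.append(f"node {missing} is not in any edge")
    return problems


# Usage: two events from an inferred chronology, linked by one edge.
example_nodes = make_nodes(["car crash", "police arrival"])
example_edges = [Edge(example_nodes[0].id, example_nodes[1].id,
                      "happened shortly before", "high")]
print(validate_graph(example_nodes, example_edges))
```

A check like this is mostly useful for filtering generated graphs before storing them, since the model is not guaranteed to follow step 4 on every response.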

I haven't noticed a huge difference in the outcome at inference time depending on which prompt is used, but sprinkling in some more detailed instructions helped lower the loss during training.

As for the dataset, I used a little bit of the Dolphin dataset so the model wouldn't lose its usual conversational ability, and a little bit of the SponsorBlock dataset as a seed, which I then improved. The rest is custom; I spent ~$1k or so on API calls creating it. I plan on releasing it at some point, but I want to improve some aspects of it first.

The total size of the dataset I used for training is ~85 MB.