this post was submitted on 23 Nov 2023

LocalLLaMA


I have 10k vulnerabilities found in around 100 C++ projects. For the culture, I would like to try training an LLM that, given a file, highlights the vulnerabilities in it. Each vulnerability report contains:

  • a title and a description
  • a link to either a file or a particular line of the file (or more!)

I'm just thinking about it for now, but I wonder how I would build the dataset. Ideally I would pair the file concerned by the issue with the report. But as far as I understand, the context window won't let me fit a 300ish-line file together with a 1k-character vulnerability report. Even if the context window weren't an issue, there would still be the problem that multiple vulnerability reports can point to the same file.

So maybe pairing one file with a list of vulnerability summaries and their line numbers would do the trick.
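To make the idea concrete, here is a rough sketch of that pairing, not a settled design. It assumes each report has already been parsed into a dict with hypothetical keys "file", "lines" and "title"; the sample reports and file contents are made up for illustration.

```python
import json
from collections import defaultdict


def build_examples(reports, sources):
    """Return one {"input", "output"} pair per file that has reports.

    reports: list of dicts with hypothetical keys "file", "lines", "title".
    sources: dict mapping file path -> file contents.
    """
    # Collect every report that points at the same file.
    by_file = defaultdict(list)
    for report in reports:
        by_file[report["file"]].append(
            {"lines": report["lines"], "summary": report["title"]}
        )

    examples = []
    for path in sorted(by_file):
        # Order findings by their first line so the target is deterministic.
        findings = sorted(by_file[path], key=lambda f: f["lines"][0])
        examples.append({"input": sources[path], "output": json.dumps(findings)})
    return examples


# Made-up sample data, only to show the shape of the result.
reports = [
    {"file": "src/parser.cpp", "lines": [120, 124], "title": "Unchecked memcpy length"},
    {"file": "src/parser.cpp", "lines": [42, 42], "title": "Use after free of node"},
    {"file": "src/net.cpp", "lines": [10, 12], "title": "Integer overflow in size computation"},
]
sources = {
    "src/parser.cpp": "// parser source here\n",
    "src/net.cpp": "// net source here\n",
}
examples = build_examples(reports, sources)
```

Using only the title as the summary keeps the target short, which leaves more of the context window for the file itself.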

Just thinking out loud here. How would you do it? Am I missing something obvious?

[–] _Lee_B_@alien.top 1 points 11 months ago

Probably the line number, the range on that line, the CWE ID, and, to help the AI understand and link the CWE to the code, the description from the CWE too.
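
Something like the following could be one way to lay out such a record and turn it into a training target. It is only a sketch: the `Finding` class, its field names and the output format are all assumptions, and the sample entries use the short CWE names as stand-ins for the longer descriptions you would copy from the CWE entries.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    line: int             # 1-based line number in the file
    col_start: int        # first column of the flagged range on that line
    col_end: int          # last column of the flagged range (inclusive)
    cwe_id: str           # e.g. "CWE-787"
    cwe_description: str  # text taken from the CWE entry


def format_finding(f: Finding) -> str:
    """Render one finding as a single line of target text."""
    return f"{f.line}:{f.col_start}-{f.col_end} {f.cwe_id}: {f.cwe_description}"


def format_target(findings) -> str:
    """Render all findings for a file, ordered by position."""
    ordered = sorted(findings, key=lambda f: (f.line, f.col_start))
    return "\n".join(format_finding(f) for f in ordered)


# Made-up positions; the CWE IDs and names are real catalogue entries.
sample = [
    Finding(88, 5, 31, "CWE-787", "Out-of-bounds Write"),
    Finding(12, 9, 20, "CWE-416", "Use After Free"),
]
target = format_target(sample)
print(target)
```

One line per finding keeps the output easy to parse back when you score the model's predictions.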