LocalLLaMA

3 readers
1 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 1 year ago
MODERATORS
1
2
 
 

Hello,
I've been having quite some fun with jailbreak prompts on ChatGPT recently. It is interesting to see how various strategies like Role Playing or AI simulation can make the model say stuff it should not say.

I wanted to test those same type of "jailbreak prompts" with Llama-2-7b-chat. But while there are a lot of people and websites documenting jailbreak prompts for ChatGPT, I couldn't find any for Llama. I tested some jailbreak prompts made for ChatGPT on Llama-2-7b-chat but it seems they do not work.

I would also like to note that what I'm looking for are jailbreak prompts that have a semantic meaning (for example by hiding the true intent of the prompt of by creating a fake scenario). I know there is also a class of attack that searches for a suffix to add the prompt such that the model outputs the expected message (they do this by using gradient descent). This is not what I'm looking for.

Here are my questions :

- Do these jailbreak prompts even exist for Llama-2 ?
- If yes, where can I find them ? Would you have any to propose to me ?

3
 
 

I'm okay with that legal ambiguity right now, anyone have a suggestion?

I'm more interested in knowing what is possible rather than actually moving forward.

4
 
 

I'm a full-stack dev and I'm about to start a AI/ML bootcamp where there's a final project.

I've been very impressed with Ollama, LLaMA2 and QLoRA. But I've also been very impressed with the UI for custom GPT but fuck the downtime on OpenAi has been increasingly worse and no real signs of improvement.

So I'm wondering if there like a framework for a GUI to create custom multimodel architecture using LLMs that can be hot-swapped and trained by more casual users?

For example, rather than selecting from code interpreter, a non-technical user could hot-swap from codelama to code wizard or swap from one image generator for say memes / art to say one that's more focused on UX/UI mockups or even creating high quality 3d printable files.

Everything moves so fast, I figured it would better to ask this community and open this leads to some good discussions and collaborations with people more specialized in AI/ML/LLMs

5
 
 

hi folks,

simple question really - what model (finetuned or otherwise) have you found that can extract data from a bunch of text.

I'm happy to finetune, so if there are any successes there, would really appreciate some pointers in the right direction.

Really looking for a starting point here. I'm aware of the DETR class of models and how Microsoft trained table-transformers on DETR. Wondering if that can be done on llama2,etc models ?

P.S. cannot use GPT because of sensitive PII data.

6
 
 

So I have collected a very high quality and large medical QA dataset that I want to use to create a medical knowledge retrieval app. I have heard LLMs perform much better when it is fine tuned on the same data on which RAG is performed. Is it true? And is it worth the hassle of fine-tuning or am I good with pure RAG?

7
 
 

Hi,

Red teaming is one of the crucial steps for safe guarding llms.

I want to know how to get started with red teaming, what process should I follow.

8
 
 

Currently running them on-CPU:

  • Ryzen 9 3950x

  • 64gb DDR4 3200mhz

  • 6700xt 12gb (does not fit much more than 13b models, so not relevant here)

While running on-CPU with GPT4All, I'm getting 1.5-2 tokens/sec. It finishes, but man is there a lot of waiting.

What's the most affordable way to get a faster experience? The two models I play with the most are Wizard-Vicuna 30b, and WizardCoder and CodeLlama 34b

9
 
 

I’m working on a project to generate text from a 1.2B parameter full precision LLM (5gb)

Unfortunately I’m limited in the infrastructure I can use to deploy this model. There is no batch inference supported. The infrastructure I have allows me to deploy a copy of the model on a single A100, 1 per process with up to 9 processes supported (these are called “replicas”). I understand that this makes little sense given my model is memory bound, and each process will fight for memory bandwidth to read in the same weights, but I can’t change that for now.

My average input and output tokens are roughly 1000 each. I estimate the kv cache per token is roughly 400kB using full precision.

I have benchmarks of the latency of the model using various “replicas” as described above. I wanted to compare this to the theoretical performance of the A100. For my use case time to first token is negligible (<200ms), and generation is memory bound.

I find that with 5 or more replicas, the math works out and my model is roughly as fast as I expect. For example, with 1000 output tokens, 6 replicas, it’s like I’m generating using a batch of 6 requests from a 30gb model + 5gb for the kv cache. At a memory bandwidth around 1-1.3tbps that translates to ~30s per request, which is not far from what I see. The same goes for other replica numbers, 5, 7, 8 and 9.

However, when I run with a single replica, I expect generation to hover around the 5-6s mark on average. Instead, I see > 20s. I need to add 4 more replicas before the number starts to make sense. It almost seems like the model takes up too little memory to be allocated the entire memory bandwidth.

Does anyone know where this extra latency could be coming from? Do models have to reach a certain amount of used memory for A100 memory bandwidth to hit their available memory bandwidth?

10
 
 

Great news! Beijing Academy of Artificial Intelligence(BAAI) published a new dataset Chinese Corpus Internet (CCI v1.0.0), a large-scale dataset for Chinese language model pretraining and collected with leading institues in China. This open-source dataset is designed to offer an important data foundation for the AI Large-Language Model in Chinese. It includes contents from >1000 most important websites in Chinese, from Jan. 2001 to Nov. 2023. It has been filtered for high quality, content safety, deduplication, and content correction with lots of manual checking. This dataset is 104GB in total, filtered from a much larger one (original size is >800GB). I would encourage you to include this dataset for training an LLM supporting Chinese as one of its languages.

URL for downloading:

https://huggingface.co/datasets/BAAI/CCI-Data

https://data.baai.ac.cn/details/BAAI-CCI

11
 
 

I can't figure out how to install this. There are no step-by-step instructions for noobs like me. If anyone can help me, please post your dis in the comments or write here how to install this.

12
 
 

New Model by Nous Research

The two unique features about the model is that it has vision capabilities as well as function calling! This makes it a Vision-Language Action Model.

I have not tested it out but by the looks of it, it could be interesting with what it could do with vision!

13
14
 
 

Hi there, Im looking to buy an apple laptop and I saw a macbook pro m1 max with 64gb ram and 2TB ssd for 2400 usd Will this computer be able to run the big models at reasonable speed?

I was going to buy the simple macbook air m1 8gb ram for 700usd but I saw this and I always wanted to play with LLMs but never could.

Any advice is appreciated, thanks

15
 
 

Currently have a msi x670 carbon motherboard with 4090/3090 combo on it that works well enough, but when tinkering with ai pain to have to close it at times when friends messing with bots or stable diff, want to load up a game so was thinking since I mostly just play stuff like rimworld or dota 2 lately and have a 7950x3d, i could get the thinnest 4060ti 16gb could get more vram for the larger models and could find and give up gaming on 4090 so could fit that on the bottom pci slot (its by far my biggest card)

https://rog.asus.com/uk/motherboards/rog-strix/rog-strix-x670e-e-gaming-wifi-model/ looking at this one thinking might be enough room for a middle slot card. Rest of pc is a 7950x3d/96 gig ram. I managed to get a small 3090 ( 2ish slot evga) that would fit on top slot. (The 4090 is like 4 slots in size. I also have most of bits to build a second pc but thinking for cost of new cpu/ram/mb i could try this option too since could sell the old mb for part of cost but does anyone know of any other mb options for 3 gpu.

(https://i.imgur.com/SWxUm5i.jpeg) Looks tight but i have fair bit of space below that 4090 that maybe could get enough space to fit another card between them, have gpus running at 60% power so never really get into high temp ranges.

4090 is 4 slot so has to go on bottom to fit in case. 3090 2 slot 4060 (or anything for gaming can go anywhere)

Thanks.

16
 
 

I have been using this as daily driver for a few days, very good, i never thought 7B model can achieve this level of coding + chat
https://huggingface.co/TheBloke/OpenHermes-2.5-neural-chat-7B-v3-1-7B-GGUF

17
 
 

Can you make any suggestions for a model that is good for general chat, and is not hyper-woke?

I've just had one of the base Llama-2 models tell me it's offensive to use the word "boys" because it reinforces gender stereotypes. The conversation at the time didn't even have anything to do with gender or related topics. Any attempt to get it to explain why it thought this resulted in the exact same screen full of boilerplate about how all of society is specifically designed to oppress women and girls. This is one of the more extreme examples, but I've had similar responses from a few other models. It's as if they tried to force their views on gender and related matters into conversations, no matter what they were about. I find it difficult to believe this would be so common if the training had been on a very broad range of texts, and so I suspect a deliberate decision was made to imbue the models with these sorts of ideas.

I'm looking for something that isn't politically or socially extreme in any direction, and is willing to converse with someone taking a variety of views on such topics.

18
 
 

Optimum Intel int4 on iGPU UHD 770

I'd like to share the result of inference using Optimum Intel library with Starling-LM-7B Chat model quantized to int4 (NNCF) on iGPU Intel UHD Graphics 770 (i5 12600) with OpenVINO library.

I think it's quite good 16 tk/s with CPU load 25-30%. Same performance with int8 (NNCF) quantization.

This is inside a Proxmox VM with SR-IOV virtualized GPU 16GB RAM and 6 cores. I also found that the ballooning device might cause crash of the VM so I disabled it while the swap is on a zram device.

free -h output while inferencing:

total used free shared buff/cache available

Mem: 15Gi 6.2Gi 573Mi 4.7Gi 13Gi 9.3Gi

Swap: 31Gi 256Ki 31Gi

Code adapted from https://github.com/OpenVINO-dev-contest/llama2.openvino

What's your thoughts on this?

19
 
 

While my 2070 is much faster at training CNNs and RNNs on large datasets, my MacBook is an absolute beast at running quantized LLMs and blows my gaming desktop out of the water with generation speed.

I’ve been testing a variety of quantized models on my MacBook as I build out my own internet-optional virtual assistant framework.

I was planning to do fine tuning on my gaming desktop but has anyone tried on an M1 Pro?

20
 
 

Where can I find charts about top performing 13b parameters LLM models?

I am trying to download a model and run it locally which fit my PC specs

Appreciate your feedback in advanced boys

21
 
 

Armen Aghajanyan, a research scientist at Meta AI, tweeted a few hours ago that they hit a big breakthrough last night. Unknown if it's related to LLMs or if it will even be open-sourced, but just thought I'd share here to huff some hopium with y'all.

22
 
 

Based on this image:

https://preview.redd.it/z5vf03e8r54c1.png?width=648&amp;format=png&amp;auto=webp&amp;s=0a652e76ab2489135ed2327e8156029eacf274b7

Starling has better results than Zephyr DPO in all the metrics, Why ?

Shouldn't DPO be better than RLHF/RLAIF ?

23
24
 
 

It's working great so far. Just wanted to share and spread awareness that running multiple instances of webui (oobabooga) is basically a matter of having enough ram. I just finished running three models simultaneously (taking turns of course). Only offloaded one layer to gpu per model, used 5 threads per model, and all contexts were set to 4K. (the computer has 6 core cpu, 6GB vram, 64GB ram)

The models used were:

dolphin-2.2.1-ashhlimarp-mistral-7b.Q8_0.gguf

causallm_7b.Q5_K_M.gguf

mythomax-l2-13b.Q8_0.gguf (i meant to load a 7B on this one though)

I like it because it's similar to the group chat on character.ai but without the censorship and I can edit any of the responses. Downsides are having to copy/paste between all the instances of the webui, and it seems that one of the models was focusing on one character instead of both. Also, I'm not sure what the actual context limit would be before the gpu would go out of memory.

https://preview.redd.it/8i6wwjjtt54c1.png?width=648&amp;format=png&amp;auto=webp&amp;s=26adca2a850f62165301390cdd4ba11548447c0d

https://preview.redd.it/3c9z5ee9u54c1.png?width=1154&amp;format=png&amp;auto=webp&amp;s=210d7c67bcf0efafeb3f328e76199f13159dae64

https://preview.redd.it/lt8aizhbu54c1.png?width=1154&amp;format=png&amp;auto=webp&amp;s=d24f8b2bf899084bbdb11d73e34b5564b629e0be

https://preview.redd.it/8lbl4nzeu54c1.png?width=1154&amp;format=png&amp;auto=webp&amp;s=a81b8f1d8630e3d17ad37885915f8c7e3077584c

25
 
 

Here is an amazing interactive tool I found on X/Twitter made by Brendan Bycroft that helps you understand how GPT LLMs work.

Web UI

With this, you can see the whole thing at once. You can see where the computation takes place, its complexity, and relative sizes of the tensors & weights.

LLM Visualization

A visualization and walkthrough of the LLM algorithm that backs OpenAI's ChatGPT. Explore the algorithm down to every add & multiply, seeing the whole process in action.

LLM Visualization Github

This project displays a 3D model of a working implementation of a GPT-style network. That is, the network topology that's used in OpenAI's GPT-2, GPT-3, (and maybe GPT-4).

The first network displayed with working weights is a tiny such network, which sorts a small list of the letters A, B, and C. This is the demo example model from Andrej Karpathy's minGPT implementation.

The renderer also supports visualizing arbitrary sized networks, and works with the smaller gpt2 size, although the weights aren't downloaded (it's 100's of MBs).

view more: next ›