LocalLLaMA

1

OpenAI's nightmare: Deepseek R1 on a Raspberry Pi (www.youtube.com)

submitted 3 weeks ago by rglullis to c/localllama@poweruser.forum

0 comments fedilink

2

1

Build a Fully Local RAG App With PostgreSQL, Mistral, and Ollama (www.timescale.com)

submitted 4 months ago by rglullis to c/localllama@poweruser.forum

0 comments fedilink

3

1

Jailbreak prompts for Llama ? (alien.top)

submitted 1 year ago by zerokerim@alien.top to c/localllama@poweruser.forum

2 comments fedilink

Hello,
I've been having quite some fun with jailbreak prompts on ChatGPT recently. It is interesting to see how various strategies like Role Playing or AI simulation can make the model say stuff it should not say.

I wanted to test those same type of "jailbreak prompts" with Llama-2-7b-chat. But while there are a lot of people and websites documenting jailbreak prompts for ChatGPT, I couldn't find any for Llama. I tested some jailbreak prompts made for ChatGPT on Llama-2-7b-chat but it seems they do not work.

I would also like to note that what I'm looking for are jailbreak prompts that have a semantic meaning (for example by hiding the true intent of the prompt of by creating a fake scenario). I know there is also a class of attack that searches for a suffix to add the prompt such that the model outputs the expected message (they do this by using gradient descent). This is not what I'm looking for.

Here are my questions :

- Do these jailbreak prompts even exist for Llama-2 ?
- If yes, where can I find them ? Would you have any to propose to me ?

4

1

Best open/commercial model that is tuned on ChatGPT4? (alien.top)

submitted 1 year ago by learning_hedonism@alien.top to c/localllama@poweruser.forum

1 comments fedilink

I'm okay with that legal ambiguity right now, anyone have a suggestion?

I'm more interested in knowing what is possible rather than actually moving forward.

5

1

Just curious, are there any GUIs for a creating LLaMA2 architecture similar to how OpenAi does "custom GPTs"? (alien.top)

submitted 1 year ago by LivingDracula@alien.top to c/localllama@poweruser.forum

2 comments fedilink

I'm a full-stack dev and I'm about to start a AI/ML bootcamp where there's a final project.

I've been very impressed with Ollama, LLaMA2 and QLoRA. But I've also been very impressed with the UI for custom GPT but fuck the downtime on OpenAi has been increasingly worse and no real signs of improvement.

So I'm wondering if there like a framework for a GUI to create custom multimodel architecture using LLMs that can be hot-swapped and trained by more casual users?

For example, rather than selecting from code interpreter, a non-technical user could hot-swap from codelama to code wizard or swap from one image generator for say memes / art to say one that's more focused on UX/UI mockups or even creating high quality 3d printable files.

Everything moves so fast, I figured it would better to ask this community and open this leads to some good discussions and collaborations with people more specialized in AI/ML/LLMs

6

1

which is the best model (finetuned or base) to extract structured data from a bunch of text? (alien.top)

submitted 1 year ago by sandys1@alien.top to c/localllama@poweruser.forum

6 comments fedilink

hi folks,

simple question really - what model (finetuned or otherwise) have you found that can extract data from a bunch of text.

I'm happy to finetune, so if there are any successes there, would really appreciate some pointers in the right direction.

Really looking for a starting point here. I'm aware of the DETR class of models and how Microsoft trained table-transformers on DETR. Wondering if that can be done on llama2,etc models ?

P.S. cannot use GPT because of sensitive PII data.

7

1

Is RAG better with fine tuning on same data or pure RAG FTW? (alien.top)

submitted 1 year ago by Shoddy_Vegetable_115@alien.top to c/localllama@poweruser.forum

1 comments fedilink

So I have collected a very high quality and large medical QA dataset that I want to use to create a medical knowledge retrieval app. I have heard LLMs perform much better when it is fine tuned on the same data on which RAG is performed. Is it true? And is it worth the hassle of fine-tuning or am I good with pure RAG?

8

1

How to start red teaming on llms ? (alien.top)

submitted 1 year ago by kadhi_chawal2@alien.top to c/localllama@poweruser.forum

1 comments fedilink

Hi,

Red teaming is one of the crucial steps for safe guarding llms.

I want to know how to get started with red teaming, what process should I follow.

9

1

Cheapest GPU/Way to run 30b or 34b "Code" Models with GPT4ALL? (alien.top)

submitted 1 year ago by ForsookComparison@alien.top to c/localllama@poweruser.forum

1 comments fedilink

Currently running them on-CPU:

Ryzen 9 3950x
64gb DDR4 3200mhz
6700xt 12gb (does not fit much more than 13b models, so not relevant here)

While running on-CPU with GPT4All, I'm getting 1.5-2 tokens/sec. It finishes, but man is there a lot of waiting.

What's the most affordable way to get a faster experience? The two models I play with the most are Wizard-Vicuna 30b, and WizardCoder and CodeLlama 34b

10

1

A100 inference is much slower than expected with small batch size (alien.top)

submitted 1 year ago by currytrash97@alien.top to c/localllama@poweruser.forum

2 comments fedilink

I’m working on a project to generate text from a 1.2B parameter full precision LLM (5gb)

Unfortunately I’m limited in the infrastructure I can use to deploy this model. There is no batch inference supported. The infrastructure I have allows me to deploy a copy of the model on a single A100, 1 per process with up to 9 processes supported (these are called “replicas”). I understand that this makes little sense given my model is memory bound, and each process will fight for memory bandwidth to read in the same weights, but I can’t change that for now.

My average input and output tokens are roughly 1000 each. I estimate the kv cache per token is roughly 400kB using full precision.

I have benchmarks of the latency of the model using various “replicas” as described above. I wanted to compare this to the theoretical performance of the A100. For my use case time to first token is negligible (<200ms), and generation is memory bound.

I find that with 5 or more replicas, the math works out and my model is roughly as fast as I expect. For example, with 1000 output tokens, 6 replicas, it’s like I’m generating using a batch of 6 requests from a 30gb model + 5gb for the kv cache. At a memory bandwidth around 1-1.3tbps that translates to ~30s per request, which is not far from what I see. The same goes for other replica numbers, 5, 7, 8 and 9.

However, when I run with a single replica, I expect generation to hover around the 5-6s mark on average. Instead, I see > 20s. I need to add 4 more replicas before the number starts to make sense. It almost seems like the model takes up too little memory to be allocated the entire memory bandwidth.

Does anyone know where this extra latency could be coming from? Do models have to reach a certain amount of used memory for A100 memory bandwidth to hit their available memory bandwidth?

11

1

A new dataset for LLM training has been released! (alien.top)

submitted 1 year ago by Grouchy-Mail-2091@alien.top to c/localllama@poweruser.forum

2 comments fedilink

Great news! Beijing Academy of Artificial Intelligence(BAAI) published a new dataset Chinese Corpus Internet (CCI v1.0.0), a large-scale dataset for Chinese language model pretraining and collected with leading institues in China. This open-source dataset is designed to offer an important data foundation for the AI Large-Language Model in Chinese. It includes contents from >1000 most important websites in Chinese, from Jan. 2001 to Nov. 2023. It has been filtered for high quality, content safety, deduplication, and content correction with lots of manual checking. This dataset is 104GB in total, filtered from a much larger one (original size is >800GB). I would encourage you to include this dataset for training an LLM supporting Chinese as one of its languages.

URL for downloading:

https://huggingface.co/datasets/BAAI/CCI-Data

https://data.baai.ac.cn/details/BAAI-CCI

12

1

How to install llama.cpp version for Qwen72B? (alien.top)

submitted 1 year ago by Secret_Joke_2262@alien.top to c/localllama@poweruser.forum

1 comments fedilink

I can't figure out how to install this. There are no step-by-step instructions for noobs like me. If anyone can help me, please post your dis in the comments or write here how to install this.

13

1

Nous-Hermes-2-Vision (alien.top)

submitted 1 year ago by Nix_The_Furry@alien.top to c/localllama@poweruser.forum

1 comments fedilink

New Model by Nous Research

The two unique features about the model is that it has vision capabilities as well as function calling! This makes it a Vision-Language Action Model.

I have not tested it out but by the looks of it, it could be interesting with what it could do with vision!

14

1

QuIP#: SOTA 2-bit quantization method, now implemented in text-generation-webui (experimental) (github.com)

submitted 1 year ago by oobabooga4@alien.top to c/localllama@poweruser.forum

6 comments fedilink

15

1

Is m1 max macbook pro worth? (alien.top)

submitted 1 year ago by PuzzledWhereas991@alien.top to c/localllama@poweruser.forum

3 comments fedilink

Hi there, Im looking to buy an apple laptop and I saw a macbook pro m1 max with 64gb ram and 2TB ssd for 2400 usd Will this computer be able to run the big models at reasonable speed?

I was going to buy the simple macbook air m1 8gb ram for 700usd but I saw this and I always wanted to play with LLMs but never could.

Any advice is appreciated, thanks

16

1

Anyone running 3 gpus? Looking for advice on best x670 that might be able to slot a third card on. (alien.top)

submitted 1 year ago by fluffywuffie90210@alien.top to c/localllama@poweruser.forum

3 comments fedilink

Currently have a msi x670 carbon motherboard with 4090/3090 combo on it that works well enough, but when tinkering with ai pain to have to close it at times when friends messing with bots or stable diff, want to load up a game so was thinking since I mostly just play stuff like rimworld or dota 2 lately and have a 7950x3d, i could get the thinnest 4060ti 16gb could get more vram for the larger models and could find and give up gaming on 4090 so could fit that on the bottom pci slot (its by far my biggest card)

https://rog.asus.com/uk/motherboards/rog-strix/rog-strix-x670e-e-gaming-wifi-model/ looking at this one thinking might be enough room for a middle slot card. Rest of pc is a 7950x3d/96 gig ram. I managed to get a small 3090 ( 2ish slot evga) that would fit on top slot. (The 4090 is like 4 slots in size. I also have most of bits to build a second pc but thinking for cost of new cpu/ram/mb i could try this option too since could sell the old mb for part of cost but does anyone know of any other mb options for 3 gpu.

(https://i.imgur.com/SWxUm5i.jpeg) Looks tight but i have fair bit of space below that 4090 that maybe could get enough space to fit another card between them, have gpus running at 60% power so never really get into high temp ranges.

4090 is 4 slot so has to go on bottom to fit in case. 3090 2 slot 4060 (or anything for gaming can go anywhere)

Thanks.

17

1

This model is extremely good (alien.top)

submitted 1 year ago by noobgolang@alien.top to c/localllama@poweruser.forum

15 comments fedilink

I have been using this as daily driver for a few days, very good, i never thought 7B model can achieve this level of coding + chat
https://huggingface.co/TheBloke/OpenHermes-2.5-neural-chat-7B-v3-1-7B-GGUF

18

1

Politically balanced chat model? (alien.top)

submitted 1 year ago by Clark9292@alien.top to c/localllama@poweruser.forum

7 comments fedilink

Can you make any suggestions for a model that is good for general chat, and is not hyper-woke?

I've just had one of the base Llama-2 models tell me it's offensive to use the word "boys" because it reinforces gender stereotypes. The conversation at the time didn't even have anything to do with gender or related topics. Any attempt to get it to explain why it thought this resulted in the exact same screen full of boilerplate about how all of society is specifically designed to oppress women and girls. This is one of the more extreme examples, but I've had similar responses from a few other models. It's as if they tried to force their views on gender and related matters into conversations, no matter what they were about. I find it difficult to believe this would be so common if the training had been on a very broad range of texts, and so I suspect a deliberate decision was made to imbue the models with these sorts of ideas.

I'm looking for something that isn't politically or socially extreme in any direction, and is willing to converse with someone taking a variety of views on such topics.

19

1

Optimum Intel OpenVino Performance (alien.top)

submitted 1 year ago by fakezeta@alien.top to c/localllama@poweruser.forum

4 comments fedilink

Optimum Intel int4 on iGPU UHD 770

I'd like to share the result of inference using Optimum Intel library with Starling-LM-7B Chat model quantized to int4 (NNCF) on iGPU Intel UHD Graphics 770 (i5 12600) with OpenVINO library.

I think it's quite good 16 tk/s with CPU load 25-30%. Same performance with int8 (NNCF) quantization.

This is inside a Proxmox VM with SR-IOV virtualized GPU 16GB RAM and 6 cores. I also found that the ballooning device might cause crash of the VM so I disabled it while the swap is on a zram device.

free -h output while inferencing:

total used free shared buff/cache available

Mem: 15Gi 6.2Gi 573Mi 4.7Gi 13Gi 9.3Gi

Swap: 31Gi 256Ki 31Gi

Code adapted from https://github.com/OpenVINO-dev-contest/llama2.openvino

What's your thoughts on this?

20

1

I refuse to believe my MacBook M1 Pro is faster than my 2070 8Gb Super + i7 8gen (both have 16Gb ram) (alien.top)

submitted 1 year ago by roll_left_420@alien.top to c/localllama@poweruser.forum

2 comments fedilink

While my 2070 is much faster at training CNNs and RNNs on large datasets, my MacBook is an absolute beast at running quantized LLMs and blows my gaming desktop out of the water with generation speed.

I’ve been testing a variety of quantized models on my MacBook as I build out my own internet-optional virtual assistant framework.

I was planning to do fine tuning on my gaming desktop but has anyone tried on an M1 Pro?

21

1

13b models chart (alien.top)

submitted 1 year ago by qualaric@alien.top to c/localllama@poweruser.forum

1 comments fedilink

Where can I find charts about top performing 13b parameters LLM models?

I am trying to download a model and run it locally which fit my PC specs

Appreciate your feedback in advanced boys

22

1

Meta AI Researcher: "Big breakthrough last night. Really excited to share what we've been building with you guys soon." (alien.top)

submitted 1 year ago by shmishmouyes@alien.top to c/localllama@poweruser.forum

1 comments fedilink

Armen Aghajanyan, a research scientist at Meta AI, tweeted a few hours ago that they hit a big breakthrough last night. Unknown if it's related to LLMs or if it will even be open-sourced, but just thought I'd share here to huff some hopium with y'all.

23

1

Why Starling RLAIF has better results that Zephyr DPO ? (alien.top)

submitted 1 year ago by Puzzleheaded_Mall546@alien.top to c/localllama@poweruser.forum

1 comments fedilink

Based on this image:

https://preview.redd.it/z5vf03e8r54c1.png?width=648&format=png&auto=webp&s=0a652e76ab2489135ed2327e8156029eacf274b7

Starling has better results than Zephyr DPO in all the metrics, Why ?

Shouldn't DPO be better than RLHF/RLAIF ?

24

1

Intel neural-chat-7b Model Achieves Top Ranking on LLM Leaderboard (community.intel.com)

submitted 1 year ago by reps_up@alien.top to c/localllama@poweruser.forum

1 comments fedilink

25

1

Running Multiple WebUI instances (follow up from my question yesterday) (alien.top)

submitted 1 year ago by multiverse_fan@alien.top to c/localllama@poweruser.forum

2 comments fedilink

It's working great so far. Just wanted to share and spread awareness that running multiple instances of webui (oobabooga) is basically a matter of having enough ram. I just finished running three models simultaneously (taking turns of course). Only offloaded one layer to gpu per model, used 5 threads per model, and all contexts were set to 4K. (the computer has 6 core cpu, 6GB vram, 64GB ram)

The models used were:

dolphin-2.2.1-ashhlimarp-mistral-7b.Q8_0.gguf

causallm_7b.Q5_K_M.gguf

mythomax-l2-13b.Q8_0.gguf (i meant to load a 7B on this one though)

I like it because it's similar to the group chat on character.ai but without the censorship and I can edit any of the responses. Downsides are having to copy/paste between all the instances of the webui, and it seems that one of the models was focusing on one character instead of both. Also, I'm not sure what the actual context limit would be before the gpu would go out of memory.

https://preview.redd.it/8i6wwjjtt54c1.png?width=648&amp%3Bformat=png&amp%3Bauto=webp&amp%3Bs=26adca2a850f62165301390cdd4ba11548447c0d

https://preview.redd.it/3c9z5ee9u54c1.png?width=1154&amp%3Bformat=png&amp%3Bauto=webp&amp%3Bs=210d7c67bcf0efafeb3f328e76199f13159dae64

https://preview.redd.it/lt8aizhbu54c1.png?width=1154&amp%3Bformat=png&amp%3Bauto=webp&amp%3Bs=d24f8b2bf899084bbdb11d73e34b5564b629e0be

https://preview.redd.it/8lbl4nzeu54c1.png?width=1154&amp%3Bformat=png&amp%3Bauto=webp&amp%3Bs=a81b8f1d8630e3d17ad37885915f8c7e3077584c