LocoMod

joined 1 year ago
[–] LocoMod@alien.top 1 points 11 months ago

BAAI has the best embedding model I’ve tried so I’m excited to see what comes of this.

[–] LocoMod@alien.top 1 points 11 months ago

This is what my hobby project essentially does. I’m running a single chat from 3 different servers in my network, all serving different LLMs that are each given a role in the chat pipeline. I can send the same prompt to multiple models so they can work on it concurrently, or have them hand off each other’s responses to continue elaborating, validating, or whatever that LLM’s job is. Since each server exposes an API and a websocket route, all I need to do is put them behind a proxy and port forward them to the public internet. Anyone here could visit the public URL and run inference workflows in my homelab (theoretically speaking). They could also spin up an instance on their side and we could have our servers talk to each other.

Of course that’s highly insecure and just bait for bad actors, so I will scale it using an overlay network that requires a key exchange and runs over a VPN.
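For the fan-out part described above, here's a minimal sketch in Go. The server addresses, the `/completion` endpoint, and the JSON shape are assumptions (a llama.cpp-style HTTP server), not the actual project API:

```go
// Minimal fan-out sketch: send one prompt to several model servers concurrently.
// The endpoint and request shape are assumptions; adapt to whatever each server exposes.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"sync"
)

func query(server, prompt string) (string, error) {
	body, _ := json.Marshal(map[string]any{"prompt": prompt})
	resp, err := http.Post(server+"/completion", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	return string(raw), err
}

func main() {
	// Hypothetical homelab hosts, each serving a different model.
	servers := []string{"http://10.0.0.2:8080", "http://10.0.0.3:8080", "http://10.0.0.4:8080"}
	prompt := "Summarize the tradeoffs of unified memory for local inference."

	var wg sync.WaitGroup
	results := make([]string, len(servers))
	for i, s := range servers {
		wg.Add(1)
		go func(i int, s string) { // one goroutine per server
			defer wg.Done()
			out, err := query(s, prompt)
			if err != nil {
				out = "error: " + err.Error()
			}
			results[i] = out
		}(i, s)
	}
	wg.Wait()
	for i, r := range results {
		fmt.Printf("--- %s ---\n%s\n", servers[i], r)
	}
}
```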

Any startup thinking they are going to profit from this idea will only burn investor money and waste their own time. This will all be free and it’s only a matter of time before the open source community cuts into their hopes and dreams.

[–] LocoMod@alien.top 1 points 11 months ago

I am not a Python hater, but Go is what Python should have been if it actually stuck to the Zen of Python.

You do know that what is arguably the most successful open-source project of the past decade, one that powers most of the modern internet, is written in Go, right?

https://github.com/kubernetes/kubernetes

[–] LocoMod@alien.top 1 points 11 months ago

We need some hero to develop an app that downloads more GPU memory like those apps back in the 90's. /s

[–] LocoMod@alien.top 1 points 11 months ago

I’m getting the same output. Those are line breaks. How odd…

[–] LocoMod@alien.top 1 points 11 months ago

Ideally we’d be in a timeline where LLMs could do this better than classical methods, but we’re not there yet. You can code a handler that cleans up retrieved HTML quite trivially, since you’re just looking for the text in specific tags like articles, headers, paragraphs, etc. There are a ton of frameworks and examples out there on how to do this, and a proper handler will execute the cleanup in a fraction of the time even the most powerful LLM could ever hope to.
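For example, here's a rough sketch of such a handler in Go using golang.org/x/net/html. The tag whitelist and the URL are just placeholders:

```go
// Sketch of a cleanup handler: keep only text from content-bearing tags.
package main

import (
	"fmt"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

// Example whitelist of tags whose text we want to keep.
var keep = map[string]bool{"article": true, "p": true, "h1": true, "h2": true, "h3": true, "li": true}

func extract(n *html.Node, inKeep bool, out *strings.Builder) {
	if n.Type == html.ElementNode {
		if n.Data == "script" || n.Data == "style" {
			return // skip non-content subtrees entirely
		}
		inKeep = inKeep || keep[n.Data]
	}
	if n.Type == html.TextNode && inKeep {
		if t := strings.TrimSpace(n.Data); t != "" {
			out.WriteString(t + "\n")
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		extract(c, inKeep, out)
	}
}

func main() {
	resp, err := http.Get("https://example.com") // placeholder URL
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	doc, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}
	var out strings.Builder
	extract(doc, false, &out)
	fmt.Println(out.String()) // cleaned text, ready to send along with the prompt
}
```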

[–] LocoMod@alien.top 1 points 11 months ago

This is something I’ve noticed with large context as well. This is why the platform built around the LLMs is what will be the major differentiator for the foreseeable future. I’m cooking up a workflow to insert remote LLMs into a chat pipeline, and about an hour ago I successfully tested running inference on a fast Mistral-7B and a large Dolphin-Yi-70B on different servers from a single chat view. This will unlock the capability to have multiple LLMs working together to manage context by providing summaries, offloading real-time embedding/retrieval to a remote LLM, and a ton of other possibilities.

I got it working on a 64GB M2 and a 128GB M3. Tonight I will insert the RTX 4090 into the mix. The plan is to have the 4090 run small LLMs, think 13B and smaller, which run at light speed on that card. Its job can be to provide summaries of the context using LLMs finetuned for that purpose. The new Orca13B is a promising little agent that so far follows instructions really well for these types of workflows. Then we can have all 3 servers working together on a solution. Ultimately, all of the responses would be merged into the “ideal response” and output as the “final answer”. I am not concerned with speed for my use case since I use LLMs for highly technical work; I need correctness above all, even if that means waiting a while for the next step.
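As a hedged illustration of the summarization-offload idea: the host address, the `/completion` endpoint, and the character budget below are all assumptions, but the shape of the workflow is roughly this:

```go
// Rough sketch of offloading context summarization to a small remote model.
// When the running chat context gets long, a fast remote 13B compresses it
// and the summary is what gets fed to the big model on the next turn.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

const summarizer = "http://192.168.1.50:8080" // hypothetical 4090 box running a 13B

func complete(server, prompt string) (string, error) {
	body, _ := json.Marshal(map[string]any{"prompt": prompt})
	resp, err := http.Post(server+"/completion", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	return string(raw), err
}

// compact asks the remote summarizer to compress the context once it grows
// past a rough character budget (arbitrary threshold for the sketch).
func compact(context string) (string, error) {
	if len(context) < 8000 {
		return context, nil
	}
	return complete(summarizer, "Summarize this conversation, keeping all technical details:\n\n"+context)
}

func main() {
	ctx := "...long chat history..."
	short, err := compact(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println(short)
}
```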

I’m also going to implement a mesh VPN so we can do this over WAN and scale it even more with a trusted group of peers.

The magic behind ChatGPT is the tooling and how much compute they can burn. My belief is that the model is less relevant than folks think. It’s the best model, no doubt, but if we were allowed to run it on the CLI as a pure prompt/response workflow between user and model with no tooling in between, my belief is it would perform a lot like the best open-source models…

[–] LocoMod@alien.top 1 points 11 months ago

You’ve basically described the entire purpose behind Retrieval Augmented Generation.
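For anyone who hasn't seen it spelled out, a bare-bones sketch of the idea in Go: embed the query, rank stored chunks by cosine similarity, and prepend the top hits to the prompt. The `embed` function here is a stub standing in for a call to a real, locally served embedding model:

```go
// Minimal RAG sketch: retrieval by cosine similarity over pre-embedded chunks.
package main

import (
	"fmt"
	"math"
	"sort"
)

type chunk struct {
	Text string
	Vec  []float64
}

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-9)
}

// embed is a placeholder; swap in a real call to an embedding server.
func embed(text string) []float64 { return []float64{float64(len(text)), 1, 1} }

// retrieve returns the k chunks most similar to the query.
func retrieve(query string, store []chunk, k int) []string {
	q := embed(query)
	sort.Slice(store, func(i, j int) bool { return cosine(store[i].Vec, q) > cosine(store[j].Vec, q) })
	var out []string
	for i := 0; i < k && i < len(store); i++ {
		out = append(out, store[i].Text)
	}
	return out
}

func main() {
	store := []chunk{
		{"Unified memory lets the GPU address system RAM.", embed("Unified memory lets the GPU address system RAM.")},
		{"Goroutines are cheap green threads.", embed("Goroutines are cheap green threads.")},
	}
	hits := retrieve("how does unified memory work?", store, 1)
	prompt := fmt.Sprintf("Context:\n%s\n\nQuestion: how does unified memory work?", hits[0])
	fmt.Println(prompt) // this augmented prompt is what goes to the LLM
}
```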

[–] LocoMod@alien.top 1 points 11 months ago (1 children)

What's stopping us from building a mesh of web crawlers and creating a distributed database that anyone can host and add to the total pool of indexers/servers? How long would it take to create a quality dataset by deploying bots that crawl their way "out" of the most popular and trusted sites for particular knowledge domains and just compress and dump that into a format for training into said global p2p mesh? If we got a couple of thousand nerds on Reddit to contribute compute and storage capacity to this network we might be able to build it relatively fast. Just sayin...
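To make the "crawl outward from trusted seeds" part concrete, here's a toy breadth-first crawler sketch. Politeness, robots.txt, dedup, and the actual dataset format are all left out, and the seed URL is a placeholder:

```go
// Toy crawler: breadth-first fetch starting from seed URLs, bounded by depth
// and a visited set. Just the skeleton of the "crawl outward" idea.
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
)

var linkRe = regexp.MustCompile(`href="(https?://[^"]+)"`)

func fetchLinks(url string) []string {
	resp, err := http.Get(url)
	if err != nil {
		return nil
	}
	defer resp.Body.Close()
	page, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // cap at 1 MiB per page
	var out []string
	for _, m := range linkRe.FindAllStringSubmatch(string(page), -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	frontier := []string{"https://en.wikipedia.org"} // placeholder seed
	visited := map[string]bool{}
	for depth := 0; depth < 2; depth++ { // crawl "out" two hops from the seeds
		var next []string
		for _, u := range frontier {
			if visited[u] {
				continue
			}
			visited[u] = true
			next = append(next, fetchLinks(u)...)
		}
		frontier = next
	}
	fmt.Println("pages seen:", len(visited))
}
```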

[–] LocoMod@alien.top 1 points 11 months ago

This is going to be a solid Netflix miniseries in two years. I am cheering for the team at OpenAI. There is no victory in stagnation and no honor in fearing the unknown. Full steam ahead, folks. And bring on GPT-5 already!

[–] LocoMod@alien.top 1 points 11 months ago

I have. You simply parse the prompt for a URL and then write a handler to retrieve the page content in whatever language or framework you use. Then you clean it up, send the content along with the prompt to the LLM, and do QA over it.
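A minimal sketch of that flow in Go; the HTML cleanup step is stubbed out (a real handler would strip tags the way described earlier), and the example prompt is hypothetical:

```go
// Sketch of the "URL in the prompt" flow: find a URL, fetch the page,
// and wrap everything into a QA prompt for the model.
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
)

var urlRe = regexp.MustCompile(`https?://\S+`)

func handle(userPrompt string) (string, error) {
	url := urlRe.FindString(userPrompt)
	if url == "" {
		return userPrompt, nil // nothing to retrieve, pass through
	}
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	page, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return "", err
	}
	// In practice: clean the HTML here before building the final prompt.
	return fmt.Sprintf("Use the following page to answer.\n\nPAGE:\n%s\n\nQUESTION: %s", page, userPrompt), nil
}

func main() {
	p, err := handle("What does https://example.com say about examples?")
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // the augmented prompt is what gets sent to the LLM
}
```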

 

Left / CodeBooga34B - Right / NousCapybara34B

I've been working on a Go framework, as a hobby project, with the intent of having all of the basic dependencies for LLM workflows in one place while learning the foundations and architecture supporting LLMs. This has allowed me to build basic pipelines and experiment freely. Lately I have been testing running multiple LLMs concurrently on the same host, which is possible thanks to the unified memory architecture on modern Apple hardware.

I recently read that using the same model with different system prompts to simulate agents collaborating with each other is less than ideal, since the model tends to agree with itself given that it's trained on the same dataset. At least that's how I interpreted it. Today I was finally able to set up a pipeline to provision CodeBooga34B and NousCapybara34B on an M2 with 64GB of memory and, to my surprise, it worked! The test was to have CodeBooga generate a simple Go program and then have NousCapybara validate and enhance CodeBooga's output. The generated code worked without any edits on my part!
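The handoff itself boils down to something like the sketch below. The endpoints and the `/completion` JSON shape are assumptions about how the two servers are exposed, not the framework's actual API:

```go
// Sketch of the generate-then-review handoff: one model drafts Go code,
// a second model is prompted to validate and improve it.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func complete(server, prompt string) (string, error) {
	body, _ := json.Marshal(map[string]any{"prompt": prompt})
	resp, err := http.Post(server+"/completion", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	return string(raw), err
}

func main() {
	coder := "http://localhost:8081"    // e.g. the CodeBooga34B server
	reviewer := "http://localhost:8082" // e.g. the NousCapybara34B server

	task := "Write a Go program that prints the first 10 Fibonacci numbers."
	draft, err := complete(coder, task)
	if err != nil {
		panic(err)
	}
	review, err := complete(reviewer,
		"Review the following Go program for the task \""+task+"\". Fix bugs and return the improved code only.\n\n"+draft)
	if err != nil {
		panic(err)
	}
	fmt.Println(review)
}
```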

What other interesting pipelines, workflows or tests would be ideal? The framework uses goroutines and websockets, and I should be able to essentially cycle the models in and out as needed. For example, while "model 2" is generating and validating the answer from "model 1", we could be loading "model 3" in the background, ready to receive the output from "model 2", and so on.
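A rough sketch of that cycling pattern, with `loadModel` and `generate` as stand-ins for whatever the framework actually does (e.g. starting a server process for that model and running inference against it):

```go
// While stage N is generating, stage N+1's model is warmed in the background
// so it's ready to receive the handoff.
package main

import (
	"fmt"
	"time"
)

type model struct{ name string }

// loadModel simulates pulling weights into unified memory.
func loadModel(name string) *model {
	time.Sleep(500 * time.Millisecond) // placeholder for real load time
	return &model{name: name}
}

// generate simulates inference for one pipeline stage.
func generate(m *model, input string) string {
	time.Sleep(1 * time.Second)
	return input + " -> refined by " + m.name
}

func main() {
	stages := []string{"model-1", "model-2", "model-3"}
	out := "original prompt"

	// Preload the first model, then always warm the next one while the
	// current one is busy generating.
	next := make(chan *model, 1)
	next <- loadModel(stages[0])
	for i := range stages {
		cur := <-next
		if i+1 < len(stages) {
			go func(name string) { next <- loadModel(name) }(stages[i+1]) // warm the next stage
		}
		out = generate(cur, out)
		fmt.Println(out)
	}
}
```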

Thoughts about other interesting workflows?
