Looks very interesting!
Will this work on a pre-AVX CPU only machine? ( I happen to be far away from a computer right now to test)
I am aware OpenAI made RAG more accessible by chunking and embedding files that users upload, but hasn't RAG long been used in many LLM applications?
Has there been a recent breakthrough or step up in embedding quality that I am unaware of? May I ask what makes this different from a library like Sentence-Transformers?
I've found those models can embed thousands of paragraphs a minute on local hardware which is enough for most local use cases.
With the upvotes and comments, I'm sure you've built something really useful; I'd just love to know how you envision it being used.
I am happy OpenAI just joined the RAG game in terms of user interface.
The "backend" offering from OpenAI for API embedding models has always been quite underwhelming, to the point that multiple open models outperform OpenAI's latest ones. Their old v1 embedding models were so bad that almost any sentence-transformer would win against them while being 10-15x cheaper.
Here is a post with benchmarks from LlamaIndex last week (https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) showing how open-source models from Jina and BGE are on par with or better than offerings from Google, OpenAI, and even Cohere.
The only thing missing was a piece of OSS infrastructure, and that is what https://github.com/michaelfeil/infinity is here for.
Obviously, you can run all things at OpenAI (user interface, embeddings, vector db, llm) - but I guess that's not what r/LocalLLaMA stands for.
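To give an idea of the client side: a rough sketch of calling such a local server over an OpenAI-style `/embeddings` endpoint. The port, path, and model name below are assumptions for illustration; check the repo's README for the actual launch command and route.

```python
import json
import urllib.request

def build_payload(texts, model):
    # OpenAI-style embeddings request body (field names assumed)
    return {"model": model, "input": texts}

def embed(texts, model="BAAI/bge-small-en-v1.5", base="http://localhost:7997"):
    # POST the texts to a locally running embedding server and
    # return one embedding vector per input text.
    req = urllib.request.Request(
        f"{base}/embeddings",
        data=json.dumps(build_payload(texts, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

The point of mimicking the OpenAI request shape is that existing RAG stacks can be pointed at a local server by swapping the base URL, without touching the rest of the pipeline.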
Thank you for the detail and references. Yeah, I know; I have used sentence-transformers, and before them BERT/T5 embeddings, for a long time (e.g. Kaggle competitions, a few hackathons around the issue...), but I am just wondering what motivated you to create an embeddings server, as opposed to running the embeddings in place in the code with the SBERT models, or calling an API as you mention with those alternatives? Is the Python code in your getting-started section much faster than just using the SentenceTransformer module with batch arrays?
Because I have found, such as when competing in the Learning Agency competitions, that you can build the indexes locally, or use open-source tools like LlamaIndex equivalents with SBERT, rather than needing to set up a server. Am I missing something to do with speed, or do new models take longer to embed? What problem are you and others facing that makes a server for embeddings better than doing it in the code?
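To be concrete, by "build the indexes locally" I mean something as simple as this: a toy sketch with made-up 2-d vectors standing in for real precomputed embeddings, ranked by cosine similarity with plain NumPy.

```python
import numpy as np

# Hypothetical precomputed embeddings for three documents (2-d for clarity;
# real models produce 384+ dimensions).
corpus_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.9, 0.1])

# Normalize rows and the query, then cosine similarity is a dot product.
corpus_norm = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)
scores = corpus_norm @ query_norm

best = int(np.argmax(scores))  # index of the most similar document -> 0
```

For a few thousand documents this brute-force search runs in milliseconds in-process, with no server or vector DB in the loop, which is why I'm asking where the server pays off.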
Very interesting project indeed, and it seems in part to be what I'm doing at https://github.com/d0rc/agent-os