this post was submitted on 27 Nov 2023

LocalLLaMA

Community to discuss about Llama, the family of large language models created by Meta AI.

Hello fellow llamas!!!

Here is what I am hacking on…

I am exploring new ways to build generative AI foundational models without traditional math-centric training costs and resources. I am trying to lower the bar for anyone looking to build and share models that are:

- task-trained - models are trained to do very specific task(s) using only the required datasets (deliberately overfitting to known use case(s), instead of generalizing/underfitting and having to search the equivalent of the entire internet to respond)

- modular - because the models only know about these smaller, task-trained dataset(s), they will hopefully respond faster than today's generalized models

- device-native - models target constrained environments that do not have gpu clusters or excess ram/cpu/storage/connectivity

- open source - since the weights are public domain, the derived intelligence should be public domain

- type of foundational model: weight-derived (blog: https://matlok.ai/ docs: https://bampe-weights.readthedocs.io/en/latest/)

I believe some math/stats proofs may still be missing (see the smooth-brain), but I want to push this modular, lego-block-like approach in hopes of reaching parity with a new generation of foundational models. One of my fundamental assumptions is that if I substantially reduce the training corpus, a smaller, overfit model will hopefully be faster than a traditionally-trained large language model. The initial, slimmer model-building process should also hopefully run on IoT devices and plug in to existing distributed architectures (device-native).

What are you doing next - Initial use case?

I need help choosing a good initial use case (please let me know if you have better ones!). Current best idea of the week/last 3 days: I believe this approach and knowledge system for assembling weight-derived models should be shared so we can ensure concepts like an “ethical watermark” for Asimov's Laws of Robotics are always present in pre-trained AI model weights, verified via cosine similarity searches. As this approach matures, we should be able to audit and report on what these models know, and I think we need a community-driven project to tackle it.
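As a rough illustration of the cosine-similarity audit idea, here is a toy sketch in numpy. Everything here (the `contains_watermark` helper, the chunking scheme, the 0.95 threshold) is my own illustration, not something from the PoC repo:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contains_watermark(weights: np.ndarray, watermark: np.ndarray,
                       threshold: float = 0.95) -> bool:
    """Scan non-overlapping chunks of a flattened weight tensor for a
    watermark vector using cosine similarity (hypothetical audit step)."""
    flat = weights.ravel()
    size = watermark.size
    for start in range(0, flat.size - size + 1, size):
        if cosine_similarity(flat[start:start + size], watermark) >= threshold:
            return True
    return False

# toy demo: plant a watermark vector inside random "model weights", then detect it
rng = np.random.default_rng(0)
watermark = rng.normal(size=64)
weights = rng.normal(size=(32, 64))
weights.ravel()[128:192] = watermark  # plant the watermark in the third chunk
print(contains_watermark(weights, watermark))  # True
```

A real audit would scan every tensor in a checkpoint and would need a robust embedding rather than raw weight slices, but the search primitive is the same.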

tl;dr

It's early days, but I believe we can reuse existing AI tensor weights, complemented with smaller "fine-tuning"-sized datasets, to build small, fast, high-quality generative models.

PoC repository:

https://github.com/matlok-ai/bampe-weights

Inputs

Extracted tensor weight from a GPT2 model.safetensors file:


https://raw.githubusercontent.com/matlok-ai/gen-ai-datasets-for-bampe-weights/main/docs/images/safetensors/gpt2/in/idata__h.0.attn.c_attn.weight.png
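For context, a weight image like the one linked above could be produced by min-max normalizing a 2D tensor into grayscale pixels. This is a minimal sketch with numpy, using a random stand-in matrix in place of the real GPT2 `h.0.attn.c_attn.weight` tensor (which has shape 768 x 2304); `weight_to_image` is a hypothetical helper, not part of the repo:

```python
import numpy as np

def weight_to_image(tensor: np.ndarray) -> np.ndarray:
    """Min-max normalize a 2D weight tensor into a uint8 grayscale image."""
    lo, hi = tensor.min(), tensor.max()
    scaled = (tensor - lo) / (hi - lo) if hi > lo else np.zeros_like(tensor)
    return (scaled * 255).astype(np.uint8)

# stand-in for the GPT2 attention weight (real tensor is 768 x 2304)
rng = np.random.default_rng(0)
fake_weight = rng.normal(size=(768, 2304))
img = weight_to_image(fake_weight)
print(img.shape, img.dtype)
```

The resulting uint8 array can be written out as a PNG with any imaging library, which is presumably how the linked screenshots were made.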

Outputs

Predicted weight-derived file for use in a new type of foundational generative AI model

This screenshot is an example of “trained weights” for a new type of foundational generative AI model (referred to as a weight-derived model)

https://raw.githubusercontent.com/matlok-ai/gen-ai-datasets-for-bampe-weights/main/docs/images/safetensors/gpt2/out/gpu-generated_predicted-model-weights__layer__h.0.attn.c_attn.weight__chunk__0.png

Thanks for the help, guidance and assistance staying up with the insane speed of this ecosystem!

Reach out if you want more info - my email is in the profile

[–] dqUu3QlS@alien.top 1 points 9 months ago (3 children)

If I understand this correctly, you're using a smaller NN to predict the weights of a larger one? Have you tested to make sure this approach preserves the performance of the larger model? What advantage does your approach have compared to existing approaches - distillation, quantization, pruning, just training smaller models directly?

I can think of some clear disadvantages for performance.

[–] buildinstuff5432@alien.top 1 points 9 months ago (2 children)

Great questions! In the poc https://bampe-weights.readthedocs.io/en/latest/ I’m exploring whether I can extract weights from a larger, pretrained AI model (https://huggingface.co/gpt2/tree/main) and then reuse the predicted, smaller subset of new weights for a hypothetical smaller model.

I think this approach can work because we already have many AI models with “good enough” answers (a source of truth), so we can start exploring new ways to build models that reach parity with the current generation. I believe there is a way to hand-mold models as an individual, without many gpus and without higher-level math training, by reusing today’s weights with today’s image-to-image transformers to answer/solve a subset of the original large, pretrained weights’ domain knowledge (unproven). Until I get the first small one reassembled, this is just a sharing-the-journey-as-I-go type of post.

A large technical disadvantage: I think we need a new type of precision cutting tool to extract and recognize shapes inside tensor weight images. I am initially thinking of using an embedding database/store (e.g. a modified postgres https://github.com/pgvector/pgvector) that performs a cosine similarity search over the embedded weights to do this (no gpu required). Compared to today’s paradigm for training and building models, I have to reuse and search the entire internet for each answer, and I need gpu gear to run anything >30b because of how these models were foundationally trained. I totally agree there’s a ton of disadvantages/risk with any new approach that rebuilds something from the ground up (especially with this level of maths), but the poc shows today’s models can predict new weights without training and without entity extraction/ml. Within 13-30 seconds the output is not dramatically horrible vs the original source weights, and we get a configurable-sized output chunk for reassembly that works without a gpu (test chunk sizes ~0.7-11.8mb per chunk).
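To make the configurable chunk-for-reassembly idea concrete, here is a toy sketch of splitting a flattened float32 weight tensor into byte-budgeted chunks and stitching it back together. `chunk_weights` and `reassemble` are hypothetical names, and the real PoC operates on predicted weight images rather than raw arrays:

```python
import numpy as np

def chunk_weights(flat: np.ndarray, chunk_bytes: int) -> list:
    """Split a flattened weight array into chunks of at most chunk_bytes
    bytes each (a hypothetical reassembly unit)."""
    per_chunk = max(1, chunk_bytes // flat.itemsize)
    return [flat[i:i + per_chunk] for i in range(0, flat.size, per_chunk)]

def reassemble(chunks: list, shape: tuple) -> np.ndarray:
    """Concatenate chunks back into the original tensor shape."""
    return np.concatenate(chunks).reshape(shape)

# GPT2-sized example: 768 x 2304 float32 weights at ~0.7 MB per chunk
weights = np.arange(768 * 2304, dtype=np.float32).reshape(768, 2304)
chunks = chunk_weights(weights.ravel(), chunk_bytes=700_000)
restored = reassemble(chunks, weights.shape)
print(len(chunks), np.array_equal(restored, weights))
```

Each chunk could then be embedded and stored in a vector database (e.g. pgvector) so similarity search can locate the pieces needed for a given task at reassembly time.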

[–] OVAWARE@alien.top 1 points 9 months ago

This sounds absolutely crazy, something that both should never work and also should work, and I don't know how to feel about it. It's an interesting idea at least.

[–] dqUu3QlS@alien.top 1 points 9 months ago

A large technical disadvantage: I think we need a new type of precision cutting tool to extract and recognize shapes inside tensor weight images

Why do you think we need this? To me, it just indicates that the structure of Stable Diffusion is designed for real-world photos, artwork, and diagrams, and ill-suited for predicting the weights of an LLM.

the poc shows today’s models can predict new weights without training and without entity extraction/ml and within 13-30 seconds the output is not dramatically horrible vs the original source weights.

Are you sure the output isn't dramatically horrible? To me the predicted weight images look nothing like the original weight images. The fine detail is completely different.

But it doesn't even matter how it looks to human eyes. What matters is, when a new model is constructed from the predicted weights, whether that model makes mostly-correct predictions.