this post was submitted on 13 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Hello everyone, I am trying to use Llama-2 (7B) from Hugging Face. With the code below I was able to load the model successfully, but when I try to generate output it takes forever.

Code

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the local Llama-2-7b-hf directory
tokenizer = AutoTokenizer.from_pretrained("Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("Llama-2-7b-hf")

# Tokenize the prompt
input_ids = tokenizer.encode("What is LLM?", return_tensors="pt")

# Generate up to 100 new tokens (do_sample defaults to False, so decoding is
# greedy and temperature=0 is effectively ignored)
output = model.generate(
    input_ids,
    temperature=0,
    max_new_tokens=100
)

# Decode the generated token ids back into text
generated_text = tokenizer.decode(output[0])
print(generated_text)

Model files were downloaded from Llama-2-7b-hf.

Hardware: MacBook Pro (M2 Pro), 16 GB RAM

1 comment

Tacx79@alien.top:

You're loading it in fp32, which requires ~28 GB of memory. Try koboldcpp or oobabooga with GGUF models from TheBloke.
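
To illustrate the memory point: 7 billion parameters at 4 bytes each (fp32) is roughly 28 GB, well over the 16 GB on this machine, so the model spills to swap and generation crawls. A minimal sketch of how you might load it in half precision instead (about 14 GB of weights) is below, assuming PyTorch and a recent transformers release; the use of the Apple MPS backend and the do_sample/dtype settings are my assumptions, not something stated in the thread.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: use Apple's MPS backend if available, otherwise fall back to CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "Llama-2-7b-hf",
    torch_dtype=torch.float16,  # fp16 weights: roughly half the fp32 footprint
).to(device)

input_ids = tokenizer.encode("What is LLM?", return_tensors="pt").to(device)

output = model.generate(
    input_ids,
    do_sample=False,        # greedy decoding; no temperature needed
    max_new_tokens=100,
)

print(tokenizer.decode(output[0]))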
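
The comment's GGUF suggestion points at koboldcpp or oobabooga, but a quantized GGUF file can also be loaded directly with llama-cpp-python. A rough sketch of that route follows; the package, the 4-bit file name, and the context size are assumptions for illustration, not details from the thread.

from llama_cpp import Llama

# Hypothetical path to a quantized GGUF file (e.g. one of TheBloke's Llama-2-7B builds)
llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

# Quantized 4-bit weights fit comfortably in 16 GB of RAM
out = llm("What is LLM?", max_tokens=100, temperature=0)
print(out["choices"][0]["text"])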