Machine Learning

[D] Three things I think should get more attention in large language models (alien.top)

submitted 2 years ago by ExaminationNo8522@alien.top to c/machinelearning@academy.garden

Tokenization Techniques: Many people use the default BPE tokenizer for Llama 2 or another common tokenizer. But I think there is room for a lot more experimentation with different kinds of tokenizers, especially ones built to work well with specific types of data. Vocabulary size is a really important hyperparameter when you're working with large language models. You could train a much smaller vocabulary and tokenizer on a dataset that only uses a restricted set of words, and then train a model on top of that. This might help us train smaller models that still work really well with less data. I’d love to read any research papers about this.
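
To make the small-vocabulary idea concrete, here is a minimal sketch of training a BPE tokenizer from scratch on a tiny domain corpus, in plain Python. The function names (`train_bpe`, `merge_pair`, `encode`) and the toy corpus are made up for illustration; this is not the Llama 2 tokenizer or any library's API, just the basic merge-learning loop with the number of merges acting as the vocabulary-size knob.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings.

    The final vocabulary is roughly (distinct characters + num_merges),
    so `num_merges` is the knob that controls vocabulary size.
    """
    # Each word becomes a tuple of characters plus an end-of-word marker.
    words = Counter()
    for line in corpus:
        for word in line.split():
            words[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy "domain" corpus (hypothetical); a real experiment would use the target dataset.
corpus = ["low low low lower lowest", "new newer newest"]
merges = train_bpe(corpus, num_merges=10)
print(encode("lowest", merges))
```

Sweeping `num_merges` on a restricted-domain dataset and comparing the resulting sequence lengths would be one cheap way to start the kind of experiment described above.
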
Sampling Mechanisms: There’s a lot of discussion about models making things up, but not many people talk about how this could be connected to the way we pick the next word when generating text. Most of the time, we treat the model's output as a probability distribution and randomly sample the next word from it. But this doesn’t always make sense, especially for sentences that should have a single clear answer. For example, if the sentence starts with "The capital of Slovakia is", random sampling might give you the wrong answer, even though the model assigns the highest probability to "Bratislava". Picking words randomly like this could lead to the model making things up. I wonder if we could train another model to help decide how to pick the next word, or if there are better ways to do this sampling.
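
The Slovakia example can be sketched in a few lines. The candidate words and logits below are invented for illustration (they do not come from any real model), and `greedy_pick` and `sample_pick` are hypothetical helper names. The sketch only contrasts always taking the most likely word with drawing from the softmax distribution at a given temperature.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(tokens, logits):
    """Always return the highest-scoring candidate."""
    best_index = max(range(len(logits)), key=logits.__getitem__)
    return tokens[best_index]


def sample_pick(tokens, logits, temperature=1.0, rng=random):
    """Draw one candidate at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical next-word scores after "The capital of Slovakia is".
tokens = ["Bratislava", "Prague", "Vienna", "Budapest"]
logits = [3.0, 1.5, 1.0, 0.5]

rng = random.Random(0)
draws = [sample_pick(tokens, logits, rng=rng) for _ in range(1000)]
wrong = sum(1 for d in draws if d != "Bratislava")

print("greedy:", greedy_pick(tokens, logits))
print("wrong answers in 1000 random samples:", wrong)
```

With these made-up logits, "Bratislava" has roughly a 69% probability, so plain sampling returns a different city a sizable fraction of the time even though the top choice is correct. Lowering `temperature` toward zero makes sampling behave more and more like `greedy_pick`.
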
Softmax Alternatives in Neural Networks: I've worked on designing processors for neural networks, and I’ve found that the softmax function is tricky to implement in hardware. However, I’ve had good results using the log(exp(x)+1) function (softplus) instead. It's cheaper and easier to implement in both hardware and software. I’ve tried this with smaller GPT models, and the results looked just as good as when I used the softmax function.
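
For reference, here is a small side-by-side of softmax and log(exp(x)+1) applied to a vector of scores. The post does not say where the replacement was made (attention weights or elsewhere) or whether the outputs were renormalized, so both the `normalize` option and the helper name `softplus_weights` are assumptions for illustration rather than the setup actually used.

```python
import math


def softmax(scores):
    """Standard softmax: needs a max, an exp per element, a sum, and a divide."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softplus(x):
    """Numerically stable log(exp(x) + 1); purely element-wise."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_weights(scores, normalize=True):
    """Map scores to positive weights with softplus.

    normalize=False keeps the operation element-wise (no reduction over the vector).
    normalize=True divides by the sum so the weights add up to 1, like softmax.
    Which variant matches the post is not stated; both are shown as assumptions.
    """
    raw = [softplus(s) for s in scores]
    if not normalize:
        return raw
    total = sum(raw)
    if total == 0.0:
        # All scores underflowed to zero; fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


scores = [2.0, 0.5, -1.0]
print("softmax:            ", softmax(scores))
print("softplus normalized:", softplus_weights(scores))
print("softplus raw:       ", softplus_weights(scores, normalize=False))
```

Both mappings are positive and preserve the ordering of the scores, but softplus grows linearly for large inputs instead of exponentially, so the resulting weights are flatter than softmax for the same scores.
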

residentmouse@alien.top 1 point 2 years ago

I’d add a few others to this list, but I largely agree with the premise that we focus too much on attention. We lavish praise on the Transformer model, but so much extra machinery goes into making it work even a little bit, and now papers are coming out claiming ConvNets scale at the same rate, and the RetNet paper claims you can swap out attention altogether.

Obv. the issue is “emergence” (terrible term, but I mean non-linear training performance) and the sheer cost of testing permutations of LLM architecture at scale. To what extent has the ML community become the victim of sunk cost?