this post was submitted on 28 Oct 2023

Machine Learning

 
  1. Tokenization Techniques: Most people just use the default BPE tokenizer from Llama 2 or another common tokenizer, but there is a lot of room to experiment with tokenizers tailored to specific kinds of data. Vocabulary size is an important hyperparameter when working with large language models. For example, you could train a much smaller vocabulary and tokenizer on a restricted-domain dataset and then train a model on top of it; that might let us train smaller models that still perform well on limited data (a rough sketch of this follows the list). I’d love to read any research papers about this.
  2. Sampling Mechanisms: There’s a lot of discussion about models hallucinating, but not much about how this relates to the way we pick the next token when generating text. Usually we treat the model's output as a probability distribution and randomly sample the next token from it. But this doesn’t always make sense, especially for sentences that should have a clear answer. For example, if the sentence starts with "The capital of Slovakia is", random sampling might give you the wrong answer, even though the model assigns the highest probability to "Bratislava". This kind of random sampling could be one source of hallucination (a sketch contrasting a few sampling strategies also follows the list). I wonder if we could train another model to decide how to pick the next token, or if there are better ways to do this sampling.
  3. Softmax Alternatives in Neural Networks: I've worked on designing processors for neural networks, and the softmax function is tricky to implement in hardware. However, I’ve had good results using log(exp(x)+1) (the softplus function) instead: it's cheaper and easier to implement in both hardware and software. I’ve tried this with smaller GPT models, and the results looked just as good as with softmax (a rough numerical comparison follows the list as well).
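
To make point 1 concrete, here is a minimal sketch of training a small, domain-specific BPE tokenizer with the Hugging Face tokenizers library; the library choice, the 4096-token vocabulary, and the corpus.txt file are placeholder assumptions, not a description of any specific experiment:

```python
# Sketch: train a small, domain-specific BPE tokenizer instead of reusing
# the default Llama 2 vocabulary. The library (Hugging Face `tokenizers`),
# vocab size, and file names are placeholder assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Much smaller vocabulary than the ~32k tokens of the default Llama 2 tokenizer.
trainer = BpeTrainer(vocab_size=4096, special_tokens=["[UNK]", "[BOS]", "[EOS]"])

# corpus.txt stands in for a hypothetical restricted-domain dataset.
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("small_domain_tokenizer.json")

print(tokenizer.encode("The capital of Slovakia is Bratislava.").tokens)
```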
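For point 2, a rough sketch of how greedy decoding, plain sampling, and nucleus (top-p) sampling differ on a single next-token distribution; the tiny vocabulary and logits are made up for illustration:

```python
# Sketch: greedy decoding vs. plain sampling vs. nucleus (top-p) sampling over
# one next-token distribution. The tiny vocabulary and logits are made up.
import numpy as np

vocab = ["Bratislava", "Vienna", "located", "a", "not"]
logits = np.array([4.0, 1.0, 2.0, 1.5, 0.5])  # hypothetical model outputs

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)
rng = np.random.default_rng(0)

# Greedy decoding: always take the argmax, which here is "Bratislava".
greedy = vocab[int(np.argmax(probs))]

# Plain sampling: any token can be drawn, including low-probability ones.
sampled = vocab[rng.choice(len(vocab), p=probs)]

# Nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative
# probability reaches p, renormalize, and sample only within that set.
def nucleus_sample(probs, p):
    order = np.argsort(probs)[::-1]          # tokens sorted by probability
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

nucleus = vocab[nucleus_sample(probs, p=0.5)]

print("greedy: ", greedy)
print("sampled:", sampled)
print("nucleus:", nucleus)
```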
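And for point 3, a rough numerical comparison of softmax against a log(exp(x)+1) (softplus) based weighting. Normalizing the softplus values so they sum to one is one possible way to slot it in where softmax is expected; that detail is an assumption here, not a description of any particular hardware implementation:

```python
# Sketch: softmax vs. a softplus-based weighting, log(exp(x) + 1).
# Normalizing the softplus values so they sum to one is an assumed choice for
# making them comparable to softmax outputs.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def softplus(x):
    # log(exp(x) + 1), written in a numerically stable form.
    return np.logaddexp(x, 0.0)

def softplus_weights(x):
    s = softplus(x)
    return s / s.sum()

x = np.array([4.0, 1.0, 2.0, 1.5, 0.5])
print("softmax:        ", np.round(softmax(x), 3))
print("softplus-based: ", np.round(softplus_weights(x), 3))
```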
top 8 comments
[–] waiting4omscs@alien.top 1 points 10 months ago

On 2, the user's intent is unclear with what you've given. Do they want the answer, or is it part of some other narrative? There are a ton of valid continuations, like "located", "larger", "smaller", "the", "not"... How would a different sampling scheme be better than what's currently available?

[–] residentmouse@alien.top 1 points 10 months ago

I’d add a few others to this list, but I largely agree with the premise that we focus too much on attention. We lavish praise on the Transformer, yet a lot of extra machinery goes into making it work even a little bit, and now papers are coming out claiming ConvNets scale at the same rate, and the RetNet paper claims you can swap out attention altogether.

Obv. the issue is “emergence” (terrible term, but I mean non-linear training performance) and the sheer cost of testing permutations of LLM architecture at scale. To what extent has the ML community become the victim of sunk cost?

[–] cnapun@alien.top 1 points 10 months ago

As another commenter has pointed out, 2 is an active area of research; it's much easier to experiment with sampling in decoding because it generally involves a fixed model.

For your example, I believe nucleus sampling would solve that, because the probability of the correct token should be high enough that it dominates the nucleus (although I've only read cursory summaries and haven't read the paper/implementation in depth).

[–] Doormatty@alien.top 1 points 10 months ago

What are the current areas of research with regards to tokenization?

[–] VastUnique@alien.top 1 points 10 months ago

But this doesn’t always make sense, especially for sentences that should have a clear answer. For example, if the sentence starts with "The capital of Slovakia is"

A city? An interesting place? A place with amazing restaurants and culture? Language is extremely flexible and modular.

[–] ReasonablyBadass@alien.top 1 points 10 months ago

To 1: I remember a recent paper saying they got better results without tokenisation, at least in one area. Don't have the link right now though.

[–] ACreativeNerd@alien.top 1 points 10 months ago

Could someone explain how/why log(exp(x)+1) works?

[–] Dangerous-Flan-6581@alien.top 1 points 10 months ago

On 2, totally agree!
On 3, how is log(exp(x)+1) an alternative to softmax? The outputs are not class probabilities. But I agree in general; I have many, many problems with the use of softmax and hope there is a better alternative.