damhack

joined 1 year ago
[–] damhack@alien.top 1 points 11 months ago

It depends on the level of abstraction at which you are claiming deterministic behaviour. As stated elsewhere, at the upper level of qualia, it’s hard to say whether something that looks and feels like a decision made with free will is or isn’t.

Likewise, if you move down to the lower levels of bit patterns, electron flow or quantum events, the behaviour looks non-deterministic to an outside observer.

So, at the absurd level of abstraction that posits that the symbols manipulated by executing software are real phenomena, you could argue that neural nets are deterministic.

But at what point and to which observer does complexity become indistinguishable from randomness?

Claiming determinism based on the perfect functioning of an idealized computer is a shaky argument, when we know in practice that abstraction levels bleed into each other, form strange loops, and the Blue Screen of Death is only ever a couple of flipped bits away, especially when the sun flares and you’re not using ECC RAM.

[–] damhack@alien.top 1 points 11 months ago

Divide those timescales by 4 and you are on the mark.

[–] damhack@alien.top 1 points 11 months ago

Transformers are not a path to AGI. They’re too dumb and static. Active Inference is where it’s at.

[–] damhack@alien.top 1 points 11 months ago

Depends on whether the bubble you are referring to is the current crop of Transformers, which will keep running for a good while. However, the bubble will only get larger once Karl Friston’s group starts releasing and low-power, Transformer-optimised optical chips start production next year. I can’t see an end to it, as most current issues are in the process of being solved, including smaller, faster, less hallucinatory low-compute models. We’re only just hitting the multimodal on-ramp, and that journey has far to go.

[–] damhack@alien.top 1 points 1 year ago

That’s how pretraining is already done, and you would have the same issue: orders of magnitude greater latency. Given the number of calculations per training epoch, you don’t want to be bound by the slowest worker in the cluster. OpenAI et al. use 40 Gbps (or 100 Gbps nowadays) backplanes between A100/H100 GPU servers. Sending data over the Internet to an Nvidia 1080 is simply slow.
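
For a sense of scale, here’s a rough back-of-envelope sketch. The model size, precision and link speeds are my own illustrative assumptions, not figures from this thread:

```python
# Back-of-envelope: time to move one full fp16 gradient exchange for a
# hypothetical 7B-parameter model over different links. Illustrative only.

PARAMS = 7e9          # assumed parameter count
BYTES_PER_PARAM = 2   # fp16
payload_bits = PARAMS * BYTES_PER_PARAM * 8

links_gbps = {
    "datacenter backplane (100 Gbps)": 100,
    "fast home fibre (1 Gbps)": 1,
    "typical home upload (20 Mbps)": 0.02,
}

for name, gbps in links_gbps.items():
    seconds = payload_bits / (gbps * 1e9)
    print(f"{name}: ~{seconds:,.0f} s per full exchange")
```

At the assumed home-upload speed, a single full exchange takes on the order of an hour and a half before any compute even happens, which is the whole problem.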

[–] damhack@alien.top 1 points 1 year ago

You’d better tell the GPU manufacturers that LLM workloads can’t be parallelized.

The point of Transformers is that the matrix operations can be parallelized, unlike in standard RNNs.

The issue with distributing those parallel operations is that for every partition of the workload, you introduce latency.

If you offload a layer at a time, then you are introducing both the latency of the slowest worker and the network latency, plus the latency of combining results back into one set.

If you partition at a finer grain, e.g. parts of a layer, then you add even more latency.

Per-layer latency can go from ~1 ms in a monolithic LLM to more than 1 s when distributed like this. That means response times measured in multiple minutes.
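
To make the arithmetic concrete, here’s a minimal sketch of how per-layer latency compounds during autoregressive decoding. The layer count, token count and latencies are illustrative assumptions, not measurements:

```python
# Illustrative latency model: every generated token passes through every
# layer, so per-layer latency multiplies by layers and then by tokens.

N_LAYERS = 80    # assumed decoder layer count
N_TOKENS = 100   # assumed length of the generated response

def response_time(per_layer_latency_s: float) -> float:
    """Total decode time if each token traverses all layers sequentially."""
    return N_TOKENS * N_LAYERS * per_layer_latency_s

for label, per_layer in [
    ("monolithic GPU, ~1 ms/layer", 1e-3),
    ("layer-per-worker over the Internet, ~1 s/layer", 1.0),
]:
    t = response_time(per_layer)
    print(f"{label}: ~{t:,.0f} s (~{t / 60:,.1f} min)")
```

At 1 s per layer, a single token already takes over a minute to produce, so any non-trivial response stretches into minutes or worse.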

[–] damhack@alien.top 1 points 1 year ago

How about exploring a framework for synchronizing time-variant multi-modal inputs, so that multiple “senses” can be associated with an “event” that can be treated as an inference object? Extend this to simulating synesthesia too and you have a powerful approach for feeding cognition in AGI and robotics.
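
As a purely hypothetical sketch of what such an “event” object might look like (the names, the 100 ms window and the binning rule are my own illustrative choices, not an existing design):

```python
# Hypothetical sketch: group timestamped samples from multiple sense
# streams into a single "event" when they fall in the same time window.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Sample:
    modality: str      # e.g. "vision", "audio", "proprioception"
    timestamp: float   # seconds
    payload: Any       # raw or encoded sensor data

@dataclass
class Event:
    t_start: float
    t_end: float
    samples: dict[str, list[Sample]] = field(default_factory=dict)

    def add(self, s: Sample) -> None:
        self.samples.setdefault(s.modality, []).append(s)

def bin_into_events(samples: list[Sample], window_s: float = 0.1) -> list[Event]:
    """Associate samples that land inside the same window with one event."""
    events: list[Event] = []
    for s in sorted(samples, key=lambda x: x.timestamp):
        if events and s.timestamp < events[-1].t_end:
            events[-1].add(s)
        else:
            ev = Event(t_start=s.timestamp, t_end=s.timestamp + window_s)
            ev.add(s)
            events.append(ev)
    return events
```

An event built this way can then be handed to whatever does the inference, and a synesthesia-style variant could simply re-encode one modality’s samples into another’s representation before binning.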