honestduane


What is Q*? Everybody is asking. OpenAI has not officially made it public, but I figured out it was related to the A* pathfinding algorithm used in game AI. So I built up the context in OpenAI's ChatGPT, got it to explain the concept, and here you go.

The Q* algorithm is a reinforcement learning algorithm used in machine learning for solving problems related to decision-making and sequential actions. It is closely related to the Q-learning algorithm and is designed to find an optimal policy in a Markov decision process (MDP), where an agent interacts with an environment to maximize a cumulative reward.

Here's how the Q* algorithm works:

Initialization: Initialize a Q-table that represents the expected cumulative rewards for each state-action pair in the MDP. Initially, these values are often set to zero or random values.
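
For illustration, the initialization step might look like this in a minimal sketch (the state and action counts are made-up placeholders for a small grid-world MDP, not anything OpenAI has published):

```python
import numpy as np

# Hypothetical MDP size, chosen purely for illustration.
n_states = 16   # e.g., a 4x4 grid world
n_actions = 4   # e.g., up / down / left / right

# Q-table of expected cumulative rewards, one entry per (state, action) pair.
# Starting from zeros; small random values are another common choice.
Q = np.zeros((n_states, n_actions))
```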

Exploration vs. Exploitation: The agent decides whether to explore new actions or exploit the current knowledge to maximize expected rewards. Exploration is important for discovering better actions, while exploitation is about choosing actions based on the current Q-table.

Action Selection: The agent selects an action based on an exploration-exploitation strategy. A common strategy is epsilon-greedy, where the agent explores a random action with probability epsilon and otherwise chooses the action with the highest Q-value (with probability 1 - epsilon).

Interact with the Environment: The agent performs the selected action and observes the new state and the immediate reward from the environment.
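
As a sketch (assuming the Q-table from the initialization step above), epsilon-greedy selection could be written like this:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, state, epsilon=0.1):
    """Explore a random action with probability epsilon, otherwise exploit the Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: pick any action at random
    return int(np.argmax(Q[state]))            # exploit: pick the highest-valued action
```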

Update Q-Values: Using the observed reward and the new state, the agent updates the Q-value for the previous state-action pair. Q* uses a slightly different update rule compared to Q-learning.

The update equation for Q* is: Q*(s, a) = Q*(s, a) + α * [R + γ * max_a' Q*(s', a') - Q*(s, a)], where:

Q*(s, a) is the updated Q-value for state s and action a.

α is the learning rate, controlling how much the Q-value is updated.

R is the immediate reward obtained after taking action a in state s.

γ is the discount factor that determines the importance of future rewards.

s' is the new state after taking action a.

a' is the action that maximizes the Q-value in state s'.
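
Written out against a tabular Q, that update is essentially a one-liner. This is just a sketch; the variable names mirror the symbols above, and the default alpha and gamma are arbitrary illustration values:

```python
import numpy as np

def q_update(Q, s, a, R, s_next, alpha=0.1, gamma=0.99):
    """Apply one step of the update equation above to a tabular Q (2-D numpy array)."""
    td_target = R + gamma * np.max(Q[s_next])   # R + γ * max_a' Q*(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q*(s, a) toward that target
    return Q
```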

Repeat: Continue the process of action selection, interaction with the environment, and Q-value updates for a large number of iterations or until convergence.

Policy Extraction: Once the Q* algorithm has converged or reached a suitable point, the optimal policy can be extracted by selecting the action with the highest Q-value for each state.
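
In code, extracting that greedy policy from a tabular Q is just an argmax per state (again a sketch):

```python
import numpy as np

def extract_policy(Q):
    """Greedy policy: for each state, the action with the highest Q-value."""
    return np.argmax(Q, axis=1)
```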

The goal of the Q* algorithm is to find the optimal Q-values that represent the expected cumulative rewards for each state-action pair, leading to an optimal policy that maximizes the agent's long-term rewards in the Markov decision process.
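
To put the steps above together, here is a complete toy run you can execute end to end. The environment is a made-up five-state chain (keep moving right to reach a +1 reward), invented purely so the loop has something to learn; none of this is OpenAI's actual setup, just plain tabular Q-learning as described above:

```python
import numpy as np

# Toy chain MDP, invented for illustration: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 pays a reward of 1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2

def env_step(state, action):
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))        # initialization
alpha, gamma, epsilon = 0.1, 0.99, 0.1     # arbitrary illustration values

for episode in range(200):                 # repeat until (approximate) convergence
    state, done = 0, False
    for _ in range(100):                   # cap episode length
        # epsilon-greedy action selection (ties between equal Q-values broken at random)
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(rng.choice(best))
        # interact with the environment
        next_state, reward, done = env_step(state, action)
        # update the Q-value with the equation above
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        if done:
            break

print(np.argmax(Q, axis=1))                # policy extraction: 1 ("right") in states 0-3
```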

The fun thing? It's just the same scientific process we humans use to learn: trying new things, evaluating our results, taking notes, and stopping if an idea doesn't seem to be working out. But because it requires the "tree of mind" logic described mathematically above, it's very expensive to run, and it shows the value of brain cycles as CPU cycles.

[–] honestduane@alien.top 1 points 11 months ago

Q* was completely explained, and OpenAI explained what it was. I was even able to make a YouTube video about it because their explanation was so clear, so I was able to explain it as if you were five years old.

I don’t understand how people believe this is a secretive thing and I don’t understand why people aren’t talking about how simple it is.

Everybody is talking about this like it's some grand secret. Why?

I mean, the algorithm is expensive to run, but it’s not that hard to understand.

Can somebody please explain why everybody’s acting like this is such a big secret thing?

[–] honestduane@alien.top 1 points 1 year ago

Basically, what happened was the government said they wanted greater accountability for models and AI, so they added ambiguous requirements that are kind of impossible to enforce all the way, but that also create other problems for the current generation of AI.

For example, a model must comply by not being a national security threat. That's mostly meant to deal with misinformation, but it's open-ended enough that a court order could render any model noncompliant overnight.

Citations are a big one that people don't like, but I actually like them, because I feel the models should be able to tell where they got their data from, to limit hallucinations. It's just not something the current technical frameworks make easy, or even possible, because the training process itself is designed in a way that makes this impossible in most cases. Companies don't like it because it gets rid of the ambiguity they previously had around the data sets they were using, and it brings to light the fact that companies making these models now have to explicitly make sure every single bit of training data both allows them to use it and is something they are OK presenting as a citation.

This brings back all the intellectual property issues that are currently being ignored by the groups going around and stealing data for training. It also cuts the shadow elements out of the market, and they don't like that, but I think it's a good thing.

Most of the people complaining about this are just angry because they understand it also widens the moat, while at the same time directly harming OpenAI, the current incumbent, which does not use citations the correct way.