🧠Q-learning builds on the Markov Decision Process (MDP) framework, which models a problem as states, actions, transition probabilities, and rewards.
🔑The Q function estimates the expected cumulative (typically discounted) reward of taking a specific action in a given state and following the policy afterwards.
💡A policy maps each state to an action; in Q-learning, the greedy policy picks the action with the highest Q value in the current state.
🔄Q-learning iteratively updates the Q function by comparing its current estimate with a target: the observed reward plus the discounted maximum Q value of the next state.
⚖️The optimal policy is determined by the optimal Q values, which represent the expected total reward for each action in each state when acting optimally afterwards.
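
The points above can be sketched as a small tabular Q-learning loop. This is a minimal illustration, not a reference implementation: the five-state "chain" environment, the `step`, `train`, and `greedy_policy` helpers, and the values of `GAMMA` and `ALPHA` are all assumptions made up for this example.

```python
import random

# Hypothetical toy MDP (an assumption for illustration):
# states 0..4 on a line, state 4 is terminal.
N_STATES = 5
ACTIONS = (0, 1)   # 0 = move left, 1 = move right
GAMMA = 0.9        # discount factor (illustrative value)
ALPHA = 0.5        # learning rate (illustrative value)


def step(state, action):
    """Deterministic dynamics: reward 1.0 only on entering the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, max_steps=100, seed=0):
    """Tabular Q-learning with a uniformly random behavior policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Q-learning is off-policy, so purely random exploration is allowed.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Target: observed reward plus discounted best Q value of the next state.
            target = reward if done else reward + GAMMA * max(q[next_state])
            # Move the current estimate toward the target.
            q[state][action] += ALPHA * (target - q[state][action])
            if done:
                break
            state = next_state
    return q


def greedy_policy(q):
    """Pick the highest-valued action in every non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


q_table = train()
policy = greedy_policy(q_table)
print(policy)
```

In this toy setting the only reward lies at the right end of the chain, so the learned greedy policy should choose "right" in every non-terminal state, and the Q values should shrink by roughly a factor of `GAMMA` for each extra step away from the goal.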