Please turn JavaScript on
Daily Dose of Data Science icon

Daily Dose of Data Science

Click on the "Follow" button below and you'll get the latest news from Daily Dose of Data Science via email, mobile or you can read them on your personal news page on this site.

You can unsubscribe anytime you want easily.

You can also choose the topics or keywords that you're interested in, so you receive only what you want.

Daily Dose of Data Science title: Daily Dose of Data Science

Is this your feed? Claim it!

Publisher:Ā  Unclaimed!
Message frequency:Ā  0.17 / day

Message History

Recap

In Chapter 9, we studied RLHF. We started from a simple asymmetry, for most tasks worth doing, good behavior is hard to write down but easy to recognize. So instead of writing rules, we collect human judgments and learn from them.

We then assembled the pipeline in stages. Supervised fine-tuning gave us a starting policy and a frozen reference model. The reward mod...


Read full story

Every agent, underneath whatever framework you’re using, runs the same loop.

while True: response = model(context) if response.has_tool_calls(): context += run_tools(response.tool_calls) else: breakSend the context to the modelIt responds with tool callsRun those toolsAppend the results to the contextAnd send it back.

It keeps going until the model replies without ...


Read full story
Recap

In Chapter 8, we studied proximal policy optimization (PPO). We started from the concern that on-policy data is only valid near the policy that produced it. A single oversized update can therefore push the agent off a cliff. The fix was the trust region, which meant a limit on how far each update may move the policy.

We then built the machinery in stages. The impo...


Read full story
Recap

Chapter 7 taught us about the policy gradient approach. Instead of learning values and acting greedily, we parameterized the policy directly and pushed its parameters in the direction that increases expected return.

The log-derivative trick turned an intractable gradient into something we could estimate from sampled experience.

REINFORCE was the first algori...


Read full story
Recap

In chapter 6, we moved from linear value function approximation to neural networks.

The value function generalized from a linear form to a neural network. What we gave up in that move was the convergence guarantee that held for linear on-policy methods.

We then saw what breaks when deep networks meet online Q-learning: correlated samples, moving targets, and...


Read full story