Scaling Reinforcement Learning (RL) has the potential to enhance model performance beyond conventional pretraining and post-training methods. Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models. For instance, DeepSeek R1 has achieved state-of-the-art performance by integrating cold-start data and multi-stage training, enabling deep thinking and complex reasoning.
Our research explores how well RL scales and how much it can improve the intelligence of large language models.
Even the 7B version of DeepSeek-R1 thinks less, doesn’t second-guess itself, and comes back with the correct answer:
<think>
First, I need to compare the two numbers: 9.21 and 9.1999999999.
Both numbers have the same integer part, which is 9.
Next, I'll look at their decimal parts:
- The first number has a decimal part of 0.21.
- The second number has a decimal part of approximately 0.20 (since there are many trailing nines after the decimal point).
To make an accurate comparison, it's helpful to express both decimals with the same number of places. For 9.21, I can write it as 9.210...
For 9.1999999999, it remains approximately 9.20 when considering several decimal places due to the trailing nines.
After comparing both numbers up to a reasonable number of decimal places, it's clear that 9.21 is greater than 9.1999999999.
</think>
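The comparison the model reasons through can be checked exactly rather than by approximation. The following is a minimal sketch, not part of the original evaluation: it uses Python's standard `decimal` module, and the helper name `larger_decimal` is hypothetical, introduced here only for illustration.

```python
from decimal import Decimal


def larger_decimal(a: str, b: str) -> str:
    """Return whichever decimal string is numerically larger, or 'equal'.

    Hypothetical helper for illustration. Parsing the strings as Decimal
    avoids binary floating-point rounding, so the comparison is exact.
    """
    x, y = Decimal(a), Decimal(b)
    if x == y:
        return "equal"
    return a if x > y else b


first, second = "9.21", "9.1999999999"

# Exact comparison of the two numbers from the prompt.
winner = larger_decimal(first, second)

# Exact gap between them: 9.2100000000 - 9.1999999999.
gap = Decimal(first) - Decimal(second)

print(f"larger: {winner}, gap: {gap}")
```

The gap is tiny but strictly positive, which agrees with the model's conclusion that 9.21 is the greater number.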