Researchers from Arizona State University argue that the step-by-step "reasoning" produced by Chain-of-Thought prompting is a brittle mirage:

Our investigation, conducted through the controlled environment of DataAlchemy, reveals that the apparent reasoning prowess of Chain-of-Thought (CoT) is largely a brittle mirage. The findings across task, length, and format generalization experiments converge on a conclusion: CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces. 
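To make the generalization axes concrete, here is a minimal sketch of the kind of length-generalization probe the paper describes: evaluate on compositions of simple transformations longer than anything in the training distribution. This is not the authors' DataAlchemy code; the ROT-composition task, the split depths, and the `model_answer` callable are all illustrative assumptions.

```python
# Hypothetical sketch of a length-generalization probe (illustrative only;
# not the DataAlchemy benchmark itself).

import random
import string


def rot(text: str, k: int) -> str:
    """Rotate each letter k positions through the alphabet (a toy atomic op)."""
    letters = string.ascii_lowercase
    return "".join(letters[(letters.index(c) + k) % 26] for c in text)


def make_example(depth: int) -> tuple[str, str]:
    """Compose `depth` rotations into one problem: prompt -> expected answer."""
    word = "".join(random.choices(string.ascii_lowercase, k=5))
    ks = [random.randint(1, 25) for _ in range(depth)]
    answer = word
    for k in ks:
        answer = rot(answer, k)
    prompt = f"Apply ROT{'+ROT'.join(map(str, ks))} to '{word}'."
    return prompt, answer


# In-distribution: composition depths like those seen in training (assumed 1-2).
in_dist = [make_example(depth=random.choice([1, 2])) for _ in range(100)]
# Out-of-distribution: longer compositions than anything seen in training.
out_dist = [make_example(depth=4) for _ in range(100)]


def accuracy(model_answer, examples) -> float:
    """model_answer: any callable mapping a prompt string to a predicted string."""
    return sum(model_answer(p) == a for p, a in examples) / len(examples)

# With a real model plugged in as `model_answer`, the paper's claim predicts
# a sharp gap: high accuracy(model, in_dist), collapsing accuracy(model, out_dist).
```

The same scaffold extends to the other two axes: swap in unseen transformation types for task generalization, or reword the prompt template for format generalization.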

Ars Technica has a summary.