> In other words, it seems as if you can take any past state and fold it into one large concatenated present state
That's exactly what n-grams are: even traditional token/word/character-based Markov chains don't rely on just the most recent word. Typical Markov chains in NLP are 3- to 7-grams.
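For concreteness, here's a minimal Python sketch (my own illustration, not anyone's actual implementation) of a word-level trigram chain: the "state" is the tuple of the last two words, so the next-word distribution depends on more than the most recent word, yet the chain is still Markovian with respect to that larger state.

```python
import random
from collections import defaultdict

def build_trigram_chain(tokens):
    # Map each state (previous two words) to the words observed after it.
    chain = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        chain[(a, b)].append(c)
    return chain

def generate(chain, state, length=20):
    out = list(state)
    for _ in range(length):
        candidates = chain.get(state)
        if not candidates:
            break
        nxt = random.choice(candidates)
        out.append(nxt)
        # Slide the window: the new state folds the new token into the last two.
        state = (state[1], nxt)
    return " ".join(out)

tokens = "the cat sat on the mat and the cat ran".split()
chain = build_trigram_chain(tokens)
print(generate(chain, ("the", "cat")))
```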
> Can you give an example of a non-Markov system?
Encoder-decoder LLMs violate the Markov property and would not count as Markov chains.
If you include the encoder outputs as part of the state, then encoder-decoder LLMs are Markovian as well. Meanwhile, in raw token space, decoder-only LLMs are not Markovian either. Anything can be a Markov process depending on what you include in the state. Humans, or even the universe itself, are Markovian. I don't see what insight about LLMs you and other commenters are gesturing at.
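To illustrate the "it depends what you call the state" point, here is a sketch (purely illustrative; `next_token_distribution` is a made-up stand-in for an LLM forward pass): if the state is defined as the entire prefix, each sampling step is a function of the current state alone, which is all the Markov property asks for.

```python
import random

def next_token_distribution(prefix):
    # Hypothetical stand-in for an LLM: next-token probabilities given the prefix.
    vocab = ["a", "b", "<eos>"]
    return {tok: 1.0 / len(vocab) for tok in vocab}

def step(state):
    # The transition uses only the current state (the full prefix): the Markov property.
    dist = next_token_distribution(state)
    tok = random.choices(list(dist), weights=list(dist.values()))[0]
    return state + (tok,)

state = ("<bos>",)
while state[-1] != "<eos>" and len(state) < 10:
    state = step(state)
print(state)
```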