You Can Learn Tokenization End-to-End with Reinforcement Learning
My work on learning to draw token boundaries using time-discounted score function estimates has been accepted into ICML 2026! [paper], [code], [Spotlight talk at ICLR 2026 workshop]
Adding Token-Strided Convolutions to NanoGPT
Unrelated to the paper, I tried adding byte-level information to a standard tokenized LLM by applying token-strided convolutions to byte embeddings. This roughly corresponds to applying a byte projection followed by a linear projection to the following layout, in which each column holds the bytes of one token:
|   |   |   |   | o |   |   |
|   |   |   |   | l |   |   |
|   |   |   |   | u |   |   |
| t |   |   | _ | t |   | _ |
| o | _ | i | c | i | _ | c |
| k | s | d | o | o | a | o |
| e | t | e | n | n | r | o |
| n | r | d | v | s | e | l |
An interesting negative result is that this doesn’t meaningfully affect the NanoGPT speedrun baseline in terms of downstream loss.
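To make the layout above concrete, here is a minimal pure-Python sketch of the idea, not the code used in the experiment. It builds one left-padded byte window per token (the columns of the table), embeds each byte, and applies a position-specific linear map summed over the window. The names `token_byte_windows` and `token_strided_conv`, the padding ID `256`, the kernel size of 8, the toy dimensions, and the reading of `_` as a space are all assumptions made for illustration. How the resulting per-token features are combined with the regular token embeddings is not specified here and is left out.

```python
import random

PAD = 256  # hypothetical padding ID, one past the largest byte value


def token_byte_windows(tokens, kernel_size):
    """Return one left-padded window of byte IDs per token (one table column each)."""
    windows = []
    for tok in tokens:
        # Assumption: tokens longer than the kernel keep only their last bytes.
        ids = list(tok.encode("utf-8"))[-kernel_size:]
        windows.append([PAD] * (kernel_size - len(ids)) + ids)
    return windows


def token_strided_conv(windows, byte_emb, weight):
    """Embed each byte, apply a per-position linear map, and sum over the window.

    byte_emb: 257 x d_byte table (256 byte values plus PAD).
    weight:   kernel_size x d_byte x d_model.
    """
    d_model = len(weight[0][0])
    out = []
    for window in windows:
        vec = [0.0] * d_model
        for pos, byte_id in enumerate(window):
            for i, e in enumerate(byte_emb[byte_id]):
                for j in range(d_model):
                    vec[j] += e * weight[pos][i][j]
        out.append(vec)
    return out


# Tokenization taken from the table, with "_" read as a space (assumption).
tokens = ["token", " str", "ided", " conv", "olutions", " are", " cool"]
kernel_size, d_byte, d_model = 8, 4, 6  # toy sizes, chosen arbitrarily

rng = random.Random(0)
byte_emb = [[rng.gauss(0, 1) for _ in range(d_byte)] for _ in range(257)]
weight = [[[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_byte)]
          for _ in range(kernel_size)]

windows = token_byte_windows(tokens, kernel_size)
features = token_strided_conv(windows, byte_emb, weight)  # one d_model vector per token
```

In this sketch each window covers only the token's own bytes, matching the table. A convolution whose kernel extends past the token boundary would also see bytes from neighboring tokens, which is one way the real setup may differ.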