My work on learning to draw token boundaries using time-discounted score function estimates has been accepted into ICML 2026! [paper], [code], [Spotlight talk at ICLR 2026 workshop]

Adding Token-Strided Convolutions to NanoGPT

Unrelated to the paper, I tried to add byte-level information to standard tokenized LLMs using token-strided convolutions on byte embeddings. This roughly corresponds to applying a byte embedding followed by a linear projection to the following layout, where each column holds the bytes of one token, padded at the top:

             
        o    
        l    
        u    
t     _ t   _
o _ i c i _ c
k s d o o a o
e t e n n r o
n r d v s e l
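
To make the layout concrete, here is a minimal sketch of one way to build it and project it, using only the Python standard library. It is an illustration under assumptions, not the code used in the experiment: the window width of 8, the padding id 0 (with byte values shifted up by 1), the dimensions, the random weights, and the helper names `token_byte_windows`, `render`, and `byte_features` are all hypothetical.

```python
import random

PAD = 0  # assumed padding id; real byte values are shifted up by 1 to make room


def token_byte_windows(tokens, width):
    """Right-align each token's UTF-8 bytes in a fixed-width window (pad first)."""
    windows = []
    for tok in tokens:
        ids = [b + 1 for b in tok.encode("utf-8")][-width:]  # keep at most `width` bytes
        windows.append([PAD] * (width - len(ids)) + ids)
    return windows


def render(windows):
    """Draw the windows as columns, one token per column (ASCII tokens only)."""
    rows = zip(*windows)  # transpose: row r holds position r of every token
    return "\n".join(
        " ".join(chr(i - 1) if i != PAD else " " for i in row) for row in rows
    )


def byte_features(windows, embed, proj):
    """Embed each byte, concatenate per token, then apply one linear projection."""
    out = []
    for win in windows:
        flat = [x for i in win for x in embed[i]]  # length: width * d_byte
        out.append([sum(w * x for w, x in zip(row, flat)) for row in proj])
    return out


tokens = ["token", "_str", "ided", "_conv", "olutions", "_are", "_cool"]
width, d_byte, d_model = 8, 4, 6  # illustrative sizes only
windows = token_byte_windows(tokens, width)
print(render(windows))

rng = random.Random(0)
# 256 byte values plus one padding row; random stand-ins for learned weights
embed = [[rng.gauss(0, 1) for _ in range(d_byte)] for _ in range(257)]
proj = [[rng.gauss(0, 1) for _ in range(width * d_byte)] for _ in range(d_model)]
features = byte_features(windows, embed, proj)  # one d_model vector per token
```

Because the same projection is applied to every token's window, this behaves like a convolution over byte embeddings that steps one token at a time. In a model, each resulting vector would presumably be combined with the corresponding token embedding, though how exactly is a design choice this sketch leaves open.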

An interesting negative result: this does not meaningfully change the downstream loss of the NanoGPT speedrun baseline.