simons blog

03 Oct, 2025 Mutual Refinement and Composition
27 Sep, 2025 Applied introduction to Categorical treatment of CuTe
23 Sep, 2025 Layout Gymnastics
20 Sep, 2025 Swizzles and their usage in CuTeDSL Kernels
14 Sep, 2025 CuTe partitions
09 Sep, 2025 Tensor Slicing in CuTe
07 Sep, 2025 Understanding CuTe Swizzling - The Math Behind 32B, 64B, and 128B Patterns
05 Sep, 2025 GPU L2 Cache Persistence
29 Aug, 2025 Cuda streams
23 Aug, 2025 PingPong 🏓 in the CuTeDSL with QuACK 🦆
17 Aug, 2025 Bit Hacking in C
13 Aug, 2025 Intuition behind Hierarchical Layouts
09 Aug, 2025 Persistent Float8 Dense Gemm on Hopper
04 Aug, 2025 Epilogue in CuTeDSL H100 kernels
30 Jul, 2025 Let the compiler do the work in CuTeDSL
25 Jul, 2025 Persistent GEMM in CuTeDSL on Hopper
20 Jul, 2025 Consumer-Producer pattern on H100 in CuTeDSL
13 Jul, 2025 Backprop through LayerNorm
13 Jul, 2025 Backprop through RMSNorm
12 Jul, 2025 Outperform compiled PyTorch code using QuACK 🦆
05 Jul, 2025 CuTeDSL on Hopper - Pipelining
03 Jul, 2025 CuTeDSL on Hopper - WGMMA and TMA intro
28 Jun, 2025 Thread Value Layouts in CuTe
26 Jun, 2025 SGEMM in CuTeDSL
23 Jun, 2025 An applied introduction to CuTeDSL
21 Jun, 2025 Calculating the fibonacci numbers on GPU
16 Jun, 2025 An introduction to Thrust
13 Jun, 2025 Programming tensor cores in Mojo
09 Jun, 2025 Infinite binary strings
06 Jun, 2025 Highly efficient matrix transpose in Mojo 🔥
05 Jun, 2025 The Bijection Between Natural Numbers and Binary Strings
04 Jun, 2025 Use TMA without CUDA
29 May, 2025 Use PTX instructions in Mojo
25 May, 2025 Very fast vector sum without CUDA.
22 May, 2025 Short introduction to the Mojo programming language
18 May, 2025 Bridging Math and Code: CuTe Layout Algebra in CuTeDSL
14 May, 2025 Load and store matrices efficiently with PTX instructions
11 May, 2025 How to use reasoning models with SGLang
08 May, 2025 A short note on Tensorcores and Inline PTX Assembly
02 May, 2025 Making matrix transpose really fast on Hopper GPUs
27 Apr, 2025 TMA introduction
21 Apr, 2025 Analyze CUDA programs by looking at GPU assembly.
18 Apr, 2025 Making RMSNorm really fast
13 Apr, 2025 Making prefix sum really fast
06 Apr, 2025 Making vector sum really fast
31 Mar, 2025 Predication in Cutlass
23 Mar, 2025 Indexing in CUDA