I turned ChatGPT into a free Grammarly Pro replacement without any vibe coding - How-To Geek
I turned ChatGPT into a free Grammarly Pro replacement without any vibe coding How-To Geek
Could not retrieve the full article text.
Read on Google News: ChatGPT →Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Models

I wrote a fused MoE dispatch kernel in pure Triton that beats Megablocks on Mixtral and DeepSeek at inference batch sizes
Been working on custom Triton kernels for LLM inference for a while. My latest project: a fused MoE dispatch pipeline that handles the full forward pass in 5 kernel launches instead of 24+ in the naive approach. Results on Mixtral-8x7B (A100): Tokens vs PyTorch vs Megablocks 32 4.9x 131% 128 5.8x 124% 512 6.5x 89% At 32 and 128 tokens (where most inference serving actually happens), it's faster than Stanford's CUDA-optimized Megablocks. At 512+ Megablocks pulls ahead with its hand-tuned block-sparse matmul. The key trick is fusing the gate+up projection so both GEMMs share the same input tile from L2 cache, and the SiLU activation happens in registers without ever hitting global memory. Saves ~470MB of memory traffic per forward pass on Mixtral. Also tested on DeepSeek-V3 (256 experts) and




Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!