Nobody Talks About This ChatGPT Money Hack
Everyone is obsessed with ChatGPT right now. Continue reading on Medium »

More in Models
Microsoft launches in-house AI models MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, built by its superintelligence team, as it pursues "AI self-sufficiency" (Michael Nuñez/VentureBeat)
Michael Nuñez / VentureBeat: Microsoft on Wednesday launched three new foundational AI models it built entirely in-house …
An interview with Mustafa Suleyman on Microsoft's AI reorg, how revising its OpenAI contract "unlocked [Microsoft's] ability to pursue superintelligence", more (Hayden Field/The Verge)
Hayden Field / The Verge: Its new transcription model is a step towards those goals, says Microsoft AI's Mustafa Suleyman.
Vulkan backend much easier on the CPU and GPU memory than CUDA.
On Linux, I compiled my own llama.cpp with CUDA support. Running Qwen3.5-9B-GGUF:Q4_K_M on my potato-like RTX A2000 12GB, top would always show one CPU core pegged at 100%, and nvidia-smi reported 11GB+ of GPU memory usage. Speed was ~30 tokens per second. My system fans would spin up whenever that single core got pegged, which was annoying to listen to.

I decided to compile llama.cpp again with the Vulkan backend to see if anything would be different. With the exact same model, it made a big difference. Now top shows just one CPU core at about 30% usage, and nvidia-smi shows only 7.2GB of GPU memory usage. Speed is the same at ~30 tokens per second, and the system fan no longer spins up during inference.

Just curious why the GPU memory footprint and CPU usage are so much lower with the Vulkan backend.
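The comparison above comes down to configuring llama.cpp twice with different backends. A minimal sketch of the two builds, assuming a checkout of ggml-org/llama.cpp with cmake and the matching SDK (CUDA toolkit or Vulkan SDK) installed; the `GGML_CUDA`/`GGML_VULKAN` flags are the ones in llama.cpp's build guide, while the `build-$BACKEND` directory names are just illustrative. The script echoes the commands rather than executing them, so it runs anywhere:

```shell
#!/usr/bin/env sh
# Sketch: pick a llama.cpp backend and print the corresponding build commands.
# Drop the echoes (run $CONFIGURE and $BUILD directly) to actually build.

BACKEND="${1:-vulkan}"          # pass "cuda" or "vulkan"; defaults to vulkan

case "$BACKEND" in
  cuda)   FLAGS="-DGGML_CUDA=ON" ;;     # CUDA backend (needs CUDA toolkit)
  vulkan) FLAGS="-DGGML_VULKAN=ON" ;;   # Vulkan backend (needs Vulkan SDK)
  *)      echo "unknown backend: $BACKEND" >&2; exit 1 ;;
esac

# Separate build dirs keep both backends around for side-by-side tests.
CONFIGURE="cmake -B build-$BACKEND $FLAGS"
BUILD="cmake --build build-$BACKEND --config Release -j"
echo "$CONFIGURE"
echo "$BUILD"
```

Keeping both build trees (`build-cuda/`, `build-vulkan/`) makes it easy to rerun the same model against each backend while watching top and nvidia-smi, as the post describes.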

