Please turn JavaScript on

hgpu

Want to stay in touch with the latest updates from Hgpu? That's easy! Just subscribe clicking the Follow button below, choose topics or keywords for filtering if you want to, and we send the news to your inbox, to your phone via push notifications or we put them on your personal page here on Specificfeeds.

Reading your RSS feed has never been easier!

Website title: High performance computing on graphics processing units | hgpu.org

Is this your feed? Claim it!

Publisher:  Unclaimed!
Message frequency:  0.56 / day

Message History

Iterative GPU kernel tuning is bottlenecked by the scale of the applications that host the kernels. Rapid iteration requires isolating the kernel so it can be edited, recompiled, and validated without rebuilding the full application — but manual isolation requires reconstructing build flags, dispatch configuration, and runtime inputs by hand, so developers usually settle for ...


Read full story

Performance profiles of GPU kernels generated by tools such as Nsight Compute are rich in detail but are often challenging to interpret. To achieve the best performance possible on a given GPU architecture, kernel developers need to spend significant time analyzing and comparing profiles in the tool’s graphical interface to identify and understand kernel performance bottlenec...


Read full story

Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for flexibility, while expert-written kernels achieve high efficiency but are difficult to adapt. Recent work explores large language models (LLMs) f...


Read full story

Rapidly evolving GPU architectures featuring complex memory hierarchies, matrix units, and varied precision formats continue to widen the gap between theoretical peaks and achievable performance. We design and develop analytical performance models for NVIDIA Blackwell (B200) and AMD CDNA3 (MI300A) grounded in systematic microbenchmark characterization. For Blackwell, the mode...


Read full story

The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like CuBLAS and NCCL provide optimized primitives, they lack the flexibility required for rapidly evolving model architectures. Conversely, existing tensor compilers fail to address the complex memory hierarchy of distributed clust...


Read full story