
hgpu


Website title: High performance computing on graphics processing units | hgpu.org

Message frequency:  0.95 / day

Message History

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle in this specific t...
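To make the translation task concrete, here is an illustrative sketch (plain Python, hypothetical function names): the "reference" is the high-level implementation an engineer writes, while the "kernel" mirrors the per-element body that a code-generation system would have to emit as a CUDA thread function.

```python
# Hypothetical illustration of the PyTorch-to-CUDA translation task.
# Both functions compute out = alpha * x + y elementwise.

def reference_scale_add(x, y, alpha):
    # High-level reference style, as in a PyTorch model
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def kernel_scale_add(x, y, alpha):
    # Kernel-style formulation: each loop index corresponds to one
    # GPU thread (in CUDA: i = blockIdx.x * blockDim.x + threadIdx.x)
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = alpha * x[i] + y[i]
    return out

print(reference_scale_add([1.0, 2.0], [3.0, 4.0], 2.0))  # [5.0, 8.0]
```

The difficulty the abstract points at is not this trivial rewrite but preserving numerical semantics while choosing launch configuration, memory layout, and fusion, which is where current LLMs still fall short.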



Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impa...



The explosive demand for artificial intelligence (AI) workloads has led to a significant increase in silicon area dedicated to lower-precision computations on recent high-performance computing hardware designs. However, mixed-precision capabilities, which can achieve performance improvements of up to 8x compared to double-precision in extreme compute-intensive workloads, rema...
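The "up to 8x" figure is a peak-throughput ratio; a minimal sketch of that arithmetic, using hypothetical device numbers (not measurements from the paper):

```python
# Illustrative peak throughputs in TFLOP/s for a hypothetical accelerator;
# real values depend on the specific hardware.
peak_tflops = {"fp64": 10.0, "fp32": 20.0, "fp16_tensor": 80.0}

for precision, tflops in peak_tflops.items():
    # Theoretical ceiling vs. double precision for compute-bound work
    print(precision, "speedup vs fp64:", tflops / peak_tflops["fp64"])
```

Such ratios are upper bounds for extreme compute-bound workloads; memory-bound kernels see much less, which is part of why mixed-precision capabilities remain underused.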



Modern applications often involve complex, structured or data-parallel computations on large datasets. Traditionally, GPUs have served as the primary accelerators for such tasks, mostly through compute-focused models like CUDA and OpenCL. Vulkan is a more recent cross-platform API, widely adopted for both high-performance graphics and compute. These models require lower-level...



We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g. 3B to 32B parameters, on affordable, commodity GPUs. These devices are characterized by low memory availability and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation che...
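Activation checkpointing, one of the low-memory optimizations named in the abstract, trades recomputation for memory; a back-of-the-envelope sketch of the trade-off (layer counts and sizes are hypothetical):

```python
def activation_memory(layers, per_layer_mb, checkpoint_every=None):
    # Without checkpointing: every layer's activations are kept for backward.
    if checkpoint_every is None:
        return layers * per_layer_mb
    # With checkpointing: keep only one activation per segment boundary,
    # plus one segment's activations recomputed during backward.
    segments = -(-layers // checkpoint_every)  # ceiling division
    return (segments + checkpoint_every) * per_layer_mb

print(activation_memory(32, 100))                      # 3200 MB, no checkpointing
print(activation_memory(32, 100, checkpoint_every=8))  # 1200 MB, ~2.7x less
```

This roughly O(sqrt(layers)) memory footprint, at the cost of one extra forward pass per segment, is what makes 3B-32B training feasible on commodity GPUs with limited memory.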

