
hgpu


Website title: High performance computing on graphics processing units | hgpu.org

Message frequency:  5.4 / week

Message History

Ensuring the correctness of compiler optimizations is critical, but existing fuzzers struggle to test optimizations effectively. First, most fuzzers use optimization pipelines (heuristics-based, fixed sequences of passes) as their harness. The phase-ordering problem can enable or preempt transformations, so pipelines inevitably miss optimization interactions; moreover, many o...
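The phase-ordering problem mentioned above can be illustrated with a toy example (a hedged sketch, not taken from the paper): two simplification "passes" over a tiny expression IR, where the order in which they run determines whether the final simplification fires at all.

```python
# Toy illustration of the phase-ordering problem: two simplification
# "passes" over arithmetic expressions represented as nested tuples,
# e.g. ("add", ("mul", "x", 1), 0). Running the passes in different
# orders yields different final results.

def constant_fold_identity(expr):
    """Pass A: rewrite x*1 -> x and x+0 -> x (one bottom-up sweep)."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = constant_fold_identity(a), constant_fold_identity(b)
    if op == "mul" and b == 1:
        return a
    if op == "add" and b == 0:
        return a
    return (op, a, b)

def reassociate(expr):
    """Pass B: rewrite (x + a) + b with constants a, b into x + (a+b)."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = reassociate(a), reassociate(b)
    if (op == "add" and isinstance(a, tuple) and a[0] == "add"
            and isinstance(a[2], int) and isinstance(b, int)):
        return ("add", a[1], a[2] + b)
    return (op, a, b)

expr = ("add", ("add", ("mul", "x", 1), 2), -2)

# Order 1: fold identities first, then reassociate. The "x + 0" created
# by reassociation survives, because the identity pass already ran.
order1 = reassociate(constant_fold_identity(expr))

# Order 2: reassociate first (creating x + 0), then fold -> just "x".
order2 = constant_fold_identity(reassociate(expr))

print(order1)  # ("add", "x", 0)
print(order2)  # "x"
```

A fuzzer that only ever drives a fixed pipeline sees one of these two orders, so interactions like the second one go untested.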


We present tritonBLAS, a fast and deterministic analytical model that uses architectural parameters like the cache hierarchy, and relative code and data placement to generate performant GPU GEMM kernels. tritonBLAS explicitly models the relationship between architectural topology, matrix shapes, and algorithmic blocking behavior to predict near-optimal configurations without ...
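The flavor of such an analytical model can be sketched in a few lines (this is a toy stand-in, not the actual tritonBLAS model): pick the largest square GEMM tile whose A, B, and C working set fits a given fast-memory budget, since larger tiles mean more reuse per byte loaded.

```python
# Toy analytical tile-size model (a sketch, not tritonBLAS itself):
# choose the largest square tile T such that the A, B, and C tiles
# are simultaneously resident within the fast-memory budget.

def pick_tile(shared_mem_bytes, elem_bytes=4, candidates=(16, 32, 64, 128)):
    """Return the largest tile T with 3 * T*T * elem_bytes <= budget."""
    best = None
    for t in candidates:
        footprint = 3 * t * t * elem_bytes  # A, B, C tiles resident at once
        if footprint <= shared_mem_bytes:
            best = t
    return best

# With a 48 KiB shared-memory budget (a common per-block figure on
# NVIDIA GPUs), a 64x64 fp32 tile fits exactly: 3*64*64*4 = 49152 bytes.
print(pick_tile(48 * 1024))  # -> 64
```

The point of a deterministic model like this is that no autotuning runs are needed: the configuration follows directly from architectural parameters and matrix shapes.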


As GPU architectures rapidly evolve to meet the growing demands of exascale computing and machine learning, the performance implications of architectural innovations remain poorly understood across diverse workloads. NVIDIA’s Blackwell (B200) generation introduces significant architectural advances including the 5th generation tensor cores, tensor memory (TMEM), decompressi...


We present hls4ml, a free and open-source platform that translates machine learning (ML) models from modern deep learning frameworks into high-level synthesis (HLS) code that can be integrated into full designs for field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). With its flexible and modular design, hls4ml supports a large number of...
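The core idea behind such ML-to-HLS translation can be sketched with a toy generator (this is an illustration of the concept, not hls4ml's actual code generator): bake a trained dense layer's weights into a C function as compile-time constants, which an HLS tool can then unroll into parallel hardware.

```python
# Toy sketch of the ML-to-HLS idea (not hls4ml's real codegen):
# emit a C function that hard-codes the weights and bias of a single
# trained dense layer as static constants.

def dense_layer_to_c(weights, bias, name="dense0"):
    """weights: list of output rows, each a list of input coefficients."""
    n_out, n_in = len(weights), len(weights[0])
    w_rows = ",\n    ".join(
        "{" + ", ".join(f"{w}f" for w in row) + "}" for row in weights
    )
    b_vals = ", ".join(f"{b}f" for b in bias)
    return f"""\
void {name}(const float x[{n_in}], float y[{n_out}]) {{
    static const float W[{n_out}][{n_in}] = {{
    {w_rows}}};
    static const float B[{n_out}] = {{{b_vals}}};
    for (int o = 0; o < {n_out}; o++) {{
        float acc = B[o];
        for (int i = 0; i < {n_in}; i++)
            acc += W[o][i] * x[i];
        y[o] = acc;
    }}
}}"""

code = dense_layer_to_c([[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2])
print(code)
```

Real tools like hls4ml additionally handle quantization to fixed point, layer fusion, and vendor-specific pragmas, which this sketch omits.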


Machine-learning (ML) applications frequently utilize high-performance ML kernels to execute tensor operations like matrix product and softmax. An ML kernel can be decomposed into two components: the implicit algorithm, which defines the tensor operation that computes the output tensor, and the schedule, which defines how the operation is implemented. The schedule of an ML ke...
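The algorithm/schedule split described above can be made concrete with a small sketch (an illustration under the abstract's own definitions, not code from the paper): two functions implementing the same implicit algorithm, matrix product, under two different schedules, a plain triple loop and a tiled loop nest. They must produce identical outputs while differing in memory-access pattern.

```python
# Same implicit algorithm (matrix product), two schedules.

def matmul_ijk(A, B):
    """Schedule 1: plain i-j-k triple loop."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

def matmul_tiled(A, B, T=2):
    """Schedule 2: loops tiled by T for cache-friendly access."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, T):          # iterate over output tiles
        for j0 in range(0, m, T):
            for p0 in range(0, k, T):  # accumulate over k in tiles
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, m)):
                        for p in range(p0, min(p0 + T, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
assert matmul_ijk(A, B) == matmul_tiled(A, B)  # one algorithm, two schedules
```

Performance engineering for ML kernels is largely a search over such schedules while the algorithm, and hence the output, stays fixed.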
