RAD Open-Source Projects
AMD has an enduring commitment to advance the state of the art through contributions to open-source software. AMD Research and Development (RAD) continues this tradition with contributions such as implementations of open standards and task-driven computing for GPUs and CPUs. These contributions are intended to give programmers the tools required to extract the maximum performance from today’s complex computer architectures and to encourage other researchers to build on these tools to develop high-performance systems of tomorrow. AMD invites you to contribute to the world of open-source software.
Featured Projects
Brevitas
Brevitas is a quantization library for PyTorch that supports integer and minifloat datatypes while applying the restrictions required by a target inference toolchain. Brevitas offers multiple entry points with varying levels of control, making it usable by quantization novices and experts alike. It supports various post-training quantization (PTQ) algorithms as well as quantization-aware training (QAT), giving a machine learning engineer many levers for achieving maximum accuracy at minimal bit widths. After quantization, Brevitas-quantized models can be exported to standard formats, including ONNX and TorchScript. Brevitas also interoperates with various Hugging Face tools (in particular, Optimum-AMD) for seamless integration with models on the Hugging Face Hub.
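The core operation that a library like Brevitas automates can be illustrated with a minimal sketch of symmetric uniform integer quantization in plain Python; this is purely illustrative and unrelated to Brevitas's actual quantizers, which are PyTorch modules with many more options:

```python
def quantize(values, bit_width):
    """Fake-quantize values onto a symmetric signed integer grid of the
    given bit width, then dequantize back to floats (illustrative only)."""
    qmax = 2 ** (bit_width - 1) - 1                # e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax     # map the largest value to qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

x = [0.7, -1.2, 0.05, 2.0]
x8 = quantize(x, 8)   # near-lossless at 8 bits
x3 = quantize(x, 3)   # visibly coarse at 3 bits: only 8 representable levels
```

Lowering the bit width shrinks hardware cost but adds rounding error; recovering the lost accuracy is precisely what PTQ and QAT techniques address.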
FINN
FINN is a machine learning framework from the Integrated Communications and AI Lab of AMD Research & Advanced Development. It provides an end-to-end flow for exploring and implementing quantized neural network inference solutions on FPGAs. FINN generates dataflow architectures that realize the custom network physically in space. It is not a generic DNN acceleration solution; instead, it relies on co-design and design-space exploration, tuning quantization and parallelization to optimize a solution with respect to resource and performance requirements.
PYNQ
PYNQ™ is an open-source project from AMD® that makes it easier to use Adaptive Computing platforms.
Using the Python language, Jupyter notebooks, and the huge ecosystem of Python libraries, designers can exploit the benefits of programmable logic and microprocessors to build more capable and exciting electronic systems.
Riallto
Riallto is an open-source exploration framework for first-time users of the AMD Ryzen AI Neural Processing Unit (NPU).
ACCL
ACCL is a versatile open-source FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, and RDMA, ACCL empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. It can also serve as a collective offload engine for CPU/GPU applications, freeing the compute engine from executing networking tasks. ACCL is user-extensible, allowing new collectives to be implemented and deployed without re-synthesizing the FPGA circuit. It serves as key infrastructure for application acceleration at the Paderborn and ETH Zurich HACCs, with use cases including distributed DLRM inference (up to 10 FPGAs) and shallow-water simulation (up to 48 FPGAs).
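To illustrate what a collective does, here is a hardware-free simulation of one classic All-Reduce algorithm, a reduce-scatter followed by an all-gather around a ring of ranks. It sketches the algorithmic pattern only and bears no relation to ACCL's actual FPGA implementation:

```python
def ring_allreduce(bufs):
    """Simulate a ring All-Reduce over n ranks, where each rank's buffer
    is split into n chunks and data moves only to the right neighbour."""
    n = len(bufs)
    bufs = [list(b) for b in bufs]
    # Reduce-scatter: after n-1 steps, rank r owns the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, bufs[r][(r - step) % n]) for r in range(n)]
        for r, c, val in sends:           # neighbour accumulates the received chunk
            bufs[(r + 1) % n][c] += val
    # All-gather: circulate the finished chunks so every rank holds every sum.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, bufs[r][(r + 1 - step) % n]) for r in range(n)]
        for r, c, val in sends:           # neighbour overwrites with the final chunk
            bufs[(r + 1) % n][c] = val
    return bufs

# Three ranks; every rank ends up with the elementwise sum [6, 6, 6].
result = ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
```

The appeal of the ring variant is that each rank only ever talks to one neighbour, which maps naturally onto point-to-point links between FPGAs.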
LogicNets
LogicNets is a PyTorch-based library for co-designing neural network topology and hardware for extreme-throughput applications. The key idea is that dot-products can be mapped directly to lookup tables (LUTs), an abundant resource on FPGAs. LogicNets includes tools to define sparse, activation-quantized neural networks amenable to LUT mapping, as well as tools to translate the resulting trained network directly to Verilog. LogicNets networks are fully unrolled, meaning they produce an inference result every clock cycle, and can achieve competitive accuracy at over 600 million inferences per second using only ~100 LUTs on several public datasets.
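The LUT-mapping idea can be sketched in a few lines of Python: a neuron with a handful of low-precision inputs has a small, finite input space, so its entire function can be enumerated into a truth table. The weights, threshold, and bit widths below are arbitrary toy values, not taken from LogicNets:

```python
from itertools import product

# A toy neuron: 3 inputs, each quantized to 2 bits (values 0..3),
# fixed integer weights, and a 1-bit (binary) step activation.
weights = [2, -1, 3]
threshold = 4

def neuron(inputs):
    """Reference computation: dot-product followed by a step activation."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# Enumerate all 4**3 = 64 input combinations into a truth table -- this
# table is what a hardware LUT stores, replacing the arithmetic entirely.
lut = {inputs: neuron(inputs) for inputs in product(range(4), repeat=3)}

# The table reproduces the neuron exactly on every possible input.
assert all(lut[i] == neuron(i) for i in product(range(4), repeat=3))
```

Because the table is exact, the hardware performs no multiplications at inference time; sparsity and aggressive activation quantization keep each neuron's input space small enough for this enumeration to stay tractable.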
RecoNIC
RecoNIC is an open-source 100 Gb/s FPGA-based SmartNIC infrastructure/testbed that incorporates RDMA and compute-offloading capabilities. It features an RDMA offloading engine designed for high-throughput, low-latency data transfers. This engine is shared by both host CPUs and compute accelerators, which makes RecoNIC a very flexible platform. Moreover, developers can implement their accelerators in RTL, HLS, or Vitis Networking P4 within RecoNIC's programmable compute blocks. These compute blocks can access local host/device memory as well as host/device memory in remote peers through the RDMA offloading engine.
QONNX
QONNX is a set of custom operators and tools for working with arbitrary-precision uniform quantization in ONNX, co-maintained with the Fast ML for Science collective. These custom operators make it possible to represent various types of quantization and to apply quantization to any parameter or operator. The QONNX repository itself includes Python utilities for working with QONNX models, such as executing models, applying optimizations, summarizing inference cost, and defining executable custom operations. QONNX is used by several NN-to-FPGA toolchains, including FINN and hls4ml.
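A rough sketch of what a QONNX-style quantization node computes, under the common assumption that it maps a float onto an integer grid of a given bit width and then dequantizes back to float; the function below is a simplified illustration, not the library's actual operator implementation:

```python
def quant(x, scale, zero_point, bit_width, signed=True):
    """Assumed quantize-then-dequantize semantics of a uniform
    arbitrary-precision quantizer (illustrative sketch only)."""
    if signed:
        qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    else:
        qmin, qmax = 0, 2 ** bit_width - 1
    q = max(qmin, min(qmax, round(x / scale + zero_point)))  # quantize + clip
    return (q - zero_point) * scale                          # dequantize

# Values on the grid pass through exactly; out-of-range values clip.
on_grid = quant(0.5, scale=0.25, zero_point=0, bit_width=4)    # stays 0.5
clipped = quant(10.0, scale=0.25, zero_point=0, bit_width=4)   # clips to 7 * 0.25
```

Making scale, zero point, and bit width explicit inputs is what lets a single operator describe arbitrary-precision quantization anywhere in an ONNX graph.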
OpenNIC
The OpenNIC project provides an FPGA-based NIC platform for the open-source community. It consists of multiple components: a NIC shell, a Linux kernel driver, and a DPDK driver. The NIC shell contains the RTL sources and design files targeting several of the AMD-Xilinx Alveo boards featuring UltraScale+ FPGAs. OpenNIC has been used in research collaborations with leading universities, and it has been deployed by customers in the data center and other high-speed network environments. For example, the Department of Energy's Energy Sciences Network (ESnet) SmartNIC is based on a version of OpenNIC. The ESnet SmartNIC is deployed throughout ESnet6, the wide-area network for distributed science, and in the NSF FABRIC project connecting a dozen universities for collaborative research.
P2P
The P2P project extends the Coyote framework from the Systems Group at ETHZ to enable communication and control between GPUs and FPGAs. P2P allows AMD GPUs to launch kernels on FPGAs and FPGAs to access GPU memory directly, bypassing the CPU subsystem. The benefit of the work is increased throughput and lower latency for applications executing heterogeneous workloads. The work is currently being integrated by ETHZ into Coyote V2.
Omnitrace
Omnitrace is a comprehensive profiling and tracing tool for parallel applications written in C, C++, Fortran, HIP, OpenCL, and Python which execute on the CPU or CPU+GPU. It can gather the performance information of functions through any combination of binary instrumentation, call-stack sampling, user-defined regions, and Python interpreter hooks. Omnitrace supports interactive visualization of comprehensive traces in the web browser in addition to high-level summary profiles with mean/min/max/stddev statistics. In addition to runtimes, Omnitrace supports the collection of system-level metrics such as the CPU frequency, GPU temperature, and GPU utilization, process-level metrics such as the memory usage, page-faults, and context-switches, and thread-level metrics such as memory usage, CPU time, and numerous hardware counters.
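The instrumentation idea behind such tools can be sketched with Python's built-in profiling hook: a callback fires on every function call, letting the tool attribute work to functions. This toy call counter is purely illustrative and unrelated to Omnitrace's binary instrumentation or sampling machinery:

```python
import sys
from collections import Counter

call_counts = Counter()

def profiler(frame, event, arg):
    """Minimal deterministic instrumentation hook: count Python calls by name."""
    if event == "call":
        call_counts[frame.f_code.co_name] += 1

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.setprofile(profiler)   # install the hook
fib(10)                    # every recursive call is recorded
sys.setprofile(None)       # remove the hook
```

Real profilers trade off between this kind of exact instrumentation (complete but intrusive) and periodic call-stack sampling (approximate but low-overhead); Omnitrace offers both modes.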
ROC_SHMEM
The ROCm OpenSHMEM (ROC_SHMEM) runtime is part of an AMD Research initiative to provide GPU-centric distributed programming models that allow the GPU kernel to perform network operation directly without the intervention of the CPU in the critical path. ROC_SHMEM supports OpenSHMEM1.4, an industry-standard API that allows a GPU kernel to directly make network calls and move data between nodes without handing control back to a CPU. By removing the CPU from the critical path, ROC_SHMEM can reduce the kernel-launch overhead and maximize the computation-communication overlap. Intra-kernel networking can simplify application code and enables finer-grained overlap of communication and computation than traditional host-driven networking.