Thursday, March 5, 2026
South Korean AI chip startup HyperAccel is preparing to launch its Bertha 500 chip, an LLM inference accelerator designed for economical token generation in the data center. The company already has an FPGA-based server on the market, with both a data center chip and an edge chip imminent.
Startup competitors in this field have found success delivering very fast tokens, attacking a perceived weakness of incumbent GPU architectures: their single-user token speeds. Rather than compete directly on performance, however, HyperAccel bases its key value proposition on economics, Yongwoong Jung, chief strategy officer at HyperAccel, told EE Times.
“We are trying to be a more affordable provider… that’s why we chose LPDDR, which is only one-tenth of HBM’s bandwidth, but since we are utilizing that bandwidth twice as well as GPUs, and because of the architecture of our computation units, we can produce 5× more tokens per second [for the same amount of TOPS],” Jung said. “That’s how we overcome the weakness of our DRAM bandwidth, but we still achieve value for money; that’s our value proposition.”
Making better use of DRAM bandwidth means HyperAccel has perfectly good performance at human-readable speeds, a key target application for LLMs today. The most expensive GPUs are often overkill in this scenario, Jung said.
“Our approach is to reduce the cost, sacrificing a little performance if needed, but targeting a very large market,” he said. “For current GPU products, only big companies can use them because of the price.”
That said, even big companies such as OpenAI have requirements for cheaper hardware so they can service users still at the free tier, Jung said. The result will be an increasingly heterogeneous AI data center.
“We are not trying to replace GPUs for the entire world, we are trying to find our own sweet spot,” Jung said. “Whether it’s the prefill stage or decode stage, or it could be the bigger model or smaller model—we are trying to find the sweet spots.”
FPGA-based server
HyperAccel was founded by KAIST Professor Jooyoung Kim, along with a group of his students at the beginning of 2023. After presenting at Hot Chips in 2023, the group received an offer for their AI accelerator IP, but chose instead to become a chip company and decided to raise a seed round, HyperAccel cofounder Seungjae Moon told EE Times.
HyperAccel’s first product is an FPGA-based server, Orion, with the company’s AI accelerator chip IP. FPGAs are fairly resource-limited by AI standards, but Orion was sufficient to get the attention of some big tech companies such as Korean hyperscaler Naver Cloud, with whom the company now has a joint development agreement, Moon said.
“We wanted to understand their needs instead of just creating the highest spec product we can make,” he said.
The startup also has a partnership with LG to make an edge chip for on-device AI acceleration.
Architecture
The key difference between HyperAccel’s LPU and leading GPUs lies in its use of LPDDR instead of expensive HBM, compensating for the lower bandwidth by achieving around 90% memory bandwidth utilization. This is done largely by eliminating traditional memory hierarchies, Moon said. Further efficiencies come from specializing in inference and in transformer/LLM workloads.
“GPUs have a huge structural mismatch [with LLM inference],” Moon said. “When running LLM inference, they are only able to achieve around 45% memory bandwidth utilization because of their complex hierarchy—going from memory to compute cores needs to go all the way through the hierarchy. They also have too many compute units for what LLM inference needs, so they only achieve around 30% compute utilization. And because they are too highly spec’d [for inference], they have a high price.”
HyperAccel has closely matched memory bandwidth to compute so that data can be streamed in quickly rather than having to go through caches. Local memory units are exactly sized for LLM inference, and the instruction scheduling unit is able to stream all the AI model data without any stalling, Moon said.
GPUs also require data to be reformatted or reshaped between HBM and SRAM, Moon said, whereas HyperAccel stores formatted data in its DRAM, which can be loaded directly into compute, bypassing SRAM, and avoiding any back-and-forth. HyperAccel also uses one large compute core instead of many small cores. These architectural features mean the company can get more tokens per second from less compute—around 5× the tokens per second when normalized to the amount of compute power, relative to Nvidia Hopper-generation GPUs, Moon said.
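As a rough sanity check on the bandwidth argument above, memory-bound decode throughput scales with delivered bandwidth (peak bandwidth times utilization) divided by the bytes read per generated token. The sketch below uses illustrative numbers of my own choosing, not figures supplied by HyperAccel, apart from the 45% and 90% utilization rates quoted in the article:

```python
# Illustrative roofline estimate for memory-bound LLM decode at batch 1.
# Model size and peak-bandwidth figures here are assumptions for the
# arithmetic, not vendor specifications.

def decode_tokens_per_sec(peak_bw_gbs: float, utilization: float,
                          model_gb: float) -> float:
    """Each generated token streams all model weights once, so throughput
    is delivered bandwidth divided by model size."""
    return peak_bw_gbs * utilization / model_gb

model_gb = 7.0  # e.g. a hypothetical 7B-parameter model at INT8 (1 byte/weight)

# HBM-class GPU at the article's quoted 45% utilization (assumed 3,350 GB/s peak)
gpu_like = decode_tokens_per_sec(3350, 0.45, model_gb)
# LPDDR5x device at the article's quoted 90% utilization and 560 GB/s peak
lpddr_like = decode_tokens_per_sec(560, 0.90, model_gb)

print(f"GPU-like:   {gpu_like:.0f} tokens/s")
print(f"LPDDR-like: {lpddr_like:.0f} tokens/s")
```

The point is not the absolute numbers but the ratio: doubling utilization recovers a factor of two of the raw-bandwidth gap, and the remaining gap is what the compute-efficiency and cost arguments have to cover.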
Bertha 500 has taped out on Samsung 4 nm. It will offer 768 INT8 TOPS (with FP16 and other 16-, 8-, and 4-bit formats also supported) from 32 LPU cores with 256 MB of SRAM; there are also four Arm Cortex-A53 cores on the chip. DRAM bandwidth is 560 GB/s over eight channels of LPDDR5x, and batch sizes up to 1024 are supported.
The result should be around 20× the throughput per dollar versus an Nvidia H100 (the cost will be around one-tenth) and around 5× the power efficiency. Bertha 500 will run on around 250 W.
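The quoted ratios are roughly self-consistent, which can be checked with a little arithmetic. The only figure below not taken from the article is the H100 power draw, assumed at roughly 700 W for illustration:

```python
# Rough consistency check of the claimed ratios (illustrative only).
# Assumed reference figure, not from the article: H100 at ~700 W.

rel_cost = 1 / 10        # Bertha 500 cost relative to an H100 (from article)
perf_per_dollar = 20     # claimed throughput-per-dollar advantage (from article)

# 20x the throughput per dollar at one-tenth the cost implies ~2x raw throughput
rel_throughput = perf_per_dollar * rel_cost

bertha_w, h100_w = 250, 700
perf_per_watt = rel_throughput / (bertha_w / h100_w)

print(f"Implied raw throughput vs H100: {rel_throughput:.1f}x")
print(f"Implied perf/W advantage:       {perf_per_watt:.1f}x")
```

Under these assumptions the implied performance-per-watt advantage lands in the same ballpark as the roughly 5× power efficiency the company claims.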
Future generations of the architecture may look at processor-in-memory technologies to help the decode stage get even closer to memory, Moon said.
System and software
For large models, accelerator-to-accelerator communication is required. GPUs can connect directly to each other using protocols like NVLink, but since they are programmed with kernels, they also require a runtime system call, which means there still has to be some communication with the host CPU. HyperAccel’s architecture doesn’t require any intervention from the host, since the chip already knows where and when memory transfers need to happen, a side effect of being LLM-specific. The transfer is controlled by a memory controller on the chip.
HyperAccel’s ESLink (expandable synchronization link, analogous to NVLink), which connects accelerator chips, can overlap communication and computation, because it knows when everything needs to happen. This makes for more scalability, Moon said.
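The benefit of a statically known schedule can be illustrated with a toy timing model, not tied to any actual ESLink or NVLink figures: if the transfer for one layer can run while the next layer computes, each layer costs the maximum of its compute and communication times rather than their sum.

```python
# Toy model of comm/compute overlap across transformer layers (illustrative).
# t_compute and t_comm are arbitrary per-layer times in the same units.

def total_time(layers: int, t_compute: float, t_comm: float,
               overlap: bool) -> float:
    if overlap:
        # Steady state with a known schedule: transfers hide behind compute,
        # so each layer costs max(compute, comm).
        return layers * max(t_compute, t_comm)
    # Serialized: compute a layer, then wait for its activation transfer.
    return layers * (t_compute + t_comm)

layers = 32
print(total_time(layers, 1.0, 0.5, overlap=False))  # 48.0
print(total_time(layers, 1.0, 0.5, overlap=True))   # 32.0
```

In this sketch, as long as communication per layer is shorter than computation, overlap hides it entirely, which is why knowing the schedule ahead of time translates into better scalability.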
HyperAccel’s software stack supports all models in the Hugging Face repository, as well as the vLLM inference serving engine. The company is working on a domain-specific language (DSL) it calls Legato, which will give developers access to the lower levels of the stack. Once Bertha 500 is released, AI agents will also be available to help developers with this, including with learning Legato, Moon said.
Edge SoC coming
As well as Bertha 500, HyperAccel is also creating a scaled-down edge version for applications including automotive, consumer electronics, and robotics, as part of a joint development agreement with LG Electronics. This chip will be able to handle text-to-speech or speech-to-text models, for example.
The SoC jointly developed with LG will pair HyperAccel’s accelerator IP with some of LG’s in-house IP (potentially blocks like PHY and memory controller IP) and an Arm Cortex-A55, with LG providing backend services and HyperAccel providing design services. (This is the first time LG has provided backend services to a third party, Jung said.) HyperAccel will also sell this chip into edge applications beyond LG. Dubbed Bertha 100 (the numbers in the product names refer to memory bandwidth, not compute cores), the SoC will use two channels of LPDDR5x. Samples are due in the fourth quarter of 2026, and the accelerator will come on an M.2 card.
HyperAccel has raised $45 million so far and has a team of 77. Bertha 500 samples are due around the end of the first quarter of 2026, with mass production due to start in early 2027.
By: DocMemory Copyright © 2023 CST, Inc. All Rights Reserved