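1. Introduction
This article continues testing the HSP hardware acceleration of the STM32U3C5. It checks whether a lightweight face detection model can be deployed within the resource limits of an MCU, and compares the CPU backend with the HSP backend in inference speed, output consistency, and final detection boxes.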
2. Test platform and software environment
| Item | Value |
| --- | --- |
| MCU | STM32U3C5 |
| CPU | Arm Cortex-M33 |
| Clock frequency | 96 MHz |
| AI toolchain | ST Edge AI |
| Accelerator | HSP |
| Model | BlazeFace 128×128 |
| Input format | 128×128×3 float32 RGB |
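3. Model conversion
- Install the ST Edge AI tool by clicking Next through the installer, then add the Utilities directory to the Path environment variable: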
$Env:Path += ";F:\Users\legen\AppData\Local\Programs\STEdgeAI\4.0\Utilities\windows"
Verify the installation:
PS: > stedgeai --version
ST Edge AI Core v4.0.1-20581 7ed50de05
ISPU 2.0.1-RC2
MLC 1.2.4-RC2
StellarAI 4.0.1-RC2
STM32CubeAI 12.0.1-RC2
- This test uses a lightweight FaceDetect model. The input is a 128×128 RGB image of type float32, so the input size is:
128 × 128 × 3 × 4 bytes = 196,608 bytes
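Basic model information:
| Item | Value |
| --- | --- |
| Input size | 128×128×3 |
| Input type | float32 |
Download the model from st-model-zoo and convert it with stedgeai. Both an HSP version and a CPU version are generated so that accuracy and performance can be compared later: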
# hsp
stedgeai.exe generate -m blazeface_front_128_int8.tflite --target stm32u3 --c-api st-ai -O time --hsp 4096
# cpu
stedgeai.exe generate -m blazeface_front_128_int8.tflite --target stm32u3 --c-api st-ai -O time
Statistics after conversion:
| Item | Value |
| --- | --- |
| MACC | 31,849,356 |
| Weight size | 107,956 B |
| Output size | 60,932 B |
| CPU backend nodes | 110 |
| HSP backend nodes | 94 |
| CPU activation | 328,056 B |
| HSP activation | 327,680 B |
In terms of node count, the execution graph that ST Edge AI generates for the HSP backend has 94 nodes, down from 110 in the CPU version. Both versions have the same MACC count, 31,849,356.
HSP and CPU conversion results:


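4. Code
4.1 Data preprocessing
The project is based on STM32U3-GettingStarted-HumanActivityRecognition. To make the test easy to repeat, the image and the anchors are first converted into C arrays. The image conversion script: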
from pathlib import Path

from PIL import Image


def write_c(img: Image.Image, output: Path, source_name: str) -> None:
    w, h = img.size
    pixels = list(img.getdata())
    floats: list[float] = []
    # Normalize each 8-bit RGB channel to [0, 1].
    for r, g, b in pixels:
        floats.extend([r / 255.0, g / 255.0, b / 255.0])
    output.parent.mkdir(parents=True, exist_ok=True)
    with output.open("w", encoding="utf-8", newline="\n") as f:
        f.write("/* Auto-generated by tools/image_to_c.py. Do not edit manually. */\n")
        f.write(f"/* Source image: {source_name} */\n")
        f.write('#include "facedetect_assets.h"\n\n')
        f.write("#if defined(__GNUC__)\n")
        f.write("#define FD_ALIGN(x) __attribute__((aligned(x)))\n")
        f.write("#define FD_SECTION(name) __attribute__((section(name)))\n")
        f.write("#else\n#define FD_ALIGN(x)\n#define FD_SECTION(name)\n#endif\n\n")
        f.write('const uint32_t g_fd_image_width FD_SECTION(".fd_image_rodata") = FD_INPUT_W;\n')
        f.write('const uint32_t g_fd_image_height FD_SECTION(".fd_image_rodata") = FD_INPUT_H;\n')
        f.write('const float g_fd_image_rgb128_f32[FD_IMAGE_FLOAT_COUNT] FD_ALIGN(32) FD_SECTION(".fd_image_rodata") = {\n')
        # Six floats (two RGB pixels) per output line.
        for i, v in enumerate(floats):
            f.write(f"{v:.8f}f,")
            if (i + 1) % 6 == 0:
                f.write("\n")
            else:
                f.write(" ")
        f.write("\n};\n")
Anchor generation:
from pathlib import Path


def generate_anchors() -> list[tuple[float, float]]:
    anchors: list[tuple[float, float]] = []
    # 16x16 grid, 2 anchors per cell.
    for y in range(16):
        for x in range(16):
            for _ in range(2):
                anchors.append(((x + 0.5) / 16.0, (y + 0.5) / 16.0))
    # 8x8 grid, 6 anchors per cell.
    for y in range(8):
        for x in range(8):
            for _ in range(6):
                anchors.append(((x + 0.5) / 8.0, (y + 0.5) / 8.0))
    assert len(anchors) == 896
    return anchors


def write_c(output: Path, anchors: list[tuple[float, float]]) -> None:
    output.parent.mkdir(parents=True, exist_ok=True)
    with output.open("w", encoding="utf-8", newline="\n") as f:
        f.write("/* Auto-generated by tools/generate_anchors.py. Do not edit manually. */\n")
        f.write('#include "facedetect_assets.h"\n\n')
        f.write("#if defined(__GNUC__)\n")
        f.write("#define FD_ALIGN(x) __attribute__((aligned(x)))\n")
        f.write("#define FD_SECTION(name) __attribute__((section(name)))\n")
        f.write("#else\n#define FD_ALIGN(x)\n#define FD_SECTION(name)\n#endif\n\n")
        f.write('const uint32_t g_fd_anchor_count FD_SECTION(".fd_anchor_rodata") = FD_ANCHOR_COUNT;\n')
        f.write('const fd_anchor_t g_fd_anchors[FD_ANCHOR_COUNT] FD_ALIGN(32) FD_SECTION(".fd_anchor_rodata") = {\n')
        # Four anchors per output line.
        for i, (cx, cy) in enumerate(anchors):
            f.write(f"  {{{cx:.9f}f, {cy:.9f}f}},")
            f.write("\n" if (i + 1) % 4 == 0 else " ")
        f.write("};\n")
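Both generated files include facedetect_assets.h, which is not listed in this article. The sketch below shows what such a header could look like, reconstructed only from the names the generated code uses. The macro FD_INPUT_C and the exact field layout of fd_anchor_t are assumptions; the compile-time checks restate the sizes given in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed contents of facedetect_assets.h (hypothetical reconstruction). */
#define FD_INPUT_W            128u
#define FD_INPUT_H            128u
#define FD_INPUT_C            3u   /* hypothetical name: RGB channels */
#define FD_IMAGE_FLOAT_COUNT  (FD_INPUT_W * FD_INPUT_H * FD_INPUT_C)
#define FD_ANCHOR_COUNT       896u

/* One anchor centre in normalized [0, 1] coordinates, matching the
 * {cx, cy} initializers written by the anchor script. */
typedef struct
{
  float cx;
  float cy;
} fd_anchor_t;

/* Compile-time checks of the numbers quoted in the article. */
static_assert(FD_IMAGE_FLOAT_COUNT == 49152u, "128 x 128 x 3 floats");
static_assert(FD_IMAGE_FLOAT_COUNT * sizeof(float) == 196608u, "input bytes");
static_assert((16u * 16u * 2u) + (8u * 8u * 6u) == FD_ANCHOR_COUNT, "anchor count");
static_assert(sizeof(fd_anchor_t) == 8u, "two packed floats per anchor");

/* Symbols defined by the two generated C files. */
extern const uint32_t g_fd_image_width;
extern const uint32_t g_fd_image_height;
extern const float g_fd_image_rgb128_f32[FD_IMAGE_FLOAT_COUNT];
extern const uint32_t g_fd_anchor_count;
extern const fd_anchor_t g_fd_anchors[FD_ANCHOR_COUNT];
```

With this layout the anchor table takes 896 × 8 = 7,168 bytes and the image array takes 196,608 bytes.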
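4.2 Flash and SRAM partitioning
Because the input image, the anchor data, and the activation buffer are all fairly large, Flash and SRAM are partitioned in the linker (ld) script.
- Flash regions
The test image and the anchor data are placed in dedicated Flash regions:
| Region | Address | Purpose |
| --- | --- | --- |
| FD_IMAGE_FLASH | 0x081B8000 | 128×128×3 float32 image array |
| FD_ANCHOR_FLASH | 0x081F8000 | 896 anchors |
| FLASH | from 0x08000000 | Program code, model weights, read-only data |
The runtime log: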

The image array contains 49,152 floats:
128 × 128 × 3 = 49,152
The anchor array contains 896 anchors, covering the two BlazeFace output scales:
16×16×2 + 8×8×6 = 512 + 384 = 896
Linker script settings:
.fd_image_rodata : ALIGN(32)
{
  . = ALIGN(32);
  __fd_image_flash_start__ = .;
  *(.fd_image_rodata)
  *(.fd_image_rodata*)
  . = ALIGN(32);
  __fd_image_flash_end__ = .;
} >FD_IMAGE_FLASH
ASSERT(SIZEOF(.fd_image_rodata) <= LENGTH(FD_IMAGE_FLASH), "FaceDetect image data overflow FD_IMAGE_FLASH")
.fd_anchor_rodata : ALIGN(32)
{
  . = ALIGN(32);
  __fd_anchor_flash_start__ = .;
  *(.fd_anchor_rodata)
  *(.fd_anchor_rodata*)
  . = ALIGN(32);
  __fd_anchor_flash_end__ = .;
} >FD_ANCHOR_FLASH
ASSERT(SIZEOF(.fd_anchor_rodata) <= LENGTH(FD_ANCHOR_FLASH), "FaceDetect anchor data overflow FD_ANCHOR_FLASH")
- SRAM region
The activation buffer is placed in the AI_RAM region of SRAM:
| Backend | Activation address | Activation size |
| --- | --- | --- |
| HSP | 0x2003F000 | 327,680 B |
| CPU | 0x2003F000 | 328,056 B |
HSP runtime log:

CPU runtime log:

Linker script settings:
.ai_activations (NOLOAD) : ALIGN(32)
{
  . = ALIGN(32);
  __ai_activations_start__ = .;
  KEEP(*(.ai_activations))
  KEEP(*(.ai_activations*))
  KEEP(*(.AI_RAM))
  KEEP(*(.AI_RAM*))
  . = ALIGN(32);
  __ai_activations_end__ = .;
} >AI_RAM
ASSERT(SIZEOF(.ai_activations) <= LENGTH(AI_RAM), "AI activations overflow AI_RAM")
ASSERT(__ai_activations_end__ <= ORIGIN(AI_RAM) + LENGTH(AI_RAM), "AI activations exceed AI_RAM end")
.HSP_DATA_BRAM :
{
  __section_static_hsp_data_bram_start__ = .;
  *(HSP_DATA_BRAM)
  __section_static_hsp_data_bram_end__ = .;
} >HSP_DATA_BRAM
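The .ai_activations output section only collects input sections of that name, so the buffer has to be assigned to it from C. The following is a sketch of how g_net_activations might be declared, assuming GCC-style attributes. The FD_ACT_* names are hypothetical, and the 327680-byte size is the HSP activation size from the table, standing in for STAI_NETWORK_ACTIVATIONS_SIZE_BYTES from the generated header.

```c
#include <stdint.h>

/* Placeholder for STAI_NETWORK_ACTIVATIONS_SIZE_BYTES (HSP build). */
#define FD_ACT_BYTES 327680u

/* Hypothetical helper macro: place the buffer in the .ai_activations input
 * section (collected into AI_RAM by the linker script) and align it to
 * 32 bytes, matching ALIGN(32) in the script. */
#if defined(__GNUC__) && defined(__ELF__)
#define FD_ACT_ATTR __attribute__((section(".ai_activations"), aligned(32)))
#elif defined(__GNUC__)
#define FD_ACT_ATTR __attribute__((aligned(32)))
#else
#define FD_ACT_ATTR
#endif

/* The section is NOLOAD, so startup code does not initialize it; the
 * benchmark clears the buffer with memset before use. */
uint8_t g_net_activations[FD_ACT_BYTES] FD_ACT_ATTR;
```

Because both backends report the same activation address (0x2003F000), one declaration of this kind sized by the generated macro would serve either build.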
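4.3 ST Edge AI inference
The generated network.h shows that, as with most neural-network runtimes, the st-ai C API follows the usual flow:
init -> set_input -> set_output -> infer -> get_output -> deinit
Following the example project, the face detection benchmark is implemented as follows: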
void FaceDetect_Bench_RunOnce(void)
{
  stai_network *net = (stai_network *)g_net_ctx;
  stai_return_code rc;
  stai_ptr activations[STAI_NETWORK_ACTIVATIONS_NUM] = { (stai_ptr)g_net_activations };
  stai_ptr inputs[STAI_NETWORK_IN_NUM] = { 0 };
  stai_ptr outputs[STAI_NETWORK_OUT_NUM] = { 0 };
  stai_size n_inputs = 0;
  stai_size n_outputs = 0;
  uint64_t sum_cycles = 0;
  uint32_t min_cycles = 0xFFFFFFFFu;
  uint32_t max_cycles = 0;
  uint32_t hclk_hz = HAL_RCC_GetHCLKFreq();

  printf("\r\n[FD] FaceDetect benchmark start\r\n");
  printf("CSV,fdbench,backend,%s\r\n", bench_backend_name());
  printf("CSV,fdbench,hclk_hz,%lu\r\n", (unsigned long)hclk_hz);
  printf("CSV,fdbench,nodes,%lu,macc,%lu,act_bytes,%lu,weights_bytes,%lu,input_bytes,%lu,output_bytes,%lu\r\n",
         (unsigned long)STAI_NETWORK_NODES_NUM,
         (unsigned long)STAI_NETWORK_MACC_NUM,
         (unsigned long)STAI_NETWORK_ACTIVATIONS_SIZE_BYTES,
         (unsigned long)STAI_NETWORK_WEIGHTS_SIZE_BYTES,
         (unsigned long)STAI_NETWORK_IN_SIZE_BYTES,
         (unsigned long)STAI_NETWORK_OUT_SIZE_BYTES);
  printf("CSV,fdbench,activation_addr,0x%08lx,size,%lu\r\n",
         (unsigned long)(uintptr_t)g_net_activations,
         (unsigned long)STAI_NETWORK_ACTIVATIONS_SIZE_BYTES);
  printf("CSV,fdbench,image_addr,0x%08lx,width,%lu,height,%lu,float_count,%lu\r\n",
         (unsigned long)(uintptr_t)g_fd_image_rgb128_f32,
         (unsigned long)g_fd_image_width,
         (unsigned long)g_fd_image_height,
         (unsigned long)FD_IMAGE_FLOAT_COUNT);
  printf("CSV,fdbench,anchor_addr,0x%08lx,count,%lu\r\n",
         (unsigned long)(uintptr_t)g_fd_anchors,
         (unsigned long)g_fd_anchor_count);

  memset(g_net_ctx, 0, sizeof(g_net_ctx));
  memset((void *)g_net_activations, 0, STAI_NETWORK_ACTIVATIONS_SIZE_BYTES);

  rc = stai_network_init(net);
  if (rc != STAI_SUCCESS)
  {
    printf("CSV,fdbench,error,stai_network_init,%ld\r\n", (long)rc);
    return;
  }
  rc = stai_network_set_activations(net, activations, STAI_NETWORK_ACTIVATIONS_NUM);
  if (rc != STAI_SUCCESS)
  {
    printf("CSV,fdbench,error,stai_network_set_activations,%ld\r\n", (long)rc);
    return;
  }
  rc = stai_network_get_inputs(net, inputs, &n_inputs);
  if ((rc != STAI_SUCCESS) || (n_inputs != STAI_NETWORK_IN_NUM) || (inputs[0] == 0))
  {
    printf("CSV,fdbench,error,stai_network_get_inputs,%ld,%lu\r\n", (long)rc, (unsigned long)n_inputs);
    return;
  }
  rc = stai_network_get_outputs(net, outputs, &n_outputs);
  if ((rc != STAI_SUCCESS) || (n_outputs != STAI_NETWORK_OUT_NUM))
  {
    printf("CSV,fdbench,error,stai_network_get_outputs,%ld,%lu\r\n", (long)rc, (unsigned long)n_outputs);
    return;
  }
#if FACEDETECT_ENABLE_NODE_TRACE
  rc = stai_network_set_callback(net, (stai_event_cb)fd_stai_node_trace_cb, NULL);
  if (rc != STAI_SUCCESS)
  {
    printf("CSV,fdbench,error,stai_network_set_callback,%ld\r\n", (long)rc);
    return;
  }
  printf("CSV,fdbench,node_trace,enabled,max_buffers,%lu\r\n",
         (unsigned long)FACEDETECT_NODE_TRACE_MAX_BUFFERS);
#else
  printf("CSV,fdbench,node_trace,disabled\r\n");
#endif
  printf("CSV,fdbench,io_addr,input0,0x%08lx,out0,0x%08lx,out1,0x%08lx,out2,0x%08lx,out3,0x%08lx\r\n",
         (unsigned long)(uintptr_t)inputs[0],
         (unsigned long)(uintptr_t)outputs[0],
         (unsigned long)(uintptr_t)outputs[1],
         (unsigned long)(uintptr_t)outputs[2],
         (unsigned long)(uintptr_t)outputs[3]);
  if (STAI_NETWORK_IN_1_SIZE != FD_IMAGE_FLOAT_COUNT)
  {
    printf("CSV,fdbench,error,input_size_mismatch,model,%lu,image,%lu\r\n",
           (unsigned long)STAI_NETWORK_IN_1_SIZE,
           (unsigned long)FD_IMAGE_FLOAT_COUNT);
    return;
  }

  dwt_counter_init();
  for (uint32_t i = 0; i < FACEDETECT_BENCH_WARMUP; ++i)
  {
    copy_flash_image_to_input((float *)inputs[0], STAI_NETWORK_IN_1_SIZE);
    printf("CSV,fdbench,run_begin,phase,warmup,index,%lu\r\n", (unsigned long)i);
    rc = stai_network_run(net, (stai_run_mode)0);
    printf("CSV,fdbench,run_end,phase,warmup,index,%lu,rc,%ld\r\n", (unsigned long)i, (long)rc);
    if (rc != STAI_SUCCESS)
    {
      printf("CSV,fdbench,error,warmup_run,%lu,%ld\r\n", (unsigned long)i, (long)rc);
      return;
    }
  }
  for (uint32_t i = 0; i < FACEDETECT_BENCH_ITERS; ++i)
  {
    copy_flash_image_to_input((float *)inputs[0], STAI_NETWORK_IN_1_SIZE);
    printf("CSV,fdbench,run_begin,phase,bench,index,%lu\r\n", (unsigned long)i);
    uint32_t t0 = dwt_counter_get();
    rc = stai_network_run(net, (stai_run_mode)0);
    uint32_t t1 = dwt_counter_get();
    uint32_t dt = dwt_counter_elapsed(t0, t1);
    printf("CSV,fdbench,run_end,phase,bench,index,%lu,rc,%ld,cycles,%lu\r\n",
           (unsigned long)i, (long)rc, (unsigned long)dt);
    if (rc != STAI_SUCCESS)
    {
      printf("CSV,fdbench,error,bench_run,%lu,%ld\r\n", (unsigned long)i, (long)rc);
      return;
    }
    sum_cycles += (uint64_t)dt;
    if (dt < min_cycles) min_cycles = dt;
    if (dt > max_cycles) max_cycles = dt;
  }

  uint32_t avg_cycles = (uint32_t)(sum_cycles / (uint64_t)FACEDETECT_BENCH_ITERS);
  printf("CSV,fdbench,result,backend,%s,iters,%lu,avg_cycles,%lu,min_cycles,%lu,max_cycles,%lu,",
         bench_backend_name(),
         (unsigned long)FACEDETECT_BENCH_ITERS,
         (unsigned long)avg_cycles,
         (unsigned long)min_cycles,
         (unsigned long)max_cycles);
  print_ms_from_cycles("avg_ms", avg_cycles, hclk_hz);
  printf(",");
  print_ms_from_cycles("min_ms", min_cycles, hclk_hz);
  printf(",");
  print_ms_from_cycles("max_ms", max_cycles, hclk_hz);
  printf("\r\n");

  for (uint32_t o = 0; o < STAI_NETWORK_OUT_NUM; ++o)
  {
    const float *p = (const float *)outputs[o];
    float s = 0.0f;
    uint32_t n = 16u;
    for (uint32_t i = 0; i < n; ++i)
    {
      s += p[i];
    }
    printf("CSV,fdbench,out_checksum,%lu,%ld\r\n", (unsigned long)o, (long)(s * 1000000.0f));
  }
#if FACEDETECT_ENABLE_DECODE
  (void)fd_postprocess(outputs);
#endif
  (void)stai_network_deinit(net);
}
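The helpers dwt_counter_elapsed and print_ms_from_cycles are called in the benchmark but not listed. Below is a minimal sketch of the arithmetic they would need, assuming a free-running 32-bit cycle counter that wraps around. The fd_-prefixed names are hypothetical and are not the project's actual helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: elapsed cycles between two reads of a free-running
 * 32-bit counter. Unsigned subtraction is modulo 2^32, so one wraparound
 * between t0 and t1 is handled without a branch. */
static uint32_t fd_cycles_elapsed(uint32_t t0, uint32_t t1)
{
  return t1 - t0;
}

/* Hypothetical helper: convert cycles to whole microseconds, rounded down.
 * The 64-bit intermediate avoids overflow in cycles * 1000000. */
static uint32_t fd_cycles_to_us(uint32_t cycles, uint32_t hclk_hz)
{
  assert(hclk_hz != 0u);
  return (uint32_t)(((uint64_t)cycles * 1000000u) / hclk_hz);
}
```

At 96 MHz, fd_cycles_to_us(101803366u, 96000000u) gives 1060451 µs and fd_cycles_to_us(44581011u, 96000000u) gives 464385 µs, which agree with the 1060.45 ms and 464.38 ms averages reported in section 5. A 32-bit counter at 96 MHz wraps after about 44.7 s, so a single run of roughly one second is covered.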
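4.4 Post-processing
The model outputs two groups of feature maps, corresponding to 512 anchors and 384 anchors.
After inference, the MCU reads the output tensors, decodes them, and applies NMS to produce the final face boxes.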
static uint32_t fd_postprocess(const stai_ptr outputs[STAI_NETWORK_OUT_NUM])
{
  const float *boxes_512 = (const float *)outputs[0];
  const float *scores_512 = (const float *)outputs[1];
  const float *scores_384 = (const float *)outputs[2];
  const float *boxes_384 = (const float *)outputs[3];
  uint32_t cand_count = 0;
  uint32_t det_count = 0;
  const float score_thr = (float)FACEDETECT_SCORE_THR_MILLI / 1000.0f;
  const float nms_thr = (float)FACEDETECT_NMS_THR_MILLI / 1000.0f;

  if ((g_fd_anchor_count != FD_ANCHOR_COUNT) ||
      (boxes_512 == NULL) || (scores_512 == NULL) ||
      (scores_384 == NULL) || (boxes_384 == NULL))
  {
    printf("CSV,face,error,invalid_postprocess_input\r\n");
    return 0;
  }

  for (uint32_t i = 0; i < 512u; ++i)
  {
    fd_candidate_t c;
    if (fd_decode_one(&boxes_512[i * 16u], scores_512[i], &g_fd_anchors[i], &c, score_thr))
    {
      fd_insert_candidate(c, g_candidates, &cand_count);
    }
  }
  for (uint32_t i = 0; i < 384u; ++i)
  {
    fd_candidate_t c;
    if (fd_decode_one(&boxes_384[i * 16u], scores_384[i], &g_fd_anchors[512u + i], &c, score_thr))
    {
      fd_insert_candidate(c, g_candidates, &cand_count);
    }
  }

  /* Sort candidates by score descending. */
  for (uint32_t i = 0; i < cand_count; ++i)
  {
    uint32_t best = i;
    for (uint32_t j = i + 1u; j < cand_count; ++j)
    {
      if (g_candidates[j].score > g_candidates[best].score)
      {
        best = j;
      }
    }
    if (best != i)
    {
      fd_candidate_t tmp = g_candidates[i];
      g_candidates[i] = g_candidates[best];
      g_candidates[best] = tmp;
    }
  }

  for (uint32_t i = 0; (i < cand_count) && (det_count < FACEDETECT_MAX_DETECTIONS); ++i)
  {
    uint32_t keep = 1u;
    for (uint32_t j = 0; j < det_count; ++j)
    {
      if (fd_iou(&g_candidates[i], &g_detections[j]) > nms_thr)
      {
        keep = 0u;
        break;
      }
    }
    if (keep)
    {
      g_detections[det_count++] = g_candidates[i];
    }
  }

  printf("CSV,face,count,%lu,candidates,%lu,score_thr_milli,%lu,nms_thr_milli,%lu\r\n",
         (unsigned long)det_count,
         (unsigned long)cand_count,
         (unsigned long)FACEDETECT_SCORE_THR_MILLI,
         (unsigned long)FACEDETECT_NMS_THR_MILLI);
  for (uint32_t i = 0; i < det_count; ++i)
  {
    uint32_t score_milli = (uint32_t)(g_detections[i].score * 1000.0f + 0.5f);
    uint32_t x0_q10000 = (uint32_t)(g_detections[i].x0 * 10000.0f + 0.5f);
    uint32_t y0_q10000 = (uint32_t)(g_detections[i].y0 * 10000.0f + 0.5f);
    uint32_t x1_q10000 = (uint32_t)(g_detections[i].x1 * 10000.0f + 0.5f);
    uint32_t y1_q10000 = (uint32_t)(g_detections[i].y1 * 10000.0f + 0.5f);
    uint32_t x0_px = (uint32_t)(g_detections[i].x0 * (float)FD_INPUT_W + 0.5f);
    uint32_t y0_px = (uint32_t)(g_detections[i].y0 * (float)FD_INPUT_H + 0.5f);
    uint32_t x1_px = (uint32_t)(g_detections[i].x1 * (float)FD_INPUT_W + 0.5f);
    uint32_t y1_px = (uint32_t)(g_detections[i].y1 * (float)FD_INPUT_H + 0.5f);
    printf("CSV,face,%lu,score_milli,%lu,x0_q10000,%lu,y0_q10000,%lu,x1_q10000,%lu,y1_q10000,%lu,x0_px,%lu,y0_px,%lu,x1_px,%lu,y1_px,%lu\r\n",
           (unsigned long)i,
           (unsigned long)score_milli,
           (unsigned long)x0_q10000,
           (unsigned long)y0_q10000,
           (unsigned long)x1_q10000,
           (unsigned long)y1_q10000,
           (unsigned long)x0_px,
           (unsigned long)y0_px,
           (unsigned long)x1_px,
           (unsigned long)y1_px);
  }
  return det_count;
}
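fd_iou is called in the NMS loop but not listed. The following is a minimal sketch of an intersection-over-union function for boxes stored as normalized corner coordinates. The struct layout is assumed from the fields that fd_postprocess reads (score, x0, y0, x1, y1), and the _sketch names are hypothetical rather than the project's own.

```c
/* Hypothetical stand-in for fd_candidate_t: a score plus the top-left
 * (x0, y0) and bottom-right (x1, y1) corners in normalized coordinates. */
typedef struct
{
  float score;
  float x0, y0, x1, y1;
} fd_box_sketch_t;

/* Intersection over union of two axis-aligned boxes.
 * Returns 0 when the boxes do not overlap or the union area is not positive. */
static float fd_iou_sketch(const fd_box_sketch_t *a, const fd_box_sketch_t *b)
{
  /* Corners of the intersection rectangle. */
  float ix0 = (a->x0 > b->x0) ? a->x0 : b->x0;
  float iy0 = (a->y0 > b->y0) ? a->y0 : b->y0;
  float ix1 = (a->x1 < b->x1) ? a->x1 : b->x1;
  float iy1 = (a->y1 < b->y1) ? a->y1 : b->y1;
  float iw = ix1 - ix0;
  float ih = iy1 - iy0;

  if ((iw <= 0.0f) || (ih <= 0.0f))
  {
    return 0.0f;
  }

  float inter = iw * ih;
  float area_a = (a->x1 - a->x0) * (a->y1 - a->y0);
  float area_b = (b->x1 - b->x0) * (b->y1 - b->y0);
  float uni = area_a + area_b - inter;

  return (uni > 0.0f) ? (inter / uni) : 0.0f;
}
```

Applied to the two final boxes logged in section 5.4, (0.3104, 0.1543, 0.5069, 0.3507) and (0.4668, 0.3971, 0.7447, 0.6750), the vertical ranges do not overlap, so the IoU is 0 and both boxes pass the 0.3 NMS threshold.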
5. Test results
5.1 CPU test results
CPU backend runtime log:

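The CPU backend was run 5 times in a row, and the results are stable:
| Metric | Value |
| --- | --- |
| Average cycles | 101,803,366 |
| Minimum cycles | 101,803,333 |
| Maximum cycles | 101,803,405 |
| Average time | 1060.45 ms |
| Minimum time | 1060.45 ms |
| Maximum time | 1060.45 ms |
Checksums of the 4 output tensors for the CPU backend: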
CSV,fdbench,out_checksum,0,51526964
CSV,fdbench,out_checksum,1,-57252232
CSV,fdbench,out_checksum,2,-371083584
CSV,fdbench,out_checksum,3,98565056
5.2 HSP test results
HSP backend runtime log:

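The HSP backend was also run 5 times in a row, and the results are equally stable:
| Metric | Value |
| --- | --- |
| Average cycles | 44,581,011 |
| Minimum cycles | 44,580,779 |
| Maximum cycles | 44,581,176 |
| Average time | 464.38 ms |
| Minimum time | 464.38 ms |
| Maximum time | 464.38 ms |
Checksums of the 4 output tensors for the HSP backend: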
CSV,fdbench,out_checksum,0,51526964
CSV,fdbench,out_checksum,1,-57252232
CSV,fdbench,out_checksum,2,-371083584
CSV,fdbench,out_checksum,3,98565056
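The 4 output checksums of the HSP backend are identical to those of the CPU backend. Each checksum sums only the first 16 floats of its output tensor (n = 16u in the benchmark loop), so this is a spot check that the two backends agree on the current input rather than a full tensor comparison.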
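Since the benchmark loop sums only the first 16 floats of each output (n = 16u), a checksum over the whole tensor would give a stronger comparison between the two backends. One possible sketch hashes the raw bytes with 32-bit FNV-1a. This is a swapped-in technique, not what the project does, and fd_fnv1a32 is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 32-bit FNV-1a hash over a byte range.
 * Any bit-level difference in the buffer changes the input to the hash,
 * so equal hashes on both backends are strong evidence of equal tensors. */
static uint32_t fd_fnv1a32(const void *data, size_t len)
{
  const uint8_t *p = (const uint8_t *)data;
  uint32_t h = 2166136261u;          /* FNV offset basis */

  for (size_t i = 0; i < len; ++i)
  {
    h ^= p[i];
    h *= 16777619u;                  /* FNV prime */
  }
  return h;
}
```

In the benchmark this would be called once per output, as in fd_fnv1a32(outputs[o], size_in_bytes), with the byte count of each output taken from the generated network.h, and the printed hashes compared between the CPU and HSP logs.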
5.3 HSP vs. CPU performance comparison
| Item | CPU backend | HSP backend |
| --- | --- | --- |
| Nodes | 110 | 94 |
| MACC | 31,849,356 | 31,849,356 |
| Activation | 328,056 B | 327,680 B |
| Average cycles | 101,803,366 | 44,581,011 |
| Average time | 1060.45 ms | 464.38 ms |
| Output checksum | Identical | Identical |
| Detection boxes | Identical | Identical |
The HSP speedup is computed as:
speedup = CPU avg time / HSP avg time
= 1060.45 ms / 464.38 ms
≈ 2.28x
Reduction in inference time:
reduction = 1 - HSP avg time / CPU avg time
= 1 - 464.38 / 1060.45
≈ 56.2%
In this forward-only FaceDetect test, the HSP backend is therefore about 2.28 times faster than the CPU backend, and inference time drops by about 56.2%.
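The same two ratios can be computed on the MCU directly from the measured cycle counts, without floating point. The following is a small sketch with hypothetical function names, using scaled integers so the result can be printed with %lu.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: speedup scaled by 1000 (2283 means 2.283x),
 * rounded down. */
static uint32_t fd_speedup_x1000(uint32_t cpu_cycles, uint32_t hsp_cycles)
{
  assert(hsp_cycles != 0u);
  return (uint32_t)(((uint64_t)cpu_cycles * 1000u) / hsp_cycles);
}

/* Hypothetical helper: time reduction in permille (562 means 56.2%),
 * rounded down. */
static uint32_t fd_reduction_permille(uint32_t cpu_cycles, uint32_t hsp_cycles)
{
  assert((cpu_cycles != 0u) && (hsp_cycles <= cpu_cycles));
  return (uint32_t)(((uint64_t)(cpu_cycles - hsp_cycles) * 1000u) / cpu_cycles);
}
```

With the averages above, fd_speedup_x1000(101803366u, 44581011u) yields 2283 and fd_reduction_permille(101803366u, 44581011u) yields 562, matching the 2.28x and 56.2% figures. Because both runs use the same clock, the ratio of cycles equals the ratio of times.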
5.4 Final face detection results
On the test image, both the CPU backend and the HSP backend output 2 final detection boxes, with identical results:
CSV,face,count,2,candidates,5,score_thr_milli,750,nms_thr_milli,300
CSV,face,0,score_milli,864,x0_q10000,3104,y0_q10000,1543,x1_q10000,5069,y1_q10000,3507,x0_px,40,y0_px,20,x1_px,65,y1_px,45
CSV,face,1,score_milli,850,x0_q10000,4668,y0_q10000,3971,x1_q10000,7447,y1_q10000,6750,x0_px,60,y0_px,51,x1_px,95,y1_px,86
The detection boxes are shown below:

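Summary
This test deployed and measured a lightweight face detection model on the STM32U3C5 using the HSP. The results show that the STM32U3C5 can run a lightweight FaceDetect model with 128×128 input, and that the HSP backend and the CPU backend produce the same number of final detection boxes with identical coordinates and confidence scores.
The HSP backend is about 2.28 times faster than the CPU backend, cutting inference time by about 56.2%.
Overall, the HSP of the STM32U3C5 effectively accelerates lightweight CNN inference while matching the CPU backend's detection results in this face detection test.
For image-based edge AI on low-power MCUs, the HSP substantially reduces forward-pass latency (from 1060.45 ms to 464.38 ms here), which provides a workable basis for future real-time camera detection, low-power intermittent wake-up detection, and local AI pre-screening.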