
Better Raspberry Pi server performance #2172

Open · wants to merge 1 commit into master
Conversation

chewi (Contributor) commented Feb 25, 2024

Description

Now I reveal what I really want to use Sunshine for. As a server on the Raspberry Pi! Why would I want such a thing? Surely it makes more sense as a client? Normally yes, but when combined with the PiStorm project, things get very interesting.

As you might imagine, PiStorm is very CPU-intensive, so for this to be feasible, Sunshine needs to use as little CPU as possible. The first step here was obviously to get hardware video encoding to work. The Pi does not support VAAPI or CUDA, but fortunately, this still turned out to be very easy.

These initial changes to add a V4L2M2M encoder did not work for me at first, as Sunshine claimed that an IDR frame was not produced. Digging around in the internals, it looked very much to me like requesting IDR frames should work on the Pi. As a shot in the dark, I applied John Cox's ffmpeg patchset for the Raspberry Pi. This patchset, which I recently applied to Gentoo's ffmpeg package, enables efficient zero-copy video playback on the Pi; with it, I have seen 1080p videos go from a stuttery mess to buttery smooth. Since the patchset is playback-focused, I really didn't expect it to help, but I was delighted when the encoder suddenly sprang to life!

[2024:02:25:17:15:54]: Info: Found H.264 encoder: h264_v4l2m2m [V4L2M2M]
[2024:02:25:17:15:54]: Info: Executing [Desktop]
[2024:02:25:17:15:54]: Info: CLIENT CONNECTED
[2024:02:25:17:15:54]: Warning: No render device name for: /dev/dri/card1
[2024:02:25:17:15:55]: Error: Couldn't expose some/all drm planes for card: /dev/dri/card0
[2024:02:25:17:15:55]: Info: Screencasting with KMS
[2024:02:25:17:15:55]: Warning: No render device name for: /dev/dri/card1
[2024:02:25:17:15:55]: Info: Found monitor for DRM screencasting
[2024:02:25:17:15:55]: Info: Found connector ID [32]
[2024:02:25:17:15:55]: Info: Found cursor plane [309]
[2024:02:25:17:15:55]: Info: SDR color coding [Rec. 601]
[2024:02:25:17:15:55]: Info: Color depth: 8-bit
[2024:02:25:17:15:55]: Info: Color range: [MPEG]
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160]  <<< v4l2_encode_init: fmt=0/0
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] Using device /dev/video11
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] driver 'bcm2835-codec' on card 'bcm2835-codec-encode' in mplane mode
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] requesting formats: output=YU12/yuv420p capture=H264/none
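For reference, the IDR request that initially appeared to fail amounts to very little on the FFmpeg side: the next frame is marked as a forced I picture. A minimal sketch of the mechanism (the function name is illustrative, not Sunshine's exact internals; whether h264_v4l2m2m honours the flag is apparently what John Cox's patchset fixes):

    extern "C" {
    #include <libavutil/frame.h>
    }

    // Ask the encoder to emit an IDR (plus SPS/PPS) for the next frame.
    void mark_frame_as_idr(AVFrame *frame) {
      frame->pict_type = AV_PICTURE_TYPE_I;
    }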

The quality isn't fantastic though, and it's still using 275% CPU, so I utilised gprof to find where all that effort is going.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 51.88     10.78    10.78                             ff_hscale16to15_X4_neon_asm
 18.48     14.62     3.84                             ff_yuv2planeX_8_neon
 13.47     17.42     2.80   156694     0.00     0.00  bgr32ToUV_half_c
 11.98     19.91     2.49   155935     0.00     0.00  bgr32ToY_c
  0.67     20.05     0.14                             ff_hscale16to15_4_neon_asm
  0.53     20.16     0.11      142     0.00     0.00  std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > > std::__copy_move_a1<false, char const*, std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > > >(char const*, char const*, std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > >)
  0.43     20.25     0.09      284     0.00     0.02  scale_internal
  0.38     20.33     0.08    94424     0.00     0.00  chr_planar_vscale
  0.29     20.39     0.06    38454     0.00     0.00  chr_convert
  0.29     20.45     0.06    38383     0.00     0.00  chr_h_scale
  0.24     20.50     0.05      577     0.00     0.00  yuv2planeX_8_c
  0.19     20.54     0.04    60422     0.00     0.00  lum_convert
  0.19     20.58     0.04        1     0.04     2.95  video::capture_async(std::shared_ptr<safe::mail_raw_t>, video::config_t&, void*)
  0.14     20.61     0.03     2133     0.00     0.00  lumRangeToJpeg_c
  0.14     20.64     0.03                             _init
  0.10     20.66     0.02   103963     0.00     0.00  lum_planar_vscale
  0.10     20.68     0.02       24     0.00     0.00  alloc_gamma_tbl
  0.05     20.69     0.01   955063     0.00     0.00  av_pix_fmt_desc_get
  0.05     20.70     0.01    59959     0.00     0.00  lum_h_scale
  0.05     20.71     0.01     6502     0.00     0.00  obl_axpy
  0.05     20.72     0.01     2148     0.00     0.00  chrRangeToJpeg_c
  0.05     20.73     0.01     2081     0.00     0.00  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
  0.05     20.74     0.01      483     0.00     0.00  av_frame_unref
  0.05     20.75     0.01      376     0.00     0.00  stream::control_server_t::call(unsigned short, stream::session_t*, std::basic_string_view<char, std::char_traits<char> > const&, bool)
  0.05     20.76     0.01        3     0.00     0.00  video::avcodec_encode_session_t::request_idr_frame()
  0.05     20.77     0.01                             av_bprint_escape
  0.05     20.78     0.01                             ff_hscale16to19_X4_neon_asm
  0.00     20.78     0.00   463475     0.00     0.00  ff_hscale16to15_X4_neon
  0.00     20.78     0.00   103794     0.00     0.00  ff_rotate_slice
  0.00     20.78     0.00    28314     0.00     0.00  av_opt_next
  0.00     20.78     0.00    11496     0.00     0.00  av_bprint_init
  0.00     20.78     0.00     9141     0.00     0.00  av_buffer_unref
  0.00     20.78     0.00     7378     0.00     0.00  glad_gl_get_proc_from_userptr
  0.00     20.78     0.00     7306     0.00     0.00  enet_list_clear
  0.00     20.78     0.00     7184     0.00     0.00  enet_protocol_send_outgoing_commands
  0.00     20.78     0.00     6975     0.00     0.00  enet_time_get
  0.00     20.78     0.00     6812     0.00     0.00  config::whitespace(char)
  0.00     20.78     0.00     6433     0.00     0.00  ff_hscale16to15_4_neon

This is not my area of expertise, but it looks like finding the right format might be the key, and I'd appreciate any help you can provide. I know that John Cox's patchset adds support for the Pi-specific SAND formats, but I don't know whether they are usable in this context.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Dependency update (updates to dependencies)
  • Documentation update (changes to documentation)
  • Repository update (changes to repository files, e.g. .github/...)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated the in code docstring/documentation-blocks for new or existing methods/components

Branch Updates

LizardByte requires that branches be up-to-date before merging. This means that after any PR is merged, this branch must be updated before it can be merged. You must also allow edits from maintainers.

  • I want maintainers to keep my branch updated

A review comment was left on these lines of the encoder's option list:

    {},
    // Fallback options
    {},
    std::make_optional<encoder_t::option_t>("qp"s, &config::video.qp),
Contributor commented:
Does the encoder really not support CBR/VBR bitrate control? QP shouldn't be provided if CBR or VBR is available.

chewi (author) replied:
Probably, I was just copying what the others did as a first step. I'll give it a try.
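For what it's worth, one way to check what bcm2835-codec actually offers before wiring anything up is to query its V4L2 controls directly. A minimal sketch, assuming the /dev/video11 node from the log above (the control IDs are standard V4L2; whether this driver exposes them is precisely the open question):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>
    #include <cstdio>

    int main() {
      // Encoder node taken from the Sunshine log above.
      int fd = open("/dev/video11", O_RDWR);
      if (fd < 0) { perror("open"); return 1; }

      // BITRATE_MODE selects CBR/VBR; BITRATE sets the target rate.
      const unsigned ids[] = { V4L2_CID_MPEG_VIDEO_BITRATE_MODE, V4L2_CID_MPEG_VIDEO_BITRATE };
      for (unsigned id : ids) {
        v4l2_queryctrl q {};
        q.id = id;
        if (ioctl(fd, VIDIOC_QUERYCTRL, &q) == 0 && !(q.flags & V4L2_CTRL_FLAG_DISABLED)) {
          std::printf("supported: %s (min=%d max=%d)\n", (const char *) q.name, q.minimum, q.maximum);
        } else {
          std::printf("control 0x%x not supported; QP may be the only option\n", id);
        }
      }
      close(fd);
      return 0;
    }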

cgutman (Contributor) commented Feb 25, 2024

As a server on the Raspberry Pi!

This is a Pi 4, I assume? I don't think the Pi 5 has any hardware encoders anymore.

This is not my area of expertise, but it looks like finding the right format might be the key here. I'd appreciate any help you can provide here. I know that John Cox's patchset adds support for Pi-specific SAND formats, but I don't know whether they are usable in this context.

Yeah, it's all in the RGB->YUV color conversion code, which is expected since it's doing all the color conversion on the CPU. I guess it's nice that it's multi-threaded now. You can adjust the "Minimum CPU Thread Count" on the Advanced tab in the UI if you want to play with the amount of concurrency there.
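To make the profile concrete, the hot functions above come from a libswscale conversion roughly like the following. This is a sketch: the BGR0 source format is inferred from the bgr32ToY_c/bgr32ToUV_half_c entries, and YUV420P matches the YU12 format requested in the encoder log.

    extern "C" {
    #include <libswscale/swscale.h>
    #include <libavutil/frame.h>
    }

    // One frame's worth of the CPU work dominating the profile above.
    AVFrame *convert_to_yuv(const AVFrame *rgb) {
      SwsContext *sws = sws_getContext(
        rgb->width, rgb->height, AV_PIX_FMT_BGR0,     // captured desktop frame
        rgb->width, rgb->height, AV_PIX_FMT_YUV420P,  // encoder input (YU12)
        SWS_BILINEAR, nullptr, nullptr, nullptr);

      AVFrame *yuv = av_frame_alloc();
      yuv->format = AV_PIX_FMT_YUV420P;
      yuv->width = rgb->width;
      yuv->height = rgb->height;
      av_frame_get_buffer(yuv, 0);

      // The ff_hscale*/ff_yuv2plane*/bgr32To* cycles are all spent in here.
      sws_scale(sws, rgb->data, rgb->linesize, 0, rgb->height,
                yuv->data, yuv->linesize);
      sws_freeContext(sws);
      return yuv;
    }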

What your encoding pipeline looks like now:
RGB framebuffer DMA-BUF from KMS capture -> import to EGL (eglCreateImage) -> readback from EGL to CPU (glGetTextureSubImage) -> RGB to YUV conversion and scaling (libswscale) -> upload to DMA-BUF again -> encode the DMA-BUF

What you want is more like what we do with VAAPI:
RGB framebuffer DMA-BUF from KMS capture -> import to EGL (eglCreateImage) -> render using color conversion shaders into another DMA-BUF -> pass that DMA-BUF (AV_PIX_FMT_DRM_PRIME) to h264_v4l2m2m.

Most of that pipeline is simple and already written in Sunshine. The tricky part will be getting that second DMA-BUF to write into and/or exporting the render target as a DMA-BUF. Since there's no standard way to create a DMA-BUF, that part tends to be highly API-specific. For VAAPI, we import the underlying DMA-BUF of the VA surface as the render target for our color conversion. For CUDA, we create a blank texture to use as the render target and use the CUDA-GL interop APIs to import that texture as a CUDA resource for NVENC to read.
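For the export half on this platform, one plausible route is Mesa's EGL_MESA_image_dma_buf_export extension, which turns an EGLImage-backed render target into a DMA-BUF. A sketch under the assumption that the Pi's V3D driver exposes the extension (worth confirming with eglQueryString(EGL_EXTENSIONS) first):

    #include <EGL/egl.h>
    #include <EGL/eglext.h>

    // Result of exporting a single-plane render target as a DMA-BUF.
    struct exported_dmabuf {
      int fd;         // file descriptor to hand to the encoder
      EGLint stride;  // bytes per row
      EGLint offset;  // plane offset within the buffer
      int fourcc;     // DRM format code
    };

    bool export_render_target(EGLDisplay dpy, EGLImageKHR image, exported_dmabuf &out) {
      auto query = (PFNEGLEXPORTDMABUFIMAGEQUERYMESAPROC)
        eglGetProcAddress("eglExportDMABUFImageQueryMESA");
      auto do_export = (PFNEGLEXPORTDMABUFIMAGEMESAPROC)
        eglGetProcAddress("eglExportDMABUFImageMESA");
      if (!query || !do_export) return false;  // extension missing

      int num_planes = 0;
      if (!query(dpy, image, &out.fourcc, &num_planes, nullptr)) return false;
      if (num_planes != 1) return false;  // keep the sketch single-plane

      return do_export(dpy, image, &out.fd, &out.stride, &out.offset);
    }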

Where to start is probably writing something like this for AV_HWDEVICE_TYPE_DRM and using that in your encoder.
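The device-creation half, at least, is the standard FFmpeg hwcontext API. A minimal sketch (the card path comes from the KMS log earlier; error handling trimmed):

    extern "C" {
    #include <libavutil/hwcontext.h>
    }

    AVBufferRef *create_drm_hwdevice() {
      AVBufferRef *dev = nullptr;
      // hwcontext_drm.c opens the given DRM node itself; no options needed.
      if (av_hwdevice_ctx_create(&dev, AV_HWDEVICE_TYPE_DRM,
                                 "/dev/dri/card0", nullptr, 0) < 0) {
        return nullptr;
      }
      return dev;
    }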

Then for your encoder definition you probably want something like this:

    std::make_unique<encoder_platform_formats_avcodec>(
      // base hardware device type, and derived device type (none needed)
      AV_HWDEVICE_TYPE_DRM, AV_HWDEVICE_TYPE_NONE,
      // pixel format handed to the encoder: DMA-BUF backed DRM PRIME frames
      AV_PIX_FMT_DRM_PRIME,
      // software pixel formats for 8-bit and 10-bit content
      AV_PIX_FMT_NV12, AV_PIX_FMT_P010,
      drm_init_avcodec_hardware_input_buffer),

Since FFmpeg's hwcontext_drm.c doesn't support frame allocation, you'll need to figure out how to do that yourself and provide a buffer pool for the frames.
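The per-frame piece of that gap might look like the following: wrapping a DMA-BUF (for example, one exported via the EGL sketch above) into an AVFrame that h264_v4l2m2m can consume as DRM PRIME. The single-plane layout and fd ownership are simplifying assumptions, and a real implementation would recycle these through av_buffer_pool_init rather than allocating per frame:

    extern "C" {
    #include <libavutil/frame.h>
    #include <libavutil/hwcontext_drm.h>
    #include <libavutil/mem.h>
    }
    #include <unistd.h>

    static void free_drm_desc(void *, uint8_t *data) {
      auto *desc = (AVDRMFrameDescriptor *) data;
      close(desc->objects[0].fd);  // assumes the frame owns the DMA-BUF fd
      av_free(desc);
    }

    AVFrame *wrap_dmabuf(int fd, int width, int height, int pitch, uint32_t drm_fourcc) {
      auto *desc = (AVDRMFrameDescriptor *) av_mallocz(sizeof(*desc));
      desc->nb_objects = 1;
      desc->objects[0].fd = fd;
      desc->nb_layers = 1;
      desc->layers[0].format = drm_fourcc;  // DRM fourcc of the render target
      desc->layers[0].nb_planes = 1;        // simplification; NV12 has two planes
      desc->layers[0].planes[0].object_index = 0;
      desc->layers[0].planes[0].pitch = pitch;

      AVFrame *frame = av_frame_alloc();
      frame->format = AV_PIX_FMT_DRM_PRIME;  // data[0] points at the descriptor
      frame->width = width;
      frame->height = height;
      frame->data[0] = (uint8_t *) desc;
      frame->buf[0] = av_buffer_create((uint8_t *) desc, sizeof(*desc),
                                       free_drm_desc, nullptr, 0);
      return frame;
    }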

Finally, on the encoding side, you'll want to do something similar to what I did in 8182f59 for supporting KMS->GL->CUDA with the gl_cuda_vram_t and make_avcodec_gl_encode_device.

chewi (author) commented Feb 25, 2024

Many thanks for the detailed reply. Sounds like this could be an interesting exercise. I may be wrong, but I think playback scenarios have managed to avoid GL altogether, using what Kodi calls Direct to Plane and mpv calls HW-overlay. Is that not possible here?

cgutman (Contributor) commented Feb 25, 2024

I think that color conversion hardware is only accessible on the scanout path (and it's YUV->RGB, not RGB->YUV). Some encoders do have the ability to accept RGB frames and perform the conversion to YUV internally (using dedicated hardware or a shader), but I don't think the Pi's encoder supports RGB input.

ReenigneArcher added this to the adjust lint rules milestone Feb 28, 2024