Fixing GPU crashes

The kernel apparently does not set the correct clock speeds for certain AMD GPUs, which can lead to crashes. To find the correct speeds for your card, see the Gentoo Wiki entry.

Values for XFX MERC319 RX 6900 XT Black

               Base Clock  Game Clock  Boost Clock    Effective Memory Clock  Effective VRAM Bus  Bandwidth
Specification  1950 MHz    2135 MHz    2365 MHz       2000 MHz (16 GB/s)      256-bit             512 GB/s
sysfs          -           2660 MHz    2504/3000 MHz  2150 MHz (16 GB/s)      256-bit             512 GB/s (@1075 MHz)

Memory Clock (effective data rate) = base DRAM clock rate * transfers per clock * number of channels (1000 MHz * 2 T (double data rate) * 8 = 16_000 MT/s). The double data rate comes from using GDDR6 (Graphics Double Data Rate 6) memory.

Bandwidth = data_rate * bus_width / 8 (16_000 MT/s * 256 bits/T = 4_096_000 Mb/s; 4_096_000 Mb/s / 8 = 512_000 MB/s = 512 GB/s)
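
The two formulas above can be checked with plain shell arithmetic. This is only a worked example using the numbers from this page; the variable names are made up for illustration.

```shell
#!/bin/sh
# Worked arithmetic for the figures above (integer math only).
base_clock_mhz=1000      # active mclk level as reported by pp_dpm_mclk
transfers_per_clock=2    # double data rate
multiplier=8             # the factor of 8 used in the formula above
bus_width_bits=256

# Effective data rate in MT/s.
data_rate_mts=$((base_clock_mhz * transfers_per_clock * multiplier))

# Bandwidth in MB/s: bits per second divided by 8 bits per byte.
bandwidth_mbs=$((data_rate_mts * bus_width_bits / 8))

echo "data rate: ${data_rate_mts} MT/s"
echo "bandwidth: ${bandwidth_mbs} MB/s"
```

With the values above this prints 16000 MT/s and 512000 MB/s, matching the 512 GB/s in the specification row.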

Core clock values (the asterisk marks the currently active level):

cat /sys/class/drm/card0/device/pp_dpm_sclk
0: 500Mhz *
1: 2660Mhz

Memory clock values:

cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 96Mhz
1: 456Mhz
2: 673Mhz
3: 1000Mhz *
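
If you want to read the active level from a script, something like the following sketch may help. It assumes the output format shown above (index, clock, optional asterisk); active_clock is a hypothetical helper name, and the sample text is copied from the listing above so the sketch runs without touching sysfs.

```shell
#!/bin/sh
# active_clock: read pp_dpm_* style output on stdin and print the clock
# (in MHz, unit stripped) of the level marked with "*".
active_clock() {
  awk '/\*/ { sub(/[Mm][Hh]z/, "", $2); print $2 }'
}

# Sample input copied from the pp_dpm_mclk listing above. On a real system
# one would pipe /sys/class/drm/card0/device/pp_dpm_mclk in instead.
sample='0: 96Mhz
1: 456Mhz
2: 673Mhz
3: 1000Mhz *'

active_mclk=$(printf '%s\n' "$sample" | active_clock)
echo "active mclk: ${active_mclk} MHz"
```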

Overdrive (boost) clock values:

cat /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 2504Mhz
OD_MCLK:
0: 97Mhz
1: 1000Mhz
OD_VDDGFX_OFFSET:
-50mV
OD_RANGE:
SCLK:     500Mhz       3000Mhz
MCLK:     674Mhz       1075Mhz
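
Before writing a new clock it can be useful to check it against OD_RANGE. The following is a minimal sketch that assumes the pp_od_clk_voltage layout shown above; in_od_range is a hypothetical helper, and the sample text is taken from the listing above.

```shell
#!/bin/sh
# in_od_range KIND MHZ: read pp_od_clk_voltage style text on stdin and
# succeed only if MHZ lies inside the OD_RANGE line for KIND (SCLK or MCLK).
in_od_range() {
  awk -v kind="$1:" -v want="$2" '
    $1 == "OD_RANGE:" { seen = 1; next }   # only look at lines after OD_RANGE:
    seen && $1 == kind {
      # "500Mhz" + 0 evaluates to 500 in awk, so the unit suffix is ignored.
      ok = (want + 0 >= $2 + 0 && want + 0 <= $3 + 0)
      found = 1
    }
    END { exit (found && ok) ? 0 : 1 }
  '
}

# Sample copied from the output above.
od_sample='OD_SCLK:
0: 500Mhz
1: 2504Mhz
OD_MCLK:
0: 97Mhz
1: 1000Mhz
OD_RANGE:
SCLK:     500Mhz       3000Mhz
MCLK:     674Mhz       1075Mhz'

if printf '%s\n' "$od_sample" | in_od_range SCLK 2365; then
  echo "sclk 2365 MHz is within range"
fi
```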

Setting the right values

Testing the values manually

First switch the performance level to manual (as root):

echo 'manual' > /sys/class/drm/card0/device/power_dpm_force_performance_level

Taken from the kernel documentation:

For clock frequency setting, enter a new value by writing a string that contains “s/m index clock” to the file. The index should be 0 if to set minimum clock. And 1 if to set maximum clock. E.g., “s 0 500” will update minimum sclk to be 500 MHz. “m 1 800” will update maximum mclk to be 800Mhz. For core clocks on VanGogh, the string contains “p core index clock”. E.g., “p 2 0 800” would set the minimum core clock on core 2 to 800Mhz.

When you have edited all of the states as needed, write “c” (commit) to the file to commit your changes.

echo 's 1 2365' > /sys/class/drm/card0/device/pp_od_clk_voltage
echo 'm 1 1000' > /sys/class/drm/card0/device/pp_od_clk_voltage
echo 'c' > /sys/class/drm/card0/device/pp_od_clk_voltage
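
The stage-then-commit sequence described in the kernel documentation can be wrapped in a small function. This is only a sketch: od_write, apply_clocks and the OD_FILE override are names invented here, not part of any tool. Nothing is written until apply_clocks is called.

```shell
#!/bin/sh
# OD_FILE is a hypothetical override so the sketch can be pointed at a
# scratch file instead of the real sysfs node.
OD_FILE="${OD_FILE:-/sys/class/drm/card0/device/pp_od_clk_voltage}"

# od_write CMD: write one command string to the overdrive file.
od_write() {
  printf '%s\n' "$1" > "$OD_FILE"
}

# apply_clocks SCLK_MAX MCLK_MAX: stage both maximum clocks (index 1),
# then commit with "c". Stops at the first write that fails.
apply_clocks() {
  od_write "s 1 $1" &&
  od_write "m 1 $2" &&
  od_write "c"
}
```

Calling apply_clocks 2365 1000 as root would issue the same three writes as the commands above.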

Persisting the settings

I persisted the settings by adding an OpenRC service 'overclock' and adding it to the default runlevel.

/etc/init.d/overclock

#!/sbin/openrc-run
# Copyright 2023 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2

name="overclock daemon"
description="overclock / undervolt the amd gpu"
command=/usr/bin/overclock
command_args="${overclock_args}"

depend() {
  need dev-mount
}

stop() {
  ebegin "Resetting gpu clock settings"
  /usr/bin/overclock_stop
  eend $?
}

/usr/bin/overclock

#!/bin/sh
# Stage the maximum core and memory clocks and the voltage offset, then commit.
echo 's 1 2365' | tee /sys/class/drm/card0/device/pp_od_clk_voltage > /dev/null
echo 'm 1 1000' | tee /sys/class/drm/card0/device/pp_od_clk_voltage > /dev/null
echo "vo -50" | tee /sys/class/drm/card0/device/pp_od_clk_voltage > /dev/null # this one undervolts by 50mV
echo "c" | tee /sys/class/drm/card0/device/pp_od_clk_voltage > /dev/null

/usr/bin/overclock_stop

#!/bin/sh
# "r" resets the clock and voltage settings to their defaults.
echo "r" | tee /sys/class/drm/card0/device/pp_od_clk_voltage > /dev/null
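
All paths on this page assume the GPU is card0. On systems with more than one graphics device that may not hold, so here is a sketch for looking the card up by its PCI vendor ID (0x1002 is AMD). find_amd_card is a hypothetical helper, and its optional directory argument exists only so it can be tried against a test directory.

```shell
#!/bin/sh
# find_amd_card [DRM_DIR]: print the first cardN directory whose PCI vendor
# ID is 0x1002 (AMD). DRM_DIR defaults to /sys/class/drm.
find_amd_card() {
  drm_dir="${1:-/sys/class/drm}"
  for vendor_file in "$drm_dir"/card[0-9]*/device/vendor; do
    # Skip unmatched globs and entries without a readable vendor file.
    [ -r "$vendor_file" ] || continue
    if [ "$(cat "$vendor_file")" = "0x1002" ]; then
      # Strip the trailing /device/vendor to get the card directory.
      echo "${vendor_file%/device/vendor}"
      return 0
    fi
  done
  return 1
}

# Example (not executed here):
#   card=$(find_amd_card) && echo "$card/device/pp_od_clk_voltage"
```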