Fixing GPU crashes
Apparently, the kernel does not set the correct clock speeds for certain AMD GPUs. To find the correct speeds for your card, take a look at the Gentoo Wiki entry.
Values for XFX MERC19 RX 6900 XT Black
| | Base Clock | Game Clock | Boost Clock | Effective Memory Clock | Effective VRAM Bus | Bandwidth |
|---|---|---|---|---|---|---|
| Specification | 1950 MHz | 2135 MHz | 2365 MHz | 2000 MHz (16 Gbit/s) | 256-bit | 512 GB/s |
| sysfs | - | 2660 MHz | 2504/3000 MHz | 2150 MHz (16 Gbit/s) | 256-bit | 512 GB/s (@1075 MHz) |
Effective memory clock = base DRAM clock rate * 2 (double data rate) * number of channels: 1000 MHz * 2 T * 8 = 16_000 MT/s. The double data rate comes from using GDDR6 (Graphics Double Data Rate 6) memory.
Bandwidth = data_rate * bus_width / 8: 16_000 MT/s * 256 bits/T = 4_096_000 Mbit/s = 512_000 MB/s = 512 GB/s.
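The arithmetic above can be double-checked with a few lines of shell. This is only a sketch: the function names `data_rate_mts` and `bandwidth_mbs` are made up for illustration, and the factor 8 is simply the multiplier used in the text.

```shell
#!/bin/sh
# Effective data rate in MT/s from the reported memory clock in MHz.
# 2 = double data rate, 8 = the multiplier used in the text above.
data_rate_mts() {
    echo $(( $1 * 2 * 8 ))
}

# Bandwidth in MB/s from the data rate (MT/s) and the bus width (bits).
bandwidth_mbs() {
    echo $(( $1 * $2 / 8 ))
}

rate=$(data_rate_mts 1000)
echo "data rate: ${rate} MT/s"
echo "bandwidth: $(bandwidth_mbs "$rate" 256) MB/s"
```

For a 1000 MHz memory clock and a 256-bit bus this prints 16000 MT/s and 512000 MB/s, matching the specification row of the table.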
Core clock (sclk) values:
cat /sys/class/drm/card0/device/pp_dpm_sclk
0: 500Mhz *
1: 2660Mhz
Memory clock values:
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 96Mhz
1: 456Mhz
2: 673Mhz
3: 1000Mhz *
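In both listings the currently active level is the line marked with `*`. If you only want that value, something like the following should work. It is a rough sketch: `active_clock` is a hypothetical helper name, and the default path assumes the GPU is `card0`.

```shell
#!/bin/sh
# Print the active clock (in MHz, without the unit) of a pp_dpm_* file.
# The active level is the line that carries a '*'.
# Usage: active_clock [file]
active_clock() {
    file=${1:-/sys/class/drm/card0/device/pp_dpm_mclk}
    awk '/\*/ { sub(/[Mm][Hh][Zz]$/, "", $2); print $2 }' "$file"
}
```

With the memory clock output shown above, `active_clock` would print `1000`; pass `/sys/class/drm/card0/device/pp_dpm_sclk` as the argument to query the core clock instead.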
Overdrive (boost) clock values:
cat /sys/class/drm/card0/device/pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 2504Mhz
OD_MCLK:
0: 97Mhz
1: 1000MHz
OD_VDDGFX_OFFSET:
-50mV
OD_RANGE:
SCLK: 500Mhz 3000Mhz
MCLK: 674Mhz 1075Mhz
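The `OD_RANGE` block lists the limits the driver reports for each clock, so it is worth checking a value against it before writing anything. A minimal sketch, assuming the output format shown above; `in_od_range` is a hypothetical helper name and the default path assumes `card0`.

```shell
#!/bin/sh
# Succeed if a clock (MHz) lies inside the OD_RANGE limits for SCLK or MCLK.
# Usage: in_od_range SCLK|MCLK MHZ [file]
in_od_range() {
    file=${3:-/sys/class/drm/card0/device/pp_od_clk_voltage}
    awk -v key="$1:" -v val="$2" '
        # Only look at lines that follow the OD_RANGE: header.
        $1 == "OD_RANGE:" { in_range = 1; next }
        in_range && $1 == key {
            min = $2; max = $3
            sub(/[Mm][Hh][Zz]$/, "", min)
            sub(/[Mm][Hh][Zz]$/, "", max)
            found = 1
            ok = (val + 0 >= min + 0 && val + 0 <= max + 0)
        }
        END { exit ((found && ok) ? 0 : 1) }
    ' "$file"
}
```

For example, `in_od_range SCLK 2365 && echo ok` should print `ok` for the range shown above, while `in_od_range MCLK 500` fails because 500 MHz is below the 674 MHz minimum.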
Setting the right values
Testing the values manually
echo 'manual' > /sys/class/drm/card0/device/power_dpm_force_performance_level
Taken from the kernel documentation:
For clock frequency setting, enter a new value by writing a string that contains “s/m index clock” to the file. The index should be 0 if to set minimum clock. And 1 if to set maximum clock. E.g., “s 0 500” will update minimum sclk to be 500 MHz. “m 1 800” will update maximum mclk to be 800Mhz. For core clocks on VanGogh, the string contains “p core index clock”. E.g., “p 2 0 800” would set the minimum core clock on core 2 to 800Mhz.
When you have edited all of the states as needed, write “c” (commit) to the file to commit your changes.
echo 's 1 2365' > /sys/class/drm/card0/device/pp_od_clk_voltage
echo 'm 1 1000' > /sys/class/drm/card0/device/pp_od_clk_voltage
echo 'c' > /sys/class/drm/card0/device/pp_od_clk_voltage
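The "s/m index clock" format from the kernel documentation is easy to get wrong, so here is a small sketch that builds the strings from readable arguments and commits them afterwards. `od_command`, `od_apply` and the `OD_FILE` variable are hypothetical names used for illustration only; by default the file is the `card0` path used throughout this page.

```shell
#!/bin/sh
# Build an overdrive command string.
# Usage: od_command s|m min|max MHZ
od_command() {
    case $1 in
        s|m) ;;
        *) echo "unknown clock type: $1" >&2; return 1 ;;
    esac
    case $2 in
        min) index=0 ;;
        max) index=1 ;;
        *) echo "unknown limit: $2" >&2; return 1 ;;
    esac
    echo "$1 $index $3"
}

# Write each command to the overdrive file, then commit with "c".
# OD_FILE (an assumption, not a kernel feature) overrides the target file.
od_apply() {
    file=${OD_FILE:-/sys/class/drm/card0/device/pp_od_clk_voltage}
    for cmd in "$@"; do
        echo "$cmd" > "$file" || return 1
    done
    echo 'c' > "$file"
}
```

As root, `od_apply "$(od_command s max 2365)" "$(od_command m max 1000)"` would then write `s 1 2365` and `m 1 1000` followed by the commit.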
Persisting the settings
I persisted the settings by creating an OpenRC service called 'overclock' and adding it to the default runlevel.
/etc/init.d/overclock
#!/sbin/openrc-run
# Copyright 2023 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2

name="overclock daemon"
description="Overclock / undervolt the AMD GPU"
command=/usr/bin/overclock
command_args="${overclock_args}"

depend() {
    need dev-mount
}

# Restore the default clock settings when the service is stopped.
stop() {
    ebegin "Resetting GPU clock settings"
    /usr/bin/overclock_stop
    eend $?
}
/usr/bin/overclock
#!/bin/sh
# Overdrive interface of the first GPU (card0).
od=/sys/class/drm/card0/device/pp_od_clk_voltage
echo 's 1 2365' > "$od" # maximum sclk: 2365 MHz
echo 'm 1 1000' > "$od" # maximum mclk: 1000 MHz
echo 'vo -50' > "$od"   # this one undervolts by 50mV
echo 'c' > "$od"        # commit the changes
/usr/bin/overclock_stop
#!/bin/sh
echo 'r' > /sys/class/drm/card0/device/pp_od_clk_voltage
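With the three files in place, the scripts still have to be made executable and the service added to the default runlevel. Roughly, and as a sketch only: `run`, `DRY_RUN` and `enable_overclock_service` are hypothetical helpers so the commands can be previewed before running them as root, while `rc-update` and `rc-service` are the usual OpenRC tools.

```shell
#!/bin/sh
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Make the scripts executable, enable the service and start it once.
enable_overclock_service() {
    run chmod +x /etc/init.d/overclock /usr/bin/overclock /usr/bin/overclock_stop &&
    run rc-update add overclock default &&
    run rc-service overclock start
}
```

`DRY_RUN=1 enable_overclock_service` only prints the three commands; calling `enable_overclock_service` as root runs them.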