Machine learning rig
Posted on February 28, 2024 • 6 minutes • 1082 words
Building a machine learning rig
These are my notes from building a machine learning rig.
After a bunch of research, I ended up with the following specs:
- CPU: Threadripper PRO 3995WX
- Mainboard: Supermicro M12SWA-TF
- Cooler: Enermax TR4 500W
- GPU: 2x NVIDIA 3090 24GB Founders Edition
- RAM: 256GB (8x 32GB sticks) ECC DDR4 3200MHz
- Storage: 2TB SSD Samsung 970 EVO Plus
- PSU: Super Flower 1200W 80 Plus
- Case: LIAN-LI Lancool 216 Mesh Black
- OS: Arch Linux
CPU
You need one with plenty of PCIe lanes, which you usually find on server-class CPUs (these also tend to support lots of memory). Beyond that, CPU specs don't matter much for an ML rig. I could have gone with a 1st-gen Threadripper and it would probably have been fine.
GPU
What you want is lots of VRAM. I figured that with 2x 3090 (48GB total), I would be able to run a 70B model with 4-bit quantization. Maybe I can add another card later, which should be just about enough for a 70B model with 8-bit quantization.
General rule of thumb:
Take the parameter count in billions and multiply it by 2. This tells you roughly how many GB the full-size model requires.
The parameter count in billions is roughly the size in GB of the 8-bit quant.
The parameter count in billions divided by two is roughly the size in GB of the 4-bit quant.
So a 70B model will have the following sizes:
Full = 140 GB
8-bit = 70 GB
4-bit = 35 GB
Source: Reddit
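The rule of thumb above boils down to "parameters times bytes per weight". Here is a minimal sketch of that arithmetic, assuming the "full" model uses 16-bit weights (which is where the multiply-by-2 comes from). The function name is made up for illustration, and the estimate covers weights only, not the KV cache or other runtime overhead.

```python
def estimate_model_size_gb(params_billions: float, bits_per_weight: int = 16) -> float:
    """Rough weight-only size in GB: billions of parameters times bytes per weight."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    vram_gb = 2 * 24  # two RTX 3090 cards, 24 GB each
    for label, bits in [("full (16-bit)", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = estimate_model_size_gb(70, bits)
        verdict = "fits" if size <= vram_gb else "does not fit"
        print(f"70B {label}: ~{size:.0f} GB, {verdict} in {vram_gb} GB of VRAM")
```

In practice you want some headroom on top of these numbers, since context length and the inference framework add their own memory use.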
PSU
I went with a 1200W PSU. While others say that's fine for 2x 3090, I also went ahead and capped each 3090 at a 280W power limit. Make sure you have persistence mode turned on.
The performance drop is minor and ignorable, while thermals and power consumption decrease greatly.
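To get a feel for the headroom, here is a back-of-the-envelope power budget. The numbers are assumptions, not measurements from this rig: 350W is the stock power limit of a 3090 Founders Edition, 280W is the rated TDP of the 3995WX, and the 100W for everything else (drives, fans, mainboard) is a guess. It also ignores short transient spikes, which can exceed the configured limit.

```python
def estimated_peak_draw_w(gpu_limit_w: int, gpu_count: int,
                          cpu_tdp_w: int, other_w: int = 100) -> int:
    """Sum of GPU power limits, CPU TDP and a guessed allowance for the rest."""
    return gpu_limit_w * gpu_count + cpu_tdp_w + other_w


if __name__ == "__main__":
    psu_w = 1200
    for label, limit in [("stock 350W limit", 350), ("280W limit", 280)]:
        draw = estimated_peak_draw_w(limit, gpu_count=2, cpu_tdp_w=280)
        print(f"{label}: ~{draw} W estimated, {psu_w - draw} W headroom")
```

Under these assumptions the 280W cap roughly doubles the headroom on a 1200W unit, which is the main reason to bother with it.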
- First, update nvidia-persistenced.service to use persistence-mode.
- Create a new systemd service to set the power limit on startup. Make sure it starts after nvidia-persistenced.service.
- Verify it's working after boot. The output of the nvidia-smi command should look like this. Notice that Persistence-M is On and the power limit is 280W.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 On | N/A |
| 0% 34C P8 22W / 280W | 664MiB / 24576MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:21:00.0 Off | N/A |
| 0% 26C P8 16W / 280W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3485 G /usr/lib/xorg/Xorg 225MiB |
| 0 N/A N/A 3619 G /usr/bin/gnome-shell 216MiB |
| 0 N/A N/A 5169 G firefox 207MiB |
| 1 N/A N/A 3485 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
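If you would rather check this from a script than eyeball the table, nvidia-smi can print the same fields as CSV via --query-gpu. Below is a small sketch that parses that output. The helper names are made up for illustration, and the sample line format ("0, Enabled, 280.00") is what I would expect from the csv,noheader,nounits format, so treat it as an assumption and adjust if your driver prints something different.

```python
import subprocess

# Query only the fields we care about, one CSV line per GPU.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,persistence_mode,power.limit",
    "--format=csv,noheader,nounits",
]


def parse_gpu_status(csv_text: str) -> list[dict]:
    """Parse lines like '0, Enabled, 280.00' into a list of dicts."""
    status = []
    for line in csv_text.strip().splitlines():
        index, persistence, limit = [field.strip() for field in line.split(",")]
        status.append({
            "index": int(index),
            "persistence": persistence == "Enabled",
            "power_limit_w": float(limit),
        })
    return status


def all_gpus_ok(status: list[dict], expected_limit_w: float = 280.0) -> bool:
    """True if every GPU has persistence mode on and the expected power limit."""
    return all(
        gpu["persistence"] and abs(gpu["power_limit_w"] - expected_limit_w) < 0.5
        for gpu in status
    )


def read_gpu_status() -> list[dict]:
    """Run nvidia-smi on this machine and parse its output."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_gpu_status(result.stdout)
```

On the rig itself, all_gpus_ok(read_gpu_status()) should come back True once both services have run after boot.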
Case
Any ATX case would do. I went with the LIAN-LI Lancool 216. Most ATX desktop cases should be able to house 2x 3090 cards. If you want to use 3 cards, you will have to use smaller cards or a bigger case.
OS
I went with Pop!_OS at first. It's pretty much Ubuntu but better. Pop!_OS offers an NVIDIA variant with some NVIDIA-related stuff baked in.
But in the end, life without a tiling window manager is insufferable. I went back to Arch Linux.
Luckily, everything still works great out of the box.
Installation
NVIDIA driver
The Pop!_OS NVIDIA variant comes with NVIDIA driver v545 by default (as of this post). It seems to work well for me. However, if it doesn't work for you, maybe try downgrading to an older version.
Some people reported that v530 or v535 works better for them.
NVIDIA Container Toolkit
Don't use the guide found on Pop!_OS's website. Instead, follow the official guide on NVIDIA's website.
Simple benchmark with hashcat
sudo apt update && sudo apt install hashcat
# hashcat -b -m <hash_type>
# -b: benchmark mode
# -m 0: md5 hash type
hashcat -b -m 0
Expected output
CUDA API (CUDA 12.3)
====================
* Device #1: NVIDIA GeForce RTX 3090, 23307/24258 MB, 82MCU
* Device #2: NVIDIA GeForce RTX 3090, 23987/24259 MB, 82MCU
OpenCL API (OpenCL 3.0 CUDA 12.3.99) - Platform #1 [NVIDIA Corporation]
=======================================================================
* Device #3: NVIDIA GeForce RTX 3090, skipped
* Device #4: NVIDIA GeForce RTX 3090, skipped
Benchmark relevant options:
===========================
* --optimized-kernel-enable
-------------------
* Hash-Mode 0 (MD5)
-------------------
Speed.#1.........: 70679.3 MH/s (38.63ms) @ Accel:128 Loops:1024 Thr:256 Vec:8
Speed.#2.........: 70505.2 MH/s (38.63ms) @ Accel:128 Loops:1024 Thr:256 Vec:8
Speed.#*.........: 141.2 GH/s
TensorFlow
Docker for everything :)
This is the command I use to launch a new container with TensorFlow:
docker run -v $PWD:/workspace -w /workspace \
-p 8888:8888 \
--runtime=nvidia -it --rm --user root \
tensorflow/tensorflow:2.15.0-gpu-jupyter \
bash
I created a simple script named hello-world.py to test it, with the following content:
#!/usr/bin/python3
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
tf.print(hello)
tf.print('Using TensorFlow version: ' + tf.__version__)
# Run a small matrix multiplication on the first GPU
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
tf.print(c)
You can also launch a Jupyter notebook in there and access it from the host machine via 127.0.0.1:8888.
jupyter notebook --ip 0.0.0.0 --no-browser --allow-root
SysFS had negative value (-1) error
Find the GPU PCI IDs with:
ls /sys/module/nvidia/drivers/pci:nvidia/
# 0000:01:00.0 0000:21:00.0 bind module new_id remove_id uevent unbind
And then, as root, set numa_node to 0 for each GPU:
echo 0 | tee /sys/module/nvidia/drivers/pci:nvidia/0000:01:00.0/numa_node
echo 0 | tee /sys/module/nvidia/drivers/pci:nvidia/0000:21:00.0/numa_node
ref: StackOverflow
Or you can run this script to fix it.
#!/usr/bin/env bash
# Set numa_node to 0 for every NVIDIA GPU that currently reports -1.
if [[ "$EUID" -ne 0 ]]; then
    echo "Please run as root."
    exit 1
fi
PCI_ID=$(lspci | grep "VGA compatible controller: NVIDIA Corporation" | cut -d' ' -f1)
#PCI_ID="0000:$PCI_ID"
for item in $PCI_ID
do
    item="0000:$item"
    FILE=/sys/bus/pci/devices/$item/numa_node
    echo "Checking $FILE for NUMA connection status..."
    if [[ -f "$FILE" ]]; then
        CURRENT_VAL=$(cat "$FILE")
        if [[ "$CURRENT_VAL" -eq -1 ]]; then
            echo "Setting connection value from -1 to 0."
            echo 0 > "$FILE"
        else
            echo "Current connection value of $CURRENT_VAL is not -1."
        fi
    else
        echo "$FILE does not exist to update."
    fi
done
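Before (or after) running the fix, it can be handy to see which cards are actually affected without touching anything. Here is a read-only sketch that walks the same sysfs directory and reports NVIDIA devices whose numa_node is -1. The function name is made up for illustration. It identifies NVIDIA devices by the PCI vendor ID 0x10de in each device's vendor file, and the sysfs root is a parameter so it can be pointed at a test directory.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def nvidia_devices_without_numa_node(sysfs_root: str = "/sys/bus/pci/devices") -> list[str]:
    """Return the PCI IDs of NVIDIA devices whose numa_node file reads -1."""
    affected = []
    for device in sorted(Path(sysfs_root).iterdir()):
        vendor_file = device / "vendor"
        numa_file = device / "numa_node"
        if not (vendor_file.is_file() and numa_file.is_file()):
            continue
        if vendor_file.read_text().strip().lower() != NVIDIA_VENDOR_ID:
            continue
        if int(numa_file.read_text().strip()) == -1:
            affected.append(device.name)
    return affected


if __name__ == "__main__":
    root = Path("/sys/bus/pci/devices")
    if root.is_dir():
        for pci_id in nvidia_devices_without_numa_node(str(root)):
            print(f"{pci_id}: numa_node is -1")
```

Note that this may also list a card's audio function (for example 0000:01:00.1), since it shares the NVIDIA vendor ID. An empty result after running the fix means the value stuck, at least until the next reboot.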
Misc
NVLink
Apparently, you need this to better utilize multiple GPU cards. The question remains whether it's worth it or not. Some say that the performance gain is not worth it.