CUDA

  1. Cleaning up leftover files after a failed install or before reinstalling

a) If the .deb install failed
sudo apt-get --purge remove 'nvidia*'

b) If the .run install failed

sudo /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl
sudo /usr/bin/nvidia-uninstall

c) If the install still fails after cleaning up with a) or b)

sudo apt-get autoremove --purge 'nvidia-*'
# Purge every remaining NVIDIA driver package
sudo reboot
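Before purging, it can help to see which NVIDIA packages are actually installed. This is only a sketch, assuming a Debian/Ubuntu system where `dpkg -l` is available; the function name `installed_nvidia_packages` is made up for this example.

```shell
# Print the names of fully installed packages ("ii" rows) whose name
# starts with "nvidia". Expects `dpkg -l` output on stdin.
# The function name is hypothetical, not part of any NVIDIA tool.
installed_nvidia_packages() {
  awk '$1 == "ii" && $2 ~ /^nvidia/ { print $2 }'
}

# Typical use on a real system:
#   dpkg -l | installed_nvidia_packages
```

If this prints nothing after the purge and reboot, no NVIDIA driver packages from apt remain.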

1. PRE-INSTALLATION ACTIONS

1.1 Verify you have a CUDA-Capable GPU
lspci | grep -i nvidia

The output looks like:

01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b81 (rev a1)

01:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)
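The same check can be scripted. The sketch below counts NVIDIA display controllers in `lspci` output and ignores the audio function that shares the card. The function name `count_nvidia_gpus` is an assumption for illustration.

```shell
# Count NVIDIA display controllers in `lspci` output read from stdin.
# Audio functions on the same card are not counted.
# The function name is hypothetical.
count_nvidia_gpus() {
  grep -i nvidia | grep -c -E 'VGA compatible controller|3D controller'
}

# Typical use on a real system:
#   lspci | count_nvidia_gpus
```

A count of 0 means no NVIDIA GPU was detected, so there is no point continuing with the install.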

1.2 Verify you have a Supported Version of Linux
uname -m && cat /etc/*release

The output looks like:

x86_64

DISTRIB_ID=Ubuntu

DISTRIB_RELEASE=14.04

DISTRIB_CODENAME=trusty

DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"

.....

1.3 Verify the System Has GCC Installed
gcc --version

The output looks like:

gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4

1.4 Verify the System has the Correct Kernel Headers and Development Packages Installed
uname -r

The output looks like:

4.4.0-66-generic

Install the kernel headers and development packages that match the running kernel:

sudo apt-get install linux-headers-$(uname -r)

1.5 Download the NVIDIA CUDA Toolkit

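Download the latest .run (runfile) version.

After the download finishes, verify the MD5 checksum. If it does not match the published value, download the file again.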

md5sum cuda_8.0.27_linux.run
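Comparing checksums by eye is error-prone, so here is a minimal sketch that does the comparison for you. The function name `verify_md5` is made up, and the expected value must be copied from the download page; none is given here.

```shell
# Compare a file's MD5 checksum with an expected value.
# $1 = file, $2 = expected checksum (copy it from the download page).
# The function name is hypothetical.
verify_md5() {
  file=$1
  expected=$2
  actual=$(md5sum "$file" | awk '{ print $1 }')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got $actual)" >&2
    return 1
  fi
}

# Typical use:
#   verify_md5 cuda_8.0.27_linux.run <checksum from the download page>
```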

2. RUNFILE INSTALLATION
2.1 Disabling Nouveau
lsmod | grep nouveau

If this prints anything, Nouveau is loaded and must be disabled.

sudo gedit /etc/modprobe.d/blacklist-nouveau.conf

Add the following lines:

blacklist nouveau
options nouveau modeset=0

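Then run

sudo update-initramfs -u

Check whether Nouveau is disabled (the blacklist only takes effect after a reboot, so if this still prints output, run it again after the reboot)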

lsmod | grep nouveau

Reboot the computer.
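If you want a yes/no answer instead of reading the `lsmod` output, something like the following works. It is a sketch, and the function name `nouveau_loaded` is an assumption.

```shell
# Succeed (exit 0) if the nouveau module appears in `lsmod` output on stdin.
# Only the module-name column is checked, so modules that merely list
# nouveau under "Used by" do not count. The function name is hypothetical.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Typical use after the reboot:
#   if lsmod | nouveau_loaded; then echo "nouveau is still loaded"; fi
```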

2.2 Reboot Into Text Mode

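After the reboot, do not log in to the desktop at the login screen (the install may fail if you do; if you log in by accident, reboot again). Instead, press

Ctrl+Alt+F1 to switch to text mode (a command-line console) and log in there.

Stop the graphical display manager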

sudo service lightdm stop

Change to the directory that contains cuda_8.0.27_linux.run and run

sudo sh cuda_8.0.27_linux.run

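!Note: The installer first shows a long block of text (the EULA). Keep pressing the space bar until it reaches 100%.

When asked whether to install OpenGL, choose no. For all other prompts you can answer accept, yes, or press Enter.

When the install succeeds the summary shows Installed; otherwise it shows Failed.

Restart the graphical display manager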

sudo service lightdm start

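If you can log in to the desktop without being sent back to the login screen over and over, most of the work is done.

!Note: If you get stuck in a login loop, uninstall CUDA and install it again.

Cause: an OpenGL conflict with the NVIDIA driver.

To uninstall: at the login screen, press Ctrl+Alt+F1 to enter the text console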

sudo /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl
sudo /usr/bin/nvidia-uninstall
sudo reboot

2.3 Device Node Verification

ls /dev/nvidia*

You may get one of three results, a), b), or c). Find the one that matches yours. Brace yourself!

a) If the output is

/dev/nvidia0  /dev/nvidiactl  /dev/nvidia-uvm

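or something similar with three entries (including one like /dev/nvidia-uvm), the installation succeeded.

b) If you are a bit unlucky, the output looks like this

ls: cannot access /dev/nvidia*: No such file or directory

or like this, showing only

/dev/nvidia0  /dev/nvidiactl

Save the following script under any name you like, for example Nka.sh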

#!/bin/bash
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255

else
  exit 1
fi

/sbin/modprobe nvidia-uvm

if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`

  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi

Then run

sudo chmod +x Nka.sh
sudo ./Nka.sh
ls /dev/nvidia*

Add it to the startup script so the device nodes are created automatically at boot

sudo gedit /etc/rc.local

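If this is the first time you open this file, it should be empty apart from lines of # comments. The first line of the file is

#!/bin/sh -e

Remove the -e from that line. This step is important: with -e the shell stops at the first command that fails, so the lines you add may never run.

Then copy the contents of Nka.sh, except the #!/bin/bash line, into the file (before exit 0).

Save and exit.

After the next reboot, you should see the three nvidia files under /dev right away.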

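c) If your luck is really bad (this has happened to me a few times...), the output looks like this

modprobe: ERROR: could not insert 'nvidia_uvm': Operation not permitted

When this happens, the installed drivers are probably conflicting with each other.

sudo apt-get autoremove --purge 'nvidia-*'
# Purge every remaining NVIDIA driver package
sudo reboot
# Be sure to reboot, or you will regret it!

Then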

sudo ./Nka.sh
ls /dev/nvidia*

You should now see

/dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm
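To tell case a) from case b) without squinting at the `ls` output, you can check for the three nodes explicitly. This is a minimal sketch; the function name `missing_nvidia_nodes` is made up, and it assumes a single GPU (only /dev/nvidia0).

```shell
# Report which of the three expected device nodes are missing.
# $1 = directory to check (defaults to /dev). Assumes a single GPU.
# Returns 0 when all nodes exist, 1 otherwise.
# The function name is hypothetical.
missing_nvidia_nodes() {
  dev_dir=${1:-/dev}
  missing=0
  for node in nvidia0 nvidiactl nvidia-uvm; do
    if [ ! -e "$dev_dir/$node" ]; then
      echo "missing: $dev_dir/$node"
      missing=1
    fi
  done
  return "$missing"
}

# Typical use:
#   missing_nvidia_nodes || echo "run Nka.sh to create the nodes"
```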

3. POST-INSTALLATION ACTIONS

3.1 Environment Setup

sudo gedit /etc/profile

Add the following at the end of the file

export PATH=/usr/local/cuda-8.0/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
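One drawback of the plain exports is that sourcing the file twice adds the same directories twice, and an unset LD_LIBRARY_PATH leaves a trailing colon. A possible alternative is sketched below. The helper `prepend_once` and the variable `CUDA_HOME` are names chosen for this example, not something the CUDA installer sets up.

```shell
# Prepend a directory to a colon-separated list unless it is already there.
# $1 = directory, $2 = current list (may be empty).
# The helper name is hypothetical.
prepend_once() {
  case ":$2:" in
    *":$1:"*) printf '%s\n' "$2" ;;
    *) printf '%s\n' "$1${2:+:$2}" ;;
  esac
}

# CUDA_HOME is an assumed convenience variable for this sketch.
CUDA_HOME=/usr/local/cuda-8.0
PATH=$(prepend_once "$CUDA_HOME/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_once "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

Sourcing this more than once leaves both variables unchanged after the first time.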

If starting TensorFlow fails with

Couldn't open CUDA library libcuda.so.
# Find the library manually and add its directory to LD_LIBRARY_PATH
ldconfig -p | grep libcuda
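The search above prints full library paths, while LD_LIBRARY_PATH needs the directory. A small sketch that extracts it follows. The function name `libcuda_dirs` is an assumption, and it relies on the usual `name (flags) => path` layout of `ldconfig -p` output.

```shell
# Print the directories that contain libcuda.so*, given `ldconfig -p`
# output on stdin. Matches libcuda.so only, not libcudart or libcudnn.
# The function name is hypothetical.
libcuda_dirs() {
  awk '$1 ~ /^libcuda\.so/ && $(NF-1) == "=>" { print $NF }' |
    while IFS= read -r lib; do dirname "$lib"; done | sort -u
}

# Typical use:
#   ldconfig -p | libcuda_dirs
```

Add the printed directory to the LD_LIBRARY_PATH line in /etc/profile.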

Finally

source /etc/profile

3.2 Verify the Installation

3.2.1 Verify the Driver Version

nvidia-smi

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module 361.77 Sun Jul 17 21:18:18 PDT 2016

GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
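If a script needs the driver version, it can be pulled out of that file. This sketch assumes the NVRM line has the layout shown above, with the version right after the word "Module". The function name `nvidia_driver_version` is made up.

```shell
# Extract the driver version from the NVRM line of
# /proc/driver/nvidia/version (read from stdin).
# Assumes the version is the field right after "Module".
# The function name is hypothetical.
nvidia_driver_version() {
  awk '/^NVRM version:/ {
    for (i = 1; i < NF; i++) if ($i == "Module") { print $(i + 1); exit }
  }'
}

# Typical use:
#   nvidia_driver_version < /proc/driver/nvidia/version
```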

3.2.2 Verify CUDA Toolkit
nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2016 NVIDIA Corporation

Built on Tue_Jan_10_13:22:03_CST_2017

Cuda compilation tools, release 8.0, V8.0.61
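To confirm from a script that the toolkit on PATH is the release you installed, you can parse the last line of that output. This is a sketch under the assumption that the line keeps the "release X.Y," form shown above. The function name `nvcc_release` is hypothetical.

```shell
# Print the toolkit release (for example 8.0) from `nvcc -V` output on stdin.
# The function name is hypothetical.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Typical use:
#   [ "$(nvcc -V | nvcc_release)" = "8.0" ] && echo "toolkit 8.0 is on PATH"
```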

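!Note: If you see this instead:
The program 'nvcc' is currently not installed. You can install it by typing:
sudo apt-get install nvidia-cuda-toolkit

Don't panic. Check that the environment settings in /etc/profile are correct.

Even if nothing is wrong there, you may have skipped the step below, or you ran it a while ago and have not rebooted since. source /etc/profile only affects the current shell session; the settings apply to every session only after a reboot.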

source /etc/profile

Then run this again (it should now print the version)

nvcc -V

3.2.3 Compile the Samples

cd into the NVIDIA_CUDA-8.0_Samples directory and run

make

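When the build finishes, the binaries are placed in the bin directory under NVIDIA_CUDA-8.0_Samples.

3.2.4 Running the Binaries

cd down through the nested subdirectories of bin until you see a set of executables (diamond-shaped icons), roughly at ~/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release

Run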

./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 720M"
CUDA Driver Version / Runtime Version          
CUDA Capability Major/Minor version number:    2.1
Total amount of global memory:                 1985 MBytes (2081226752 bytes)
( 2) Multiprocessors, ( 48) CUDA Cores/MP:     96 CUDA Cores
GPU Max Clock rate:                            1250 MHz (1.25 GHz)
Memory Clock rate:                             800 Mhz
Memory Bus Width:                              64-bit
L2 Cache Size:                                 131072 bytes
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Disabled
Device supports Unified Addressing (UVA):      Yes
Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
Compute Mode:< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 720M

Result = PASS

or something similar, ending with Result = PASS. On failure it ends with Result = FAIL.

Run

./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GT 720M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            3220.9

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            3271.9

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            9772.8

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
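Both samples end with a Result line, so the pass/fail check is easy to automate. This is a minimal sketch; the function name `sample_passed` is made up.

```shell
# Succeed (exit 0) if a CUDA sample's output on stdin contains
# a line starting with "Result = PASS".
# The function name is hypothetical.
sample_passed() {
  grep '^Result = PASS' > /dev/null
}

# Typical use from the release directory:
#   ./deviceQuery | sample_passed && ./bandwidthTest | sample_passed \
#     && echo "both samples passed"
```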

CUDA is now installed successfully.

4. cuDNN for CUDA 8.0

tar -zxvf cudnn-8.0-linux-x64-v*.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/cudnn.h /usr/local/cuda/include/
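To see which cuDNN version ended up installed, you can read the version macros from the copied header. This sketch assumes cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL, as the cuDNN 5.x headers do; newer releases may keep them elsewhere. The function name `cudnn_version` is hypothetical.

```shell
# Print the cuDNN version (major.minor.patch) from a cudnn.h header.
# $1 = path to the header (defaults to the install location used above).
# Assumes the header defines CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL.
# The function name is hypothetical.
cudnn_version() {
  header=${1:-/usr/local/cuda/include/cudnn.h}
  awk '
    $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
    $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
    $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
    END { if (major != "") print major "." minor "." patch; else exit 1 }
  ' "$header"
}

# Typical use:
#   cudnn_version
```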

(Usually not needed) Update the symbolic links:

cd /usr/local/cuda/lib64

sudo rm -rf libcudnn.so libcudnn.so.5          # remove the existing library links
sudo ln -s libcudnn.so.5.1.5 libcudnn.so.5 
sudo ln -s libcudnn.so.5 libcudnn.so

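If you previously installed a different cuDNN version and want to replace it with a new one:

The fix is to delete the old version and then install the new cuDNN.

Delete the old one

cd /usr/local/cuda/lib64
sudo rm libcudnn*

Then install the new one as described above.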
