IO Error when copying files over NFS

by Tsvi Mostovicz - Thu 27 December 2018
Tags #IT
Reading time: 4 minutes, 29 seconds

The original issue

Jenkins started giving Input/output errors when issuing cp commands during a specific job. The job's workspace is on a local disk; the cp command copies the results to a directory mounted from our NFS server.

First try - S.M.A.R.T.

Our first try was to look for hard disk errors by checking the S.M.A.R.T. data. On our NFS server, I ran:

smartctl -x -d megaraid,0 /dev/bus/0
smartctl -x -d megaraid,1 /dev/bus/0
smartctl -x -d megaraid,2 /dev/bus/0
smartctl -x -d megaraid,3 /dev/bus/0
smartctl -x -d megaraid,4 /dev/bus/0
smartctl -x -d megaraid,5 /dev/bus/0

The results looked OK. Disk 5 showed some signs of fatigue, but still looked healthy.

Next, we checked the hard disks on the two client servers that are used as Jenkins agents.

S.M.A.R.T. came back clean there as well (though on one of the servers, S.M.A.R.T. turned out to be disabled).
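For the record, checking a plain disk without a RAID controller is simpler. A minimal sketch, assuming the disk appears as /dev/sda (the device name is hypothetical; check lsblk for the real one):

# Full S.M.A.R.T. report for a directly attached disk
smartctl -x /dev/sda
# If S.M.A.R.T. turns out to be disabled, it can be switched on with:
smartctl -s on /dev/sda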

Second try - badblocks

Running badblocks on the various servers showed everything to be OK:

badblocks -b 4096 -vs /dev/mapper/rhel-home

More issues and escalating to higher levels

At this point I got a report from a co-worker who had run into a similar issue:

ncvlog: *W,DLSYNC: Library 'worklib' did not sync to disk (Input/output error).
    Total errors/warnings found outside modules and primitives:
            errors: 0, warnings: 543
ncvlog: *F,INTERR: INTERNAL EXCEPTION
-----------------------------------------------------------------
The tool has encountered an unexpected condition and must exit.
Contact Cadence Design Systems customer support about this
problem and provide enough information to help us reproduce it,
including the logfile that contains this error message.
    TOOL: ncvlog(64)      15.20-s053
    HOSTNAME: xxxxxxxxxxxxxxxx
    OPERATING SYSTEM: Linux 2.6.32-754.6.3.el6.x86_64 #1 SMP Tue Sep 18 10:29:08 EDT 2018 x86_64
    MESSAGE: xdlib_expand() - invalid mapping
-----------------------------------------------------------------
We decided to discuss the issue with IT to see whether they could:
  1. provide other possible solutions (e.g. swapping disks in the RAID controller)
  2. run fsck at boot during a night run (a sketch of how that is typically forced follows below)
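For reference, the classic sysvinit-era way to force a full fsck on the next boot is to drop a flag file in the root directory. A sketch, assuming an EL6-style init that honors /forcefsck:

# Request a filesystem check on the next boot; the init scripts
# remove the flag file once the check has run
touch /forcefsck
reboot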

Creating a test case

At this point my boss asked me to investigate whether I could reproduce the issue manually. As the issue seemed to stem from big files, I decided to create one such file:

# Create a ~2 GiB file of random data (2048 blocks of 1 MiB)
dd if=/dev/urandom of=random.bin bs=1048576 count=2048
# Copy it to the NFS-mounted home directory and remove it, 12 times
for x in {1..12}; do
    echo $x;
    cp -v random.bin ~tsvi_m/regression_runs/;
    rm -v ~tsvi_m/regression_runs/random.bin;
done

That went OK. I then changed count to 8192 to create an 8 GB file of gibberish. Eureka!

Every cp and rm command failed.
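For reference, the failing run used the same loop with only the dd line changed:

# ~8 GiB of random data (8192 blocks of 1 MiB)
dd if=/dev/urandom of=random.bin bs=1048576 count=8192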

Looking up the solution

Taking a look at /var/log/messages during the cp command, I started seeing a lot of messages like the following:

kernel: nfs: server <servername> not responding, timed out
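One way to catch these while re-running the test is to follow the log from a second terminal. A minimal sketch:

# Follow the system log and show only NFS-related lines
tail -f /var/log/messages | grep -i nfs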

Searching Google led me to an article describing the solution. Reading through it, I found a mention of the timeo and retrans mount options.

I vaguely remembered that when configuring the NFS mounts, I had copied some values from a website without fully understanding them.

Looking at the output of the mount command on one of the NFS clients, I saw that all the mounts were mounted with a timeo value of 14. Since timeo is given in tenths of a second, that is only 1.4 seconds before a timeout is declared.
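Besides mount, nfsstat -m also prints each NFS mount point together with its effective options. A quick check, assuming nfs-utils is installed:

# List all NFS mounts with their negotiated options (timeo, retrans, rsize, wsize)
nfsstat -m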

I found the values I had configured under /etc/auto.home. During my research I discovered that wsize and rsize would be negotiated automatically. I therefore decided to rewrite the configuration from:

*   -fstype=nfs,soft,intr,rsize=32768,wsize=32768,timeo=14,nosuid,tcp nfsserver:/home/&

to:

*   -fstype=nfs,soft,intr,nosuid,tcp nfsserver:/home/&

The resulting values can then be verified by reading /proc/mounts.

Testing the new configuration

To reload the values, we need to restart the autofs service while none of the user's files are in use. I disconnected all of my logins on one of the clients and restarted autofs using service autofs restart.
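To double-check that nothing is still holding a mount before such a restart, fuser can help. A sketch for my own home directory (adjust the path as needed):

# List any processes still using the mounted filesystem; no output means it is free
fuser -vm /home/tsvi_m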

After logging back in, I checked /proc/mounts:

nfsserver:/home/tsvi_m /home/tsvi_m nfs4 rw,nosuid,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,local_lock=none,addr=x.x.x.x 0 0

We got the default timeo value of 600 (i.e. 60 seconds), and as an added bonus, read and write buffers 32 times their original size.

Re-running our test case, I could now see that the 8 GB files copied OK.

Implementing the change

After notifying the users, I started shutting down processes owned by logged-in users using the following command:

pkill -u <user_name>
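Since pkill kills everything that matches, it can be worth previewing the targets first with its sibling pgrep:

# List the PIDs and names of the processes that pkill -u would hit
pgrep -lu <user_name>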

After kicking most users off the various machines, I still needed to manually umount /home/<user> on one of them. At first I tried to find which process was still running using lsof +D /home/<user>, but some of the directories returned permission denied even though I was running as root (likely due to NFS root squashing).

After kicking all users off, I ran service autofs restart and sent an email notifying the users of the successful reconfiguration.

Checking /proc/mounts showed users logging in with the new settings.
