
Monthly Archive: February 2005



Dcache & Work admin on 25 Feb 2005

Started the install of the first service challenge node. It was slow going: the network card I’m installing via is detected as eth1, which requires subtle changes to various config options in our kickstart setup. Once the node was installed it needed further changes to get things working as expected, because despite installing via eth1 the installer puts the network settings on eth0 and makes it the gateway device.

Dcache & Work admin on 23 Feb 2005

23/02/2005

Realised the inability to delete files was a simple permissions error - caused not by the reboot but by the data clearout on the 17th. Asked the Storage team to look at aggressive cleaning policies for a pool so that we can try to get CMS to use srm-advisory-delete rather than the hacky methods they currently use.
Finally managed to get a little work done on the service challenge - deciding which boxes to use for what.

22/02/2005

lcgui01 hung with nfs state D, so took the opportunity to deploy the postgres server and upgrade the kernel on the dcache head node.
CMS reported an inability to delete files; unable to replicate at my end.

21/02/2005

Over the weekend, vacuuming the postgres database on dcache.gridpp freed up some more space, allowing us to continue transfers until a log filled up somewhere on Saturday.
Deployed csfnfs51 and 4 dcache pools for CMS, CMS restarted transfers

18/02/2005

postgres data partition on dcache.gridpp.rl.ac.uk filled up, did yum clean to free some space then coordinated with MJB to get another box to use as postgres server. Decided upon CMS-dcache as it was currently unused. CMS doing transfers and seeing good rates so held off adding it for now
Deployed dcache pools on csfnfs39 for CMS

17/02/2005

Wiped CMS data on dcache after they reported that they’d either have to delete it all or move it all
Dumped info on the Storage group about how to set up an hsm-backed dcache pool
Deployed 2 dcache pools on csfnfs42 for CMS to get round the nfs state D problem

Dcache & Work admin on 16 Feb 2005

16/02/05

CMS worried about the deployment of dcache and about our approach/activities; meeting with NGHW, Dave Newbold and RAS to reassure them. Moved csfnfs42 outside the firewall and ran into the problem seen when initially deploying on csfnfs42: certificates with mismatched names. Worked out the cause this time though - nscd; a service nscd restart gets things working again. Began deploying dcache onto csfnfs39, taking it carefully - we don’t know which of the things we did to nfs42 are important, there is user data on the disk as well, and it is tricky to coordinate the DNS change, pool disable and everything.

15/02/05

Day off

14/02/05

Rebooted dcp352 and dcp0337 to use a non-smp kernel and asked Tim to test - dcp352 was getting nfs problems within 20 minutes, so that failed to fix the state D issue.

Dcache & Work admin on 13 Feb 2005

11/02/05

Storage group did a stress test of the dcache hsm interface. Talked to Tier 2 coordinators about deploying dcache at one site in each Tier 2 - will probably get the storage group to go on site; the coordinators want the requirements for a dcache system. Asked Michael Ernst about the pool gridftp issue; the response was to use push mode. Downloaded the latest client and installed it on lcgui02, but didn’t manage to get it working. Discussions with Martin about trying different kernels to get rid of the nfs problem.

10/02/05

Working with storage group to improve hsm interface script; Toast meeting.

Dcache & Work admin on 09 Feb 2005

Getting a different behaviour from csfnfs42 - it appears to be waiting with an open port on 50000. Nothing seems to have been changed since yesterday though at our end. The waiting is likely to be due to firewall issues.
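A plain TCP connect with a timeout is a quick way to tell a listening port from a firewalled one. The following is a minimal sketch, assuming Python is available on a client box; the function name, the return labels and the timeout are illustrative choices, and only the host csfnfs42 and port 50000 come from the entry above.

```python
import socket


def probe_tcp_port(host, port, timeout=5.0):
    """Classify a TCP port as 'open', 'refused', 'filtered' or 'error'.

    'filtered' (no answer before the timeout) is what a firewall that
    silently drops packets typically looks like; 'refused' means the host
    answered but nothing is listening on that port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered"
    except OSError:
        # Name lookup failure, unreachable network, and similar.
        return "error"
```

Calling probe_tcp_port("csfnfs42", 50000) from inside and from outside the firewall and comparing the two answers would show whether the firewall is the difference.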

More working with Storage group - files are now written to tape and the status is successfully updated, however restores from tape never occur.

Dcache & Work admin on 08 Feb 2005

Continued to work on csfnfs42. CA certs made no difference; adding the host certificate got us further, but we are now running into problems with mismatched DNS, which we don’t see on other boxes:

02/08 17:34:35 Cell(csfnfs42_1@csfnfs42Domain) : org.globus.common.ChainedIOException: Authentication failed [Caused by: Operation unauthorized (Mechanism level: Authorization failed. Expected "/CN=host/castorgrid.cern.ch" target but received "/C=CH/O=CERN/OU=GRID/CN=host/castorgrid08.cern.ch")]. Caused by GSSException: Operation unauthorized (Mechanism level: Authorization failed. Expected "/CN=host/castorgrid.cern.ch" target but received "/C=CH/O=CERN/OU=GRID/CN=host/castorgrid08.cern.ch")
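The error comes down to the client expecting one host name while the server certificate carries another. Below is a small sketch for pulling the two names out of such a message so they can be compared at a glance. The regular expressions and function names are my own, written only against the message above; they are not part of dcache or Globus, and they assume the message uses plain double quotes.

```python
import re

# Matches: Expected "<dn>" target but received "<dn>"
_MISMATCH_RE = re.compile(r'Expected "([^"]+)" target but received "([^"]+)"')


def common_name(dn):
    """Return the host name from a trailing CN=host/<name> component, or None."""
    match = re.search(r"CN=(?:host/)?([^/]+)$", dn)
    return match.group(1) if match else None


def parse_mismatch(message):
    """Return (expected_host, received_host), or None if the text does not match."""
    match = _MISMATCH_RE.search(message)
    if match is None:
        return None
    expected_dn, received_dn = match.groups()
    return common_name(expected_dn), common_name(received_dn)
```

For the message above this gives castorgrid.cern.ch as the expected name and castorgrid08.cern.ch as the name in the certificate.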

Still trying to track down differences; have requested a certificate for dcp0344 to try and reproduce there on sl3. If this is a rh7.3 problem we could be in trouble.

Opened up dcache.gridpp.rl.ac.uk for local file access; Tim Barrass is now able to delete zero length files. Reading/writing is either not possible or will do bad things, however. Need to document wormhole digging.

Got improved hsm interface script from storage group, dropped in and confirmed working, stress testing to follow later.

Dcache & Work admin on 07 Feb 2005

Got toast agenda and actions written and distributed.

Rebooted dcp352 to fix nfs hang.

Ensured I could reproduce the failure of an srmcp 3 party transfer to a pool node without a certificate. Should install CA certs on nfs42 just to check that it doesn’t simply need them.

Dcache & Work admin on 04 Feb 2005

Pools appear to be dropping out; this looks like it could be an nfs problem that we also see on the frontends, where processes get stuck in state D. There isn’t any solution to this problem that we know about, so we may have to install dcache pools on the disk servers.
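State D is the uninterruptible sleep state reported by ps, usually a process blocked on I/O. A minimal sketch for listing such processes from ps output follows; the function name and the sample listing used to try it are hypothetical, and it assumes the output format of ps -eo pid,stat,comm.

```python
def stuck_processes(ps_output):
    """Return (pid, command) pairs for processes whose state starts with 'D'.

    Expects the text produced by `ps -eo pid,stat,comm`, header row included.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1].startswith("D"):
            stuck.append((int(fields[0]), fields[2].strip()))
    return stuck


# Hypothetical listing, for illustration only.
SAMPLE = """\
  PID STAT COMMAND
    1 Ss   init
  812 D    java
  913 D<   nfsd
 1001 R+   ps
"""
```

Run against live output (for example via subprocess.run(["ps", "-eo", "pid,stat,comm"], capture_output=True, text=True).stdout) on a pool node or frontend, this would show which processes are hung on nfs before deciding on a reboot.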

Dcache & Work admin on 03 Feb 2005

CMS began pushing data at the production dcache after working around their issues with lcg-gt by round-robining across our gridftp servers. Unfortunately dcp352 did not like this: it was taking all the data transfers, which caused nfs accesses to hang and required a reboot twice. A useful thing to remember: if adding differing quantities of disk to dcache, ensure that the largest are not all on the same pool node, as they tend to get filled up first; dcache is okay at load balancing pools but not good at load balancing pool nodes. I learnt how to use the cm fake command in the PoolManager module to set the space cost of a large pool to some fixed value, to try and push data at smaller pools:

(PoolManager) admin > cm fake dcp352.gridpp.rl.ac.uk_2
dcp352.gridpp.rl.ac.uk_2 -space=3.6E-7 -cpu=-1.0
(PoolManager) admin > cm fake dcp352.gridpp.rl.ac.uk_2 off
Faked Costs switched off for dcp352.gridpp.rl.ac.uk_2
(PoolManager) admin > cm fake dcp352.gridpp.rl.ac.uk_2
dcp352.gridpp.rl.ac.uk_2 -space=-1.0 -cpu=-1.0
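The advice about not putting the largest disks on one pool node can be turned into a simple planning step. The sketch below is my own illustration, not something dcache provides: it places each disk, largest first, on whichever node currently holds the least capacity. The node names and disk sizes in the usage note are made up.

```python
def spread_disks(disk_sizes_gb, nodes):
    """Assign disks to nodes, largest first, always onto the emptiest node.

    Returns a dict mapping each node name to the list of disk sizes placed
    on it. Ties go to the node listed first.
    """
    plan = {node: [] for node in nodes}
    for size in sorted(disk_sizes_gb, reverse=True):
        target = min(nodes, key=lambda node: sum(plan[node]))
        plan[target].append(size)
    return plan
```

For example, spread_disks([1600, 400, 1600, 400], ["nodeA", "nodeB"]) puts one large and one small disk on each node, so no single node ends up holding all of the pools that fill first.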

Dcache & Work admin on 02 Feb 2005

Monday, Tuesday Gridpp@Brunel

Wednesday

I thought I’d got local-looking file access nearly working on Tuesday, but today I went thoroughly through various tests and configurations and it looks like I was mistaken. I can get files written in, but they’re not hitting the pools - presumably they’re actually being stored inside the gdbm files, which isn’t a good thing. Oddly, though, I get different behaviour (errors) using dr35.esc than with dteam001.dteam, which shouldn’t be the case if this were purely gsi/grid access. I’m pretty positive it isn’t; I’m fairly sure most operations are hitting the pnfs server via nfs directly.

As dr35.esc

Test          |  mkdir   rmdir  chmod_dir   read_db   read_pool   rm_db   rm_pool   cp
              |
Setup         |
-----------------------------------------------------------------------------------------
!IO!LD!P          v        v       v           v         b          v        v      db
IO!LD!P           v        v       v           v         b          v        v      db
!IOLD!P           v        v       v           x         x          v        x      x
IOLD!P            v        v       v           x         x          v        x      x
!IO!LDP           v        v       v           v         b          v        v      db
IO!LDP            v        v       v           v         b          v        v      db
!IOLDP            v        v       v           x         x          v        v      x
IOLDP             v        v       v           h         v          v        v      x
IOLDPE            v        v       v           h         v          v        v      x
	

As dteam001.dteam

IOLDP             v        v       v           h         v          v        v      0

IO = DCACHE_IO_TUNNEL
LD = LD_PRELOAD
P = Proxy
E = X509_USER_PROXY env variable

read_pool = b: less sees file as binary

read_db = h: less hangs

cp = db: file is created in db

cp = 0: zero length file is created
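The first eight setup rows are simply every on/off combination of the three settings. A small sketch for generating those labels in the same order as the table, so that no combination gets missed when re-running the tests; the function name is hypothetical and the extra IOLDPE row is not covered.

```python
from itertools import product


def setup_labels():
    """Return labels such as '!IO!LD!P' for every on/off combination.

    product() varies its last position fastest, so P is listed first and
    IO last to make IO toggle fastest, as in the table.
    """
    labels = []
    for proxy, ld_preload, io_tunnel in product((False, True), repeat=3):
        settings = (("IO", io_tunnel), ("LD", ld_preload), ("P", proxy))
        labels.append(
            "".join(name if enabled else "!" + name for name, enabled in settings)
        )
    return labels
```

Each label can then be mapped to the matching environment (DCACHE_IO_TUNNEL, LD_PRELOAD, proxy present or not) before running the mkdir, rmdir, chmod, read, rm and cp checks.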

Also had a meeting about service challenges - we need to do 100MB/s sustained for two weeks, and we’re to build a separate dcache instance specifically for the service challenge.
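To put that target in terms of disk space, here is a quick back-of-the-envelope calculation. It assumes decimal units (1 TB = 1,000,000 MB), which the meeting notes do not specify, and the function name is just for illustration.

```python
def volume_tb(rate_mb_per_s, days):
    """Total data moved, in decimal TB, at a sustained rate over some days."""
    seconds = days * 24 * 60 * 60
    return rate_mb_per_s * seconds / 1_000_000


per_day = volume_tb(100, 1)      # about 8.64 TB per day
two_weeks = volume_tb(100, 14)   # about 120.96 TB over the challenge
```

So the separate instance has to absorb roughly 8.6 TB a day, or about 121 TB in total if nothing is cleaned up during the two weeks.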