
Category Archive: Dcache



Dcache & Work Derek on 07 Mar 2006

21/02/06 - 07/03/06

7th March

T1 - multiple T2 transfer, GS not around - only IC and QMUL doing transfers - initiated transfers to OX, GLA, BHAM and MAN myself at 10am, stopped at noon
ESC-Services Network Operations meeting

6th March

Monday Morning Ops Meeting
SC phone conference

3rd March

RC report
Tried to get replication status information out of slony

2nd March

T1 to multiple T2 transfer tests - attempting to diagnose network fault previously seen

1st March

GridPP-Storage phone conference
SC phone conference

28th February

T1 to multiple T2 transfer tests, started at 10am, asked to stop by Site Networking at 2:50pm due to affecting other site traffic
Asked DK about APEL publishing - RAS had noticed we’d stopped publishing
Attended meeting about Lustre
Did some lingering rpm updates left over from the run by ST

27th February

Monday Morning Operations Meeting
Noted RAL-LCG2 failing SFT due to CA certificate update, asked LCG-ROLLOUT when new release via LCG expected - no reply
Mailed GS about files to be used in transfer test - replicated the intended file over more disk servers. Debugged some transfers failing due to permission problems - permission changer script hadn’t started back up again after restart
Mailed person who had sent e-mail about OPN to CERN outage asking about current status - still down

24th February

Applied various outstanding updates to systems
RC Report with ST

23rd February

Found 2 CMS transfers, to same pool, through same gridftp server at FNAL at same time - 1 failed, 1 succeeded. Reported these to TB, who replied telling me about problems with the stager at FNAL.

22nd February

GridPP-Storage phone conference

21st February

Debugging CMS transfers

Dcache & RT & Uncategorized Derek on 20 Feb 2006

20/02/2006

Shutdown/startup: vacuumed postgres dbs on 350 and pnfs - lots of disk space reclaimed; upgraded pnfs to 8.1.3. Installed slony on pnfs and set up replication to 438 - still needs log rotation and startup scripts to be done. Would have taken less time than it did, but vi decided to be too clever by half and not show me that the files I was editing were msdos style and not unix. CB was trying to get dcap write access working; pointed him towards gsidcap, but our system was still in pieces at that point so couldn’t really help out that much. Altered pools once they came up to correct settings for multiple io mover queues. Checked gftp doors now using new gftp queue.
Deleted 200+ tickets from helpdesk after mailstorm due to batch scheduler weirdness.
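
The msdos-versus-unix line ending problem mentioned above can be checked outside the editor before touching a config file. A minimal sketch in Python, assuming the file can simply be read as bytes; the function names are made up for illustration:

```python
def line_ending_style(data: bytes) -> str:
    """Classify line endings as 'dos', 'unix', 'mixed' or 'none'."""
    crlf = data.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract to get bare LFs.
    lf_only = data.count(b"\n") - crlf
    if crlf and lf_only:
        return "mixed"
    if crlf:
        return "dos"
    if lf_only:
        return "unix"
    return "none"


def to_unix(data: bytes) -> bytes:
    """Convert DOS (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")
```

Running `line_ending_style(open(path, "rb").read())` on a file before editing it would flag a DOS-style file regardless of what the editor chooses to display.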

Dcache & Work Derek on 18 Feb 2006

17/02/2006

Mailed TB logs of a failed FNAL-RAL transfer
Installed various kernel updates
Built postgres 8.1.3 for SL3
Assisted in farm shutdown

Dcache & RT & Work Derek on 16 Feb 2006

26/01/2006 - 16/02/2006

16th February

Built postgres 8.1.3 for SL3, helped OS with dCache PoolManager
Mailed Zeus about 2 zero length files
Did various RGMA requests

15th February

Continued configuring new SL4 postgres server
Built slony 1.5 for postgres 8.1.3
GridPP-Storage phone conf
Shadowed ST doing relocatable WN upgrade to LCG 2.7.0

14th February

Mailed CERN about link - turned out to be CERN configuration issue
Mailed Lancaster about pingable host on their end of UKLight for more monitoring
Installed new SL4 postgres server

13th February

Monday Morning Ops meeting
Tweaked RT’s web ui on replies to not attempt to set Ticket owner to current owner - was interacting with autotaking
Restarted gftp servers - all stuck at max transfers
Noticed UKLight down, mailed Site Networking
Setup multiple io queues on disk servers - began restarting quiet ones - leave rest till powerdown
Configured gftp servers to use gftp queue

10th February

Meeting with ST - reviewed Job plan
Setup autotaking of tickets on reply in RT
Reviewed SFT failures for RC report

9th February

TOAST meeting
Mailed TB about huge number of errors reported in dCache logs from file access from lcgui02 - looks like files not being closed properly - but still not really resolved.
Added query for grid v non-grid usage to T1 metrics page on wiki

8th February

266,270 couldn’t access yumit - turned out to be nscd still using ip address of old system - nscd -i hosts got things working again
Helpdesk fell over - rebooted

7th February

Installed new certificates - but left keys encrypted causing gridftp transfers to fail for 4 hours - fixed
Checked GridPP-Storage table’s Tier 1 historical numbers for RAS
Supplied UKLight plots to MJB
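
A quick pre-flight check could have caught the encrypted-key mistake from this day's certificate install. The sketch below is an illustration rather than the procedure actually used: it looks only for the two PEM markers that indicate a passphrase-protected key (the PKCS#8 `ENCRYPTED PRIVATE KEY` header and the traditional OpenSSL `Proc-Type: 4,ENCRYPTED` header). The function name is hypothetical.

```python
def pem_key_is_encrypted(pem_text: str) -> bool:
    """Return True if a PEM private key appears to be passphrase-protected.

    Checks for the PKCS#8 encrypted header and the traditional
    OpenSSL 'Proc-Type: 4,ENCRYPTED' header. It does not parse the key.
    """
    return (
        "-----BEGIN ENCRYPTED PRIVATE KEY-----" in pem_text
        or "Proc-Type: 4,ENCRYPTED" in pem_text
    )
```

Calling it on the contents of each newly installed host key, and refusing to restart services if it returns True, would turn a four-hour outage into an immediate error.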

6th February

Bulk requested 8 host certificates, provided feedback on experience to JJ and MV
Supplied gridusage plots to ST
Sent around updated TOAST agenda

3rd February

Holiday

2nd February

Holiday

1st February

Holiday

31st January

Reported 2 problems with yumit to CC
Talking with PS, decided that the RT <-> UKIROC Footprints problem was down to a problematic site mail server; configured helpdesk to not use that mail server.

30th January

Monday morning ops meeting
Asked CA people about bulk cert request script
Mailed CC & ST about yumit not displaying packages in host detail
Updated scarf helpdesk aliases to point to HPCSG’s footprints box

27th January

Supplied RAS with Grid vs Non-Grid CPU time totals

26th January

Mailed GC some questions for CHEP

Dcache & Work Derek on 25 Jan 2006

25/01/2006

Setting up Babar’s space on dcache
Asked for input on Experiments activities vs GridPP milestones - decided on free-for-all, got people to schedule T1-T2 transfers over the remainder of the month
Attended GridPP-Storage phone conference
Helped out OS with dCache problem
Reinstalled Marley

Dcache & Work Derek on 25 Jan 2006

24/01/2006

Transfers to CERN finished at around midday - 150MB/s achieved overnight
Lancaster did some transfers from early afternoon - 833Mb/s
Marley’s disk reported errors - handed over to GP
Assisted CB with dCache problem
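
The two rates above are quoted in different units, so they are easy to misread side by side. A small conversion sketch, assuming the capitalisation is literal (MB/s meaning megabytes per second, Mb/s meaning megabits per second) and 8 bits per byte with no protocol overhead:

```python
def megabytes_to_megabits(mb_per_s: float) -> float:
    """Convert a rate in MB/s (megabytes) to Mb/s (megabits)."""
    return mb_per_s * 8


def megabits_to_megabytes(mbit_per_s: float) -> float:
    """Convert a rate in Mb/s (megabits) to MB/s (megabytes)."""
    return mbit_per_s / 8


if __name__ == "__main__":
    print(megabytes_to_megabits(150))   # CERN rate expressed in Mb/s
    print(megabits_to_megabytes(833))   # Lancaster rate expressed in MB/s
```

Under that assumption the CERN figure corresponds to 1200 Mb/s and the Lancaster figure to roughly 104 MB/s.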

Dcache & Work Derek on 24 Jan 2006

23/01/2006

(Makes it easier to read all of long log entries if they have a title, so going to try and remember and put relevant dates in work entries)

Made various tweaks to dCache - restarted all gridftp doors over the course of the morning - this fixed a balance issue we were seeing with SC transfers avoiding gftp0444 - probably due to lingering connections to that system
Restarted some more pools
Cleared the CMS files from nfs39 left over when we gave it to LHCb, as dCache decides to use pools based on the amount of free space, not the amount of “freeable” space, so nfs39’s pools weren’t getting as much use as they might have.
Noticed queue of transfers on nfs39 - so raised maximum movers - immediate data movement out of server to lhcb jobs on batch farm, still not understood why things had got queued though - possibly too many jobs opening multiple files?
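
The free versus “freeable” space effect noted above can be shown with a toy model. This is only an illustration of the observation in this entry, not dCache's actual pool selection code; the `Pool` class, the pool names and the numbers are all invented.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    free: int       # space currently unused (arbitrary units)
    freeable: int   # space held by files that could be evicted


def pick_by_free(pools):
    """Choose the pool with the most unused space (the behaviour observed)."""
    return max(pools, key=lambda p: p.free)


def pick_by_free_plus_freeable(pools):
    """Alternative: also count space that could be reclaimed."""
    return max(pools, key=lambda p: p.free + p.freeable)


pools = [
    Pool("pool_with_leftovers", free=10, freeable=500),  # like nfs39 before cleanup
    Pool("other_pool", free=100, freeable=0),
]
```

With these numbers `pick_by_free` prefers `other_pool`, while `pick_by_free_plus_freeable` prefers `pool_with_leftovers`, which is why deleting the leftover files was needed to get the server used.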

Attended:
Monday morning ops meeting
Tech discussion (gave talk)
SC3 phone conference (mailed RAL’s report afterwards to JS too)

Dcache & Work Derek on 20 Jan 2006

Friday Jan 20th

More babysitting SC3 rerun, csfnfs51 gave too many files errors - restarted it
Research for talk on Dcache and SRB on Monday
Various errata rpms applied to systems
Meeting with ST

Thursday 19th

Monitoring SC3 rerun
Research for talk

Dcache & Work Derek on 19 Jan 2006

Babysat SC3 rerun:
rebooted gftp0447
discovered what seemed to be two dcache-pool services running on csfnfs63; at least, lots of java processes were still there after a service dcache-pool stop. Killing them all and restarting dcache-pool seems to have got rid of the poolRestarted messages in the PoolManager logs; however, csfnfs63 is still taking data in much faster than any other disk server.

Dcache & Work Derek on 17 Jan 2006

Ops meeting
SC3 phone conference
Rebooked ops meeting for next 52 weeks
Kept an eye on dCache hosts, nfs39 showing too many open files error so restarted with updated ulimit -n value - must remember to reinstate that after future upgrades.
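
Since the raised `ulimit -n` is easy to lose across upgrades, a check of the limit a process actually inherits is a handy reminder. A minimal sketch using Python's standard `resource` module (Unix only); it reports the limits of the Python process itself, which match what any service started from the same shell would inherit. It does not inspect a running dCache process.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("open files: soft=%s hard=%s" % (describe(soft), describe(hard)))
```

Running this from the same environment as the init script, after an upgrade, would show whether the higher limit is still in place.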
