Category Archive: Work
Dcache & Work Derek on 07 Mar 2006
21/02/06 - 07/03/06
7th March
T1 to multiple T2 transfer test, GS not around - only IC and QMUL doing transfers - initiated transfers to OX, GLA, BHAM and MAN myself at 10am, stopped at noon
ESC-Services Network Operations meeting
6th March
Monday Morning Ops Meeting
SC phone conference
3rd March
RC report
Tried to get replication status information out of slony
2nd March
T1 to multiple T2 transfer tests - attempting to diagnose network fault previously seen
1st March
GridPP-Storage phone conference
SC phone conference
28th February
T1 to multiple T2 transfer tests, started at 10am, asked to stop by Site Networking at 2:50pm because they were affecting other site traffic
Asked DK about APEL publishing - RAS had noticed we’d stopped publishing
Attended meeting about Lustre
Did some lingering rpm updates left over from the run by ST
27th February
Monday Morning Operations Meeting
Noted RAL-LCG2 failing SFT due to CA certificate update, asked LCG-ROLLOUT when new release via LCG expected - no reply
Mailed GS about files to be used in transfer test - replicated the intended file over more disk servers. Debugged some transfers failing due to permission problems - permission changer script hadn’t started back up again after restart
Mailed person who had sent e-mail about OPN to CERN outage asking about current status - still down
24th February
Applied various outstanding updates to systems
RC Report with ST
23rd February
Found 2 CMS transfers to the same pool, through the same gridftp server at FNAL, at the same time - one failed, one succeeded. Reported these to TB, who replied telling me about problems with the stager at FNAL.
22nd February
GridPP-Storage phone conference
21st February
Debugging CMS transfers
Dcache & RT & Uncategorized Derek on 20 Feb 2006
20/02/2006
Shutdown and startup: vacuumed postgres databases on 350 and pnfs - lots of disk space reclaimed, upgraded pnfs to 8.1.3.
Installed slony on pnfs and set up replication to 438 - still needs log rotation and startup scripts to be done. Would have taken less time than it did, but vi decided to be too clever by half and not show me that the files I was editing were msdos style and not unix.
CB was trying to get dcap write access working, pointed him towards gsidcap, but our system was still in pieces at that point so couldn’t really help out that much.
Altered pools once they came up to the correct settings for multiple io mover queues.
Checked gftp doors are now using the new gftp queue.
Deleted 200+ tickets from helpdesk after mailstorm due to batch scheduler weirdness.
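The msdos line ending problem from the slony setup above is the sort of thing that can be checked before editing. A minimal sketch, assuming the files are small enough to read in one go; the function names are made up for illustration and are not part of any tool used here:

```python
def line_ending_style(data: bytes) -> str:
    """Classify raw file contents as 'dos', 'unix', 'mixed' or 'none'."""
    crlf = data.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract to get the bare LF count.
    lf_only = data.count(b"\n") - crlf
    if crlf and lf_only:
        return "mixed"
    if crlf:
        return "dos"
    if lf_only:
        return "unix"
    return "none"


def to_unix(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving bare LF untouched."""
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    sample = b"first line\r\nsecond line\r\n"
    print(line_ending_style(sample))
    print(line_ending_style(to_unix(sample)))
```

Reading the file in binary mode matters here, since text mode would quietly translate the line endings before they could be counted.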
Dcache & Work Derek on 18 Feb 2006
17/02/2006
Mailed TB logs of a failed FNAL-RAL transfer
Installed various kernel updates
Built postgres 8.1.3 for SL3
Assisted in farm shutdown
Dcache & RT & Work Derek on 16 Feb 2006
26/01/2006 - 16/02/2006
16th February
Built postgres 8.1.3 for SL3, helped OS with dCache PoolManager
Mailed Zeus about 2 zero length files
Did various RGMA requests
15th February
Continued configuring new SL4 postgres server
Built slony 1.5 for postgres 8.1.3
GridPP-Storage phone conf
Shadowed ST doing relocatable WN upgrade to LCG 2.7.0
14th February
Mailed CERN about link - turned out to be CERN configuration issue
Mailed Lancaster about pingable host on their end of UKLight for more monitoring
Installed new SL4 postgres server
13th February
Monday Morning Ops meeting
Tweaked RT’s web ui on replies to not attempt to set Ticket owner to current owner - was interacting with autotaking
Restarted gftp servers - all stuck at max transfers
Noticed UKLight down, mailed Site Networking
Set up multiple io queues on disk servers - began restarting quiet ones - leaving the rest till powerdown
Configured gftp servers to use gftp queue
10th February
Meeting with ST - reviewed Job plan
Set up autotaking of tickets on reply in RT
Reviewed SFT failures for RC report
9th February
TOAST meeting
Mailed TB about huge number of errors reported in dCache logs from file access from lcgui02 - looks like files not being closed properly - but still not really resolved.
Added query for grid vs non-grid usage to T1 metrics page on wiki
8th February
266,270 couldn’t access yumit - turned out to be nscd still using ip address of old system - nscd -i hosts got things working again
Helpdesk fell over - rebooted
7th February
Installed new certificates - but left keys encrypted causing gridftp transfers to fail for 4 hours - fixed
Checked GridPP-Storage table’s Tier 1 historical numbers for RAS
Supplied UKLight plots to MJB
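The encrypted key mistake on the 7th could be caught with a quick check before restarting services. A rough sketch, assuming PEM files in either the traditional OpenSSL layout (a "Proc-Type: 4,ENCRYPTED" header) or PKCS#8 (an "ENCRYPTED PRIVATE KEY" label); the function name is made up, and this only looks at the markers, it does not parse the key:

```python
def pem_key_is_encrypted(pem_text: str) -> bool:
    """Return True if the PEM text looks like a passphrase-protected key."""
    # PKCS#8 encrypted keys announce themselves in the BEGIN label.
    if "-----BEGIN ENCRYPTED PRIVATE KEY-----" in pem_text:
        return True
    # Traditional OpenSSL keys carry an RFC 1421 style header instead.
    for line in pem_text.splitlines():
        if line.startswith("Proc-Type:") and "ENCRYPTED" in line:
            return True
    return False


if __name__ == "__main__":
    # Dummy key bodies, for illustration only.
    encrypted = (
        "-----BEGIN RSA PRIVATE KEY-----\n"
        "Proc-Type: 4,ENCRYPTED\n"
        "DEK-Info: DES-EDE3-CBC,0123456789ABCDEF\n"
        "\n"
        "AAAA\n"
        "-----END RSA PRIVATE KEY-----\n"
    )
    print(pem_key_is_encrypted(encrypted))
```

A host key that a daemon has to read unattended should come back False from a check like this.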
6th February
Bulk requested 8 host certificates, provided feedback on experience to JJ and MV
Supplied gridusage plots to ST
Sent around updated TOAST agenda
3rd February
Holiday
2nd February
Holiday
1st February
Holiday
31st January
Reported 2 problems with yumit to CC
Talking with PS, decided that the RT <-> UKIROC Footprints problem was down to a problematic site mail server, configured helpdesk to not use that mail server.
30th January
Monday morning ops meeting
Asked CA people about bulk cert request script
Mailed CC & ST about yumit not displaying packages in host detail
Updated scarf helpdesk aliases to point to HPCSG’s footprints box
27th January
Supplied RAS with Grid vs Non-Grid CPU time totals
26th January
Mailed GC some questions for CHEP
Dcache & Work Derek on 25 Jan 2006
25/01/2006
Setting up Babar’s space on dcache
Asked for input on Experiments activities vs GridPP milestones - decided on free-for-all, got people to schedule T1-T2 transfers over the remainder of the month
Attended GridPP-Storage phone conference
Helped out OS with dCache problem
Reinstalled Marley
Dcache & Work Derek on 25 Jan 2006
24/01/2006
Transfers to CERN finished at around midday - 150MB/s achieved overnight
Lancaster did some transfers from early afternoon - 833Mb/s
Marley’s disk reported errors - handed over to GP
Assisted CB with dCache problem
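The two rates above are quoted in different units, so a quick conversion helps when comparing them. A small sketch, assuming "MB/s" means megabytes per second and "Mb/s" means megabits per second exactly as written (the log may be using them loosely), with 8 bits to the byte:

```python
BITS_PER_BYTE = 8


def megabits_to_megabytes(mbit_per_s: float) -> float:
    """Convert a rate in Mb/s to MB/s."""
    return mbit_per_s / BITS_PER_BYTE


def megabytes_to_megabits(mbyte_per_s: float) -> float:
    """Convert a rate in MB/s to Mb/s."""
    return mbyte_per_s * BITS_PER_BYTE


if __name__ == "__main__":
    # CERN transfers: 150 MB/s expressed in Mb/s
    print(megabytes_to_megabits(150))
    # Lancaster transfers: 833 Mb/s expressed in MB/s
    print(megabits_to_megabytes(833))
```

Under that reading, 150MB/s is 1200Mb/s and 833Mb/s is a little over 104MB/s.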
Dcache & Work Derek on 24 Jan 2006
23/01/2006
(Makes it easier to read all of long log entries if they have a title, so going to try and remember and put relevant dates in work entries)
Made various tweaks to dCache - restarted all gridftp doors over the course of the morning - this fixed a balance issue we were seeing with SC transfers avoiding gftp0444 - probably due to lingering connections to that system
Restarted some more pools
Cleared the CMS files from nfs39 left over when we gave it to LHCb, as dCache decides to use pools based on the amount of free space, not the amount of “freeable” space, so nfs39’s pools weren’t getting as much use as they might have.
Noticed queue of transfers on nfs39 - so raised maximum movers - immediate data movement out of server to LHCb jobs on batch farm, still not understood why things had got queued though - possibly too many jobs opening multiple files?
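To illustrate the free vs “freeable” point about nfs39 above, here is a toy ranking. This is not dCache's actual pool selection code, just a sketch of the effect as I understand it; the pool names, sizes and field names are made up:

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    total_gb: int
    precious_gb: int   # data that has to stay on the pool
    removable_gb: int  # leftover files that could be deleted if needed

    @property
    def free_gb(self) -> int:
        """Space that is empty right now."""
        return self.total_gb - self.precious_gb - self.removable_gb

    @property
    def freeable_gb(self) -> int:
        """Empty space plus whatever could be reclaimed."""
        return self.free_gb + self.removable_gb


def pick_pool(pools, key):
    """Return the name of the pool that scores highest under `key`."""
    return max(pools, key=key).name


if __name__ == "__main__":
    pools = [
        Pool("nfs39_1", total_gb=1000, precious_gb=100, removable_gb=700),
        Pool("other_1", total_gb=1000, precious_gb=500, removable_gb=0),
    ]
    print(pick_pool(pools, key=lambda p: p.free_gb))
    print(pick_pool(pools, key=lambda p: p.freeable_gb))
```

Ranking on free space alone sends writes to the other pool even though nfs39 could make far more room, which matches what was seen until the leftover files were cleared.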
Attended:
Monday morning ops meeting
Tech discussion (gave talk)
SC3 phone conference (mailed RAL’s report afterwards to JS too)
Dcache & Work Derek on 20 Jan 2006
Friday Jan 20th
More babysitting SC3 rerun, csfnfs51 gave “too many files” errors - restarted it
Research for talk on Dcache and SRB on Monday
Various errata rpms applied to systems
Meeting with ST
Thursday 19th
Monitoring SC3 rerun
Research for talk
Dcache & Work Derek on 19 Jan 2006
Babysat SC3 rerun:
rebooted gftp0447
discovered what seemed to be two dcache-pool services running on csfnfs63, at least lots of java processes were still there after a service dcache-pool stop. Killing all of them and restarting dcache-pool seems to have got rid of the poolRestarted messages in the PoolManager logs, however csfnfs63 is still taking data in much faster than any other disk server.
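Checking for leftover processes after a stop can be scripted. A minimal sketch that picks PIDs out of `ps -eo pid,args` style output; the function name and the "java" marker are assumptions for illustration, and anything it finds should be looked at by hand before being killed:

```python
from typing import List


def leftover_pids(ps_output: str, marker: str) -> List[int]:
    """Return PIDs whose command line contains `marker`.

    Expects two columns per line, PID then command line, as printed
    by `ps -eo pid,args`. Header and blank lines are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # header line, blank line, or unexpected format
        pid, args = parts
        if marker in args:
            pids.append(int(pid))
    return pids


if __name__ == "__main__":
    # Made-up sample output, not taken from csfnfs63.
    sample = (
        "  PID COMMAND\n"
        " 1234 /usr/bin/java -server pool\n"
        " 1300 /bin/bash\n"
        " 1401 java -Xmx256m pool\n"
    )
    print(leftover_pids(sample, "java"))
```

In practice the text would come from running ps after the service stop, and an empty list would mean the stop had cleaned up properly.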