WP4 phone conference, 05/11/2003

 

  • NIKHEF: Martijn
  • CERN: Maite, Jan, Sylvain
  • ZIB: Thomas
  • University of Heidelberg: Frank

Review of actions

  • Action 18: Thomas wanted to test using SMP machines and we don’t have any of those available on our WP4 testbed. The porting to LSF is finished and the testing has been done at ZIB. Action CLOSED.
  • Action 90: Following the discussion and clarifications between the monitoring and FT tasks after the last conference, Lord has produced a new RPM fixing the following points:
    • 6. Improve the logging facility.
    • 9. Not easily readable, need more 'user oriented' messages
    • 10. Timestamp should be given on every line
    • 15. Reinject actions/rule results in monitoring as metrics in addition to log. Really important to have full loop.
  • Action 91: Maite contacted the WP1 people responsible for accounting (Andrea Guarise). This is the present status from him:

“I had some trouble (not so much) putting all together (maui,rtcs and the sensor) and having it work, but now It seem to be working fine on my CE in Turin. Currently I'm performing a little stress-test to see whether the whole procedure is stable enough or not, but the result on a couple of hundred jobs submitted are good (~100% success).The bad news are that I had to do a couple of changes to your accounting sensor script (just 2 or tree lines due to unnoticed changes to the name of some variables, and to the fact that the user credentials now are cleaned from the user dir -before- the sensor run, so I use the host credential ), and that we (wp1) still need to fine tune some part of the job submission process related to the accounting part (we plan to do these changes in the next weeks).So to conclude, given these small changes to the script, things seems to be working. I only noticed that maui seems to be rather slow in the scheduling process, but this may be caused by some lack in my configuration.”

Maite asked him to log the changes via Bugzilla so that Thomas could incorporate them into the code. No news from him since then.

The RMS was switched on at NIKHEF’s application testbed, as agreed. Some problems arose because Maui node features were not reported (node features = arbitrary properties of nodes). Maui uses them with feature attributes. As a result, jobs did not get executed. The RMS was switched off. Thomas fixed the bug and produced a new RPM that was installed in the development testbed. Action on Thomas to report the problems and the new version of the RPM via Bugzilla.

 

New actions:

-

Institute reports:

  • German (CERN, Installation task):

http://cern.ch/wp4-install/documents/wp4-install-progress-report-2003-0511.htm

  • Martijn (NIKHEF, gridification task):
    • 2.5.2 LCMAPS AFS and Kerberos modules: ongoing/ delayed. Gssklog daemon has been set up for testing (ongoing). Release: end of November.
    • 2.5.5 LCAS server implementation: delayed.
    • 2.5.6 Job repository: ongoing. Server has been set up. Working on API and LCMAPS plugin. Release: November/December?
    • 2.5.9  Integration, deployment and support: ready/ongoing
    • 2.5.10 Support for the release 2 system to the testbed: ongoing
  • Thomas (ZIB, resource mgt task): We were working on the final report, supporting our components and preparing for the Cluster 2003 conf.
  • Lord (Fault Tolerance task, Heidelberg):

1. Stress tests in the testplan. CLOSED

2. Resources used by daemon. CLOSED

3. Large number of rules configured (~1000). CLOSED

4. Complex rules configured (>100 operands). CLOSED

5. Put trash in monitoring data. CLOSED/NOT POSSIBLE?

6. Improve the logging facility. CLOSED

7. Still nothing about ok/bad rules in log file CLOSED

8. Some debugging messages (pointers...) CLOSED

9. Not easily readable, need more 'user oriented' messages CLOSED

10. Timestamp should be given on every line CLOSED

11. Fix rpm name with gcc 3.2.2 CLOSED

12. Provide a clean source tarball. CLOSED

13. Improved configuration with quattor NOT STARTED

14. Do not link the fmon API statically, to avoid having to rebuild in case of updates. CLOSED

15. Reinject actions/rule results in monitoring as metrics in addition to log. Really important to have full loop. ONGOING

16. Maximum number of trials if actuator does not improve situation. ONGOING

  • Jan (Fabric Monitoring):
    • 2.3.4 Alarm Display: Work is ongoing to scale up to 1600 nodes with 20 metrics.
    • 2.3.9 Perl implementation of the repository API: Using the simplified C-wrapper around the repository API, Sylvain has generated bindings for Perl, Tcl and PHP, using SWIG.
    • 2.3.12 ORACLE interface: improved logging, bug fixes.
    • 2.3.6 Integration and deployment on the EDG testbed: New release (including shared libraries)
    • 2.3.14 Interface open source Database: Metadata schema has been documented. This schema will be used by the OraMon server as well.
    • 2.3.21 MSA developments:
      • Sensor response check:  done
      • Local sample on demand: started

AOB

  • WP4 Final Evaluation Report:
    • EDMS link: https://edms.cern.ch/document/409217/1
    • Tool evaluation status:
      • For quattor: done by CERN service managers
      • For EDG-LCFGng: 10 evaluations received so far
      • For gridification components: no evaluation received yet
      • For LEMON: KIP evaluation received
      • For FT: KIP evaluation received
      • For RMS: no evaluation received yet
    • Document skeleton distributed by Maite; no comments received, so accepted.
    • Architecture contribution received. Comments sent back to FT and monitoring. More contributions requested; please respect the deadlines and send them on time.
    • The PTB review will be done by:
      • Moderator: Franck Bonnassieux (WP7)
      • Reviewers: Cal Loomis (WP6), Kors Bos (PMB), Stefano Becco (WP1)
  • EU review: It will be at CERN, 19-20 February. No demos from the middleware WPs, just from the applications. A half-hour presentation per middleware WP, with 10 minutes for questions; many questions are expected because this is the project closure (on funding, results, dissemination…).

 

Next meeting: 19/11/2003