Announcements

2024-01-08 to 01-12

maintenance week. closed.

2023-08-28 to 09-01

migration of zks systems to rocky9

2023-07-31

The existing graylog logging is being replaced by opensearch. The technical background is licensing changes and features removed in graylog 4.

For command line users, acc9 provides opensearch tail --help. For example: opensearch tail --host fel0003

The web interface is available at https://logging.acc.gsi.de

The web interface is a generic interface for opensearch. As such it is not quite as easy to use as graylog. A quick guide is available in DiagnosticLogging

Currently logs are shipped from graylog to opensearch, i.e. messages that arrive in graylog are also forwarded to opensearch, so you can already take a look.

Starting in calendar week 31, Diagnostic Logging will be restructured and opensearch goes into production. The switchover will probably be somewhat bumpy overall, since fairly few things can be tested in advance and IP addresses, netmasks and DNS names change at the same time.

Some (Java) services and frontends will probably need a restart for this.

2023-07-24 to 28

virtualization acc7.

2023-07-24 to 28

maintenance week database.

2023-07-17 to 21

maintenance week. closed. Pending public tasks

  • acc6
  • tcl1000

2023 switch to Rocky Linux 9

Rocky Linux will be migrated to version 9

The rough roadmap for 2023

  • in 2022
    • Preparations by ACO-INN
    • adapt eclipse plugin for websvcdev
  • week 1/2023 Done
    • Migration of fsl00c Done
    • Data migration of all nfs directories Done
  • week 2/2023 Done
    • Regular maintenance week. Service interruption for all systems.
    • Service interruption for frontend boot and home directories on all systems. Done. All user facing systems are complete
  • week 3/2023: Done
    • Database maintenance Done
  • weeks 4-5/2023 Done
    • Power off acc8dev Done
    • Setup new acc9dev cluster. Planned completion week 5. No data migration. (Home directories are nfs based and will survive) Done
  • week 5/2023 Done
    • Provide acc9 ramdisk
  • week 6/2023 and subsequent Done
    • replace integration environment FAIR.Intern.SystemInt (el7 to el9). No data migration. Done
    • replace acc8pro virtual machines. (order as required, INT first). No data migration. Service interruption of production services. Done
      • asl151, asl152, asl156, asl157, asl158, asl159 Done
  • Activities in parallel
    • reinstall tcl1000 console environments
    • fesa release for acc9
      • fesa buildhost asl551 upgrade to el9 Done
    • fesa sdk for yocto@acc9 Done
    • adapt eclipse plugin for newer eclipse version (2022-12)
    • reinstall zks
    • reinstall other systems (artifacts, archiving, interlock, logging, etc)

2022-09-26 upgrade storage

firmware upgrade on storage systems nwsr06m and nwsr07m. This is an online upgrade, nothing noticeable should happen.

complete.

2022-09-19 websvcdev

websvcdev will be switched to a new server. Access to the new webserver via webdav, see ClusterAcc9

user data will not be migrated.

2022-08-24 upgrade ovirt

upgrade ovirt. Might affect all virtual machines that are not on vmware. acc8, artifacts, git, jenkins.

In theory everything happens live and nobody will notice. If something goes wrong, everyone will notice.

2022-08-15 vmware migration

downtime for acc6, virtual frontends/scu, integration system, ...

2022-08-08 Oracle Maintenance

Update OS, Firmware, Patches RDBMS. Databases should be online on one node of the cluster, but complete outage is possible.

2022-07-18 maintenance week.

server updates are mostly complete. tcl1000 upgrades will happen over the next days.

completed
  • nfs migration (note: no support for nfs v2 see NfsServerRocky8Migration)
  • artifacts migration (note: new urls see ArtifactRepository)
  • acc7dev
  • acc7pro
  • acc8dev
  • clipboard migration
  • websvcpro migration (note: webdav access only see ClusterAcc8)
  • updates frontend ramdisk (note: current points to el7 based ramdisk)
  • zks
  • asl335, asl735 (note: systems are unsupported)
  • upgrade gitea
  • upgrade foswiki

2022-03

cluster acc8dev is affected by a gfs2 bug (/common/usr and /common/scratch). Processes will get stuck in uninterruptible state (D). Bugfix deployed 2022-07-18

2021-10-25

Maintenance window closed. Pending tcl1000 will be completed in the next days.

Update, Patches, Firmware on all systems

Status

Completed
  • acc7dev, acc7pro, acc7file complete
  • acc8dev-rc1 complete
  • oracle acc database complete
  • archiving complete
  • graylog complete
  • acc6 complete
  • acc7int/vmla complete
  • www-acc complete
  • scu ramdisk updated
  • zks (thursday, 28.Oct)
  • oracle gsi database
  • symantec
  • tcl1000

2021-08-24

networks are reconfigured.

  • acc network is now 140.181.128.0/19, i.e. netmask 255.255.224.0
  • (most) devices are now in 10.248.0.0/19, i.e. netmask 255.255.224.0
  • (most) devices are now vlan 2700

If your system has the wrong ip, the easiest fix is a full power cycle with all power sources unplugged. This way the management modules (bmc, ipmi, ilo, drac, xport, ...) will also pick up the changes.
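
To check whether an address falls into one of the reconfigured networks, the prefixes above can be tested with Python's standard ipaddress module. This is only a minimal sketch: the helper name classify and the sample addresses are made up for illustration, and the embedded network is taken as 10.248.0.0/19 as given in the 2021-08-17 entry.

```python
import ipaddress

# Networks from the announcement above.
ACC_NET = ipaddress.ip_network("140.181.128.0/19")
EMBEDDED_NET = ipaddress.ip_network("10.248.0.0/19")


def classify(address: str) -> str:
    """Return which of the reconfigured networks an address belongs to."""
    ip = ipaddress.ip_address(address)
    if ip in ACC_NET:
        return "acc"
    if ip in EMBEDDED_NET:
        return "embedded"
    return "unknown"


if __name__ == "__main__":
    # A /19 prefix corresponds to the netmask 255.255.224.0.
    print(ACC_NET.netmask)
    # Hypothetical sample addresses, not real hosts.
    for sample in ("140.181.130.17", "10.248.5.9", "192.168.1.1"):
        print(sample, classify(sample))
```

A system that still reports "unknown" for its own address has most likely not picked up the new configuration.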

2021-08-24

reconfiguration of the acc network. New network 140.181.128.0/19. Systems not picking up changes via dhcp require reconfiguration and reboot.

2021-08-17

reconfiguration embedded systems. New network 10.248.0.0/19. Reboot of all scu, supermicro, micrioc required.

2021-08-12

reconfiguration of all nwe4000 switches. Local network interruptions.

2021-08-10

changes to network connectivity core-it. Global network interruption.

2021-08-09

migrated git.acc.gsi.de to new server and new gitea version. complete.

2021-07-26

Maintenance window closed.

Status
  • acc7dev, acc7pro, acc7file complete
  • acc8dev-rc1 complete
  • oracle database complete
  • zks complete
  • archiving complete
  • graylog complete
  • acc6 complete
  • acc7int/vmla complete
  • www-acc complete
  • scu ramdisk updated

Some tasks are pending and will be completed in August
  • git.acc.gsi.de
  • tcl1000
  • ...

2021-07-26

artifacts/nexus introduce new snapshot and release retention rules, see ArtifactRepository#Retention
  • delete all snapshots not requested for more than 180 days
  • keep a maximum of 14 releases per group-artifact-id
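
The two rules can be sketched in Python as follows. This is not the actual nexus configuration, only an illustration of the selection logic: the Artifact record and the function select_for_deletion are hypothetical, and the sketch assumes that the 14 releases kept are the most recently published ones, which the announcement does not state.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

SNAPSHOT_MAX_IDLE = timedelta(days=180)  # rule 1: snapshot idle limit
MAX_RELEASES = 14                        # rule 2: releases kept per id


@dataclass(frozen=True)
class Artifact:
    """Hypothetical record describing one stored artifact version."""
    group_artifact_id: str
    version: str
    is_snapshot: bool
    published: datetime
    last_requested: datetime


def select_for_deletion(artifacts, now):
    """Return the artifacts that the two retention rules would remove."""
    doomed = []
    releases = defaultdict(list)
    for artifact in artifacts:
        if artifact.is_snapshot:
            # Rule 1: snapshots not requested for more than 180 days.
            if now - artifact.last_requested > SNAPSHOT_MAX_IDLE:
                doomed.append(artifact)
        else:
            releases[artifact.group_artifact_id].append(artifact)
    for items in releases.values():
        # Rule 2 (assumption: newest first, keep the first 14).
        items.sort(key=lambda a: a.published, reverse=True)
        doomed.extend(items[MAX_RELEASES:])
    return doomed
```

For the real behaviour, including which releases are kept, see ArtifactRepository#Retention.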

2021-07-26

delete dns entry for bel.gsi.de

2021-01-19

  • replace wsl007/wsl008 hardware
    • downtime for wiki, bugzilla, webdav

2021-01-12

  • stop subversion support

2020-11-11

completed.

Firmware updates on central acc switches. This will interrupt the complete acc network.

2020-11-09

Oracle patching completed

main maintenance complete. Oracle has some patches pending.

  • maintenance week
  • Update, Patches, Firmware all systems

completed

  • acc7
  • acc6
  • logging/graylog
  • archiving
  • k8s
  • ords
  • winccoa
  • ovirt hypervisors
  • codimd
  • id
  • gitea
  • jenkins master/slaves
  • scu ramdisk refresh
  • timingsystems tsl001/tsl019/tsl021/netdisco
  • dal001/dal002
  • wsl007/wsl008 = wiki, bugzilla

failed
  • ovirt hosted-engine failed during upgrade. Recovery is in progress. Recovered

2020-11-02

  • all remaining subversion repositories switched to readonly

2020-08-24

ten days

  • replace acc6 dev with one virtual machine -> asl735
  • replace acc6 pro with one virtual machine -> asl335

2020-08-10

Start upgrade of the ovirt cluster (vmlb).

During preparation virtual machines will be moved to different nodes (interruptions < 1sec).

During final migration machines have to be exported/imported. Service downtimes of 30 minutes. Wiki will be updated with better estimates of migration dates.

affected machines and services: vmlb*, jenkins, git, codi, k8s, spice, fec-dns, etc.

upgrade is complete.

2020-07-13

  • maintenance week fibre channel
  • should be transparent. Note "should". We can't test it.

2020-06-29

  • maintenance week ethernet
  • update ethernet switches
  • rolling reboots of all switches

2020-06-24

We need to replace some Hardware damaged by the power outage. Replacement will happen on short notice.

Downtimes for

  • done. wsl007/wsl008 = www-acc.gsi.de
  • done. asl330-asl334 = acc6pro
  • gsi oracle database. migrated to 19c except cdb
  • done. psl003-psl007 = wincc

2020-06-22

  • two to five days
  • Migration oracle gsi database to 19c
  • migration oracle acc database to 19c

2020-06-15

  • Maintenance week
  • Update, Patches, Firmware all systems
  • regular maintenance complete but see power outage notes

  • status power outage greencube
    • on monday 14:00 a power outage in the greencube happened
    • asl730-asl734 failed - fixed. asl730 won't be repaired
    • dbl005 failed - won't be repaired. We migrated to dbl2xx
    • nwsr04 controller failed - repaired

2020-01-14

  • Zks migration el7

2019-08-12

  • Maintenance week
  • Removal of java8 and eclipse-neon

2019-07-30

  • Maintenance acc7dev

2019-05-06

Maintenance window.

Operating system updates on all servers. Including asl74x, asl34x, fileservers, tcl1000, databases, interlock, etc.

Non userfacing systems will be updated before the maintenance window starts.

Java 11 will be rolled out.

Java 8 will be removed from all systems except acc7dev.

For anyone asking why systems are down or if systems are up again, the piggy bank is located at sb1.3.119

2019-04-16

Operating system updates including reboots and java 11 on acc7dev (asl740 to asl744). For curious people: that means 1072 software packages including eclipse.

acc7dev has been updated.

2019-04-15

With the end of beamtime 2018/2019 the VMS systems (axp*, bel.gsi.de) and old domain controllers (dcw001/dcw002) will be decommissioned.

2019-03-19

git.acc.gsi.de is now in beta state. If nothing serious happens it will be declared productive post beamtime. See also Git and SshAgent

2018-10-15 to 2018-10-19

Maintenance window complete

2018-07-18 to 2018-07-19

Maintenance window complete.

2018-05-18

The new kernel did not help with the nfs problems. We reconfigured the nfs and restarted the systems.

2018-05-14

We are still experiencing nfs problems. We will try to patch the kernel. This requires a rolling restart of all acc7 servers.

2018-05-08

asl744 has been rebooted. Reason for nfs problems is unknown. Ticket with redhat is open.

system asl340 lost memory modules. System is currently unavailable. This means OxygenXML is not available

2018-04-10 to 2018-04-11

Maintenance window complete.

2018-03-06

OS upgrades on storage systems. This is transparent and no system should be affected.

2018-01-29 to 2018-02-02

Maintenance window complete

done - Oracle Bundle Patch (accdbp, accdbu, accdbt) on the server side. This affects all Database services including for example LSA. We will try to execute this with a rolling upgrade, node by node, keeping at least one node available. Database sessions require a reconnect on failover.

done - Upgrade Oracle Instantclient on el7 (acc7dev, acc7pro) to version 12.2

Meltdown patches on all user facing systems

  • done - asl730-asl734 acc6dev
  • done - asl330-asl334 acc6pro
  • done - asl740-asl744 acc7dev
  • done - asl340-asl344 acc7pro
  • done - asl102 interlock
  • done - asl103 fe monitor
  • done - tsl001 timing
  • done - dal001 dataacquisition
  • done - dal002 dataacquisition
  • done - psl003-psl005 wincc
  • done partially - tcl1000-tcl10xx thinclient
  • done - vml2x vmware based machines
  • done - vmlax vmware based machines
  • done - vmlbx ovirt based machines
  • done - zkl001-zkl002 zks
  • done - vml003 - vml004
  • done - usl604-usl606

2017-12-04 to 2017-12-05

system maintenance

the acc7 chassis has some defects, requiring disassembly. For this reason expect that we will need the complete maintenance window and that core services like acc7 and nfs won't be available for a longer time.

2017-10-16 to 2017-10-20

Oracle Upgrade to 12c (accdbp, accdbu, accdbt)

2017-07-17 to 2017-07-21

Maintenance window complete.

system maintenance including NFS migration to new hardware. All user facing systems will be down. This includes

  • acc6pro asl330-asl334
  • acc6dev asl730-asl734
  • acc6file asl430-asl432, fsl00c
  • acc7pro asl340-asl344
  • acc7dev asl740-asl744
  • webservices websvcpro, websvcdev, packages, olog, www-acc, www.acc, artifacts, builder, ...
  • virtual machines
  • vmware server
  • zks
  • ...

We will stop all user facing services and start with the migration of fsl00c, which includes the home directories. Once the migration of NFS is complete we will continue with general operating system upgrades. Next we will need to upgrade our storage system fabric.

2017-06-29

dns alias fcmw00a will be removed

2017-03-20 to 2017-03-21

system maintenance complete

2017-01-06 to 2017-01-20

Datacenter relocation. All systems will be down.

Yes that is a time frame of two weeks. And yes you can expect issues once the systems are online again.

For anyone asking if systems are up again, the piggy bank is located at br2.2.152

Systems complete:

  • gsi oracle databases
  • gsi ords application servers (cdb)
  • zks
  • www-acc.gsi.de (wiki, subversion, bugzilla)
  • acc6 (asl330-asl334, asl730-asl734)
  • nfs server (fsl00c)
  • acc7 (asl340-asl344, asl740-asl744)
  • websites: websvcpro, websvcdev, packages, olog, ...
  • vmware machines (vmla...)
  • ovirt machines (jenkins, builder.acc.gsi.de)
  • vms (axp, bel.gsi.de)

Change of ACC uplink. Network unavailable from 7:00 to 7:30

2016-11-30 oracle migration

ACC Oracle Databases will be migrated to new storage systems. Databases will be unavailable.

2016-08-29 maintenance week

maintenance complete.

Known Issues:

  • x-win32@acc6: x-win32 is incompatible with Red Hat Enterprise Linux 6.8.
    If you need to connect to acc6 (asl73x, asl33x) please use a workaround
    • use cygwin
    • connect to acc7 using x-win32 and connect to acc6 via ssh and x forwarding.
    • a newer x-win32 version will fix this, but is not (yet) available at gsi
    • new version of x-win32 available at softwarecenter

2016-06 jenkins

the buildserver should be migrated to new hardware at the end of June?

2016-03-15 reset scuxl

all scus will be reset

2016-03-14 maintenance week

maintenance complete.

Starting Monday 14. march to Friday 18.

Software updates and reboots of all services.

status 2016-03-15:
  • acc6 done; asl73x, asl33x, asl43x
  • webserver done, wsl00x
  • logstash done; usl30x
  • timing done; tsl001
  • usl60x done
  • vml00x done
  • vml200x done
  • wincc done; psl004
  • anything zkl related

2016-03-07 fsl00t retirement

NFS Server fsl00t will be retired.

2016-01-25 maintenance week

Starting Monday 25. Jan to Friday 29

System maintenance complete

  • fibre channel core switches relocated
  • 12 servers moved
  • over 150 cables laid
  • over 50 servers updated

2016-01-04 subversion structure

subversion will be restructured.

The repository bel will be frozen.

Commit all changes before and prepare to create new workspaces.

See also SubversionStructure

status 2016-01-04:
  • new repositories created
  • bel renamed to bel-archiv
  • bel-archiv is read only
  • some repository data migrated (including history)
  • please migrate other data (excluding history) using svn export and svn import

2015-12-07 maintenance week

Starting Monday 07. Dec to Friday 11

System maintenance. Expect major service interruptions.

Status:
  • wsl00x done -> www-acc, wiki, subversion
  • asl43x done -> NFS, home,
  • asl33x done -> acc6-PRO, artifacts
  • asl73x done -> lsa server
  • usl30x done -> logstash
  • asl102 done -> interlock
  • usl602 done -> buildserver
  • usl603 done -> virtual scu
  • tsl001 done -> timing
  • zkl00x done -> zks
  • vml00x done -> fesa build, rpmbuild, etc

pending:
  • zks terminals reboot (updates are done)

maintenance week closed.

2015-11-30 webserver migration

Starting 10:00, expecting 4h of downtime.

migration of www-acc.gsi.de and www.acc.gsi.de to new hardware.

failure of wiki, subversion, bugzilla and other webservices using these domains.

Migration is complete. https certificate and kerberos tickets are working.

2015-07-23 Upgrade Artifacts

Upgrade artifacts.acc.gsi.de. Starting 10:00 expecting 2h downtime

2015-05-04

Starting Monday 04. May to Friday 08

System maintenance. Expect major service interruptions.

System updates. Network Firmware updates. Will shutdown acc6 and acc5 cluster for a few hours.

Status update 2015-05-04: network and storage switches are patched, acc5 and acc6 are patched. Database systems will be patched on Tuesday

Status update 2015-05-05: maintenance complete

2015-04-07

Default java will switch to java 8. Typing "java" will result in a java 8 runtime.

Default eclipse version will be luna. Start it with "eclipse-luna". The alias "eclipse", currently pointing to kepler, will be removed.

2015-03-16

Starting Monday 16. Mar to Friday 20

System maintenance. Expect major service interruptions.

Depending on completion of electric power installation, the current plans include physical movement of storage systems and bladecenters to new racks.

Interruption will include acc5 (asl72x) acc6 (asl73x), webservers (wsl00x), fileservers (fsl00c, fsl00t), network boot, oracle databases (acc and gsi), zks, ...

Status update 2015-03-18:

Blade enclosures and storage systems have been moved. Most user facing services/machines are still powered down. We expect to restore main services (acc5 and acc6) on Thursday.

Status update 2015-03-19:
  • acc6 is up and running
  • acc5 has a hardware defect, expected to be running friday
  • other services (jenkins, logstash) are powered off
Status update 2015-03-20:
  • acc6 is up and running
  • acc5 is up and running
  • other services (jenkins, logstash) are powered off

2015-01-12

Starting Monday 12. Jan to Friday 16. System maintenance. Expect major service failures.

Current plans include physical movement of servers and blade centers. Expect hours/days of downtime for linux cluster acc5, acc6, NFS servers fsl00t, fsl00c, tftp services, etc.

Status update 2015-01-13: Software updates are mostly complete. Waiting for electric installation to finish before moving servers.

Status update 2015-01-19: Electric installation not completed. New maintenance window in March

2014-11-11

all el6 systems received a subversion upgrade to 1.7. For details see Subversion

2014-10-23

For security reasons caused by the ssl poodle bug, SSLv3 has been deactivated on our webservers. This affects the subversion connection of eclipse-indigo on the acc5 cluster. Subversion access on acc5 is only possible using the command line client.

2014-08-06

new java version 1.7.0_67 on acc6. Check your eclipse settings.
This update fixes security issues and a java webstart bug.
Starting with this release eclipse settings should be stable during upgrades.

2014-07-24

acc6-file had a failure from 16:40 to 16:50. This affected for example all home filesystems on acc6-pro and acc6-dev.

One of the cluster protocols was lacking redundancy. Configuration changes to solve this issue froze a cluster service and a reboot was required. Sorry for the interruption. Reboot of all acc6 clusters (file, pro, dev) is complete.

2014-05-26

System maintenance complete

2014-05-21

2014-05-21, 2014-05-22, 2014-05-23, 2014-05-26 system maintenance. Expect service failures. Date changed: maintenance now includes Friday.

2014-05-13

new java version 1.7.0_55 on acc6. Check your eclipse settings

2014-05-08

webserver certificates for www-acc.gsi.de and www.acc.gsi.de have been refreshed. To update subversion's cache use svn info https://www-acc.gsi.de/svn/bel

2014-03-26

new java version 1.7.0_51. Check your eclipse settings.

2013-11-25 System updates.

2013-11-25 from 08:00 to estimated 2013-11-28 18:00

Affected: all machines and services

2013-08-26

2013-08-26 from 08:00 to estimated 2013-08-26 14:00

Affected: Blade chassis blc292 with the hosts asl73x, psl00x

Modification of enclosure network uplink.

2013-07-08

System update.

Period: 2013-07-08 from 08:00 to estimated 2013-07-10 16:00

Affected: servers maintained by IN. Complete outage of acc6, webservers and databases. Service interruptions for all other IN services: clusters (acc5, acc6), webservers (subversion, wiki), build system (maven repository, jenkins).

2013-05-22

Migration to JDK7 on all systems.

2013-04-08

System update.

Period: 2013-04-08 from 08:00 to estimated 2013-04-10 16:00

Affected: servers maintained by IN. Complete outage of acc6, webservers and databases. Service interruptions for all other IN services: clusters (acc5, acc6), webservers (subversion, wiki), build system (maven repository, jenkins).

Update 2013-04-08 14:00:
A firmware upgrade in the SAN failed. For today we are bringing all services back up and will take another look at the upgrade tomorrow. Expect further outages.

2013-01-14

System update.

Period: 2013-01-14 from 08:00 to estimated 2013-01-16 16:00

Affected: servers maintained by IN. The clusters (acc5, acc6), webservers (subversion, wiki), build system (maven repository, jenkins), as well as network operation.

2013-01-07

Due to maintenance work on the air conditioning, all IN servers will be shut down.

Period: 2013-01-07 from 16:00 to estimated 2013-01-08 16:00

Affected: all servers maintained by IN. All clusters (axp, acc5, acc6), webservers (subversion, wiki), build system (maven repository, jenkins)

We will try to keep the network running.

2012-12-11

Due to maintenance work on the air conditioning, all IN servers will be shut down.

Period: 2012-12-11 from 16:00 to estimated 2012-12-12 16:00

Affected: all servers maintained by IN. All clusters (axp, acc5, acc6), webservers (subversion, wiki), build system (maven repository, jenkins)

We will try to keep the network running.
Topic revision: r209 - 31 Jan 2024, ChristophHandel