Ceph storage unavailable

Fixed

2 months ago —

EN: The RFO on this incident was published on April 20th - RFO in English (Dutch version: RFO in Nederlands)

Fixed

2 months ago —

EN: The incident and its exact cause will be investigated further in the coming days. One of the core developers of Ceph will contribute their expertise to this investigation. As soon as the investigation is complete, a definitive RFO will be published on www.bit.nl. It is certain that the incident could only occur because capacity was being added to the cluster at that moment. Until the cause is known, no capacity will be added, so there is no risk of the incident recurring.

Fixed

2 months ago —

EN: The cluster is operational again. The backfills and recoveries are done and all services are available again. Over the coming days some rebalancing will still take place in parts of the cluster. In isolated cases this may cause minor performance issues.

Watching

2 months ago —

EN: Cephfs is available again as well. Engineers are checking all services and, where necessary, restarting them, remounting file systems or performing other repairs.

Identified

2 months ago —

EN: Because PGs have been isolated and extra OSDs have been added to the cluster, Ceph is now recovering and backfilling. Large quantities of data are currently being moved and replicated. The Cephfs pools can be brought back online soon. Shared storage within the virtualisation platform runs on such a Cephfs pool, so virtual machines that do not use a Cephfs disk will already be functioning.

Identified

2 months ago —

EN: Most of the OSDs are available again. Some of the services are working again, but they may become unavailable again while work on the cluster continues.

Identified

2 months ago —

EN: A large part of the PGs that are now being isolated will not need to be restored separately, because other OSDs still have those PGs available. For PGs that cannot recover automatically, tests have shown that they can be imported separately. No data loss is expected as a result of this incident.
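The distinction made in this update can be illustrated with a small sketch. It is not a Ceph tool and uses no Ceph API: the function name, the PG ids and the OSD numbers are all hypothetical, and the only assumption is that a PG can recover on its own when at least one OSD still holds an intact copy of it.

```python
from typing import Dict, List, Set, Tuple


def classify_pgs(
    pg_replicas: Dict[str, List[int]],
    osds_with_intact_copy: Set[int],
) -> Tuple[List[str], List[str]]:
    """Split PGs into 'recovers automatically' and 'needs separate import'.

    pg_replicas maps a (hypothetical) PG id to the OSDs that held a copy.
    A PG is assumed to recover automatically if at least one of those
    OSDs still has an intact copy available.
    """
    automatic: List[str] = []
    manual_import: List[str] = []
    for pg_id in sorted(pg_replicas):
        if any(osd in osds_with_intact_copy for osd in pg_replicas[pg_id]):
            automatic.append(pg_id)
        else:
            manual_import.append(pg_id)
    return automatic, manual_import


# Illustrative data only: two PGs with three copies each.
example_replicas = {"1.0": [0, 3, 6], "1.1": [1, 4, 7]}
example_automatic, example_manual = classify_pgs(example_replicas, {3})
```

In this example PG 1.0 still has a copy on OSD 3, so it ends up in the automatic list, while PG 1.1 has no surviving copy and would have to be imported separately.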

Identified

2 months ago —

EN: More and more previously crashing OSDs are becoming available again. It will take some time before all OSDs are available. We are still investigating whether the isolated PGs can be made available again.

Identified

2 months ago —

EN: The problematic PGs are being isolated from the crashing OSDs, after which those OSDs are expected to start.

Identified

2 months ago —

EN: Some of the placement groups (PGs) are causing problems and are preventing certain OSDs from starting. When such a PG is removed from the OSD, the OSD does start. We are now investigating whether this applies to all crashing OSDs and how the isolated PGs can be added to an OSD again later. For background: when data is placed on the storage, objects are mapped to PGs, and those PGs are in turn mapped to OSDs.
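The two-step mapping described in this update (object to PG, PG to OSDs) can be sketched as follows. This is a deliberately simplified illustration and not Ceph's actual placement algorithm (Ceph uses CRUSH for the second step); the function names, the hash choice, the failure domain names and the OSD numbers are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def object_to_pg(object_name: str, pg_count: int) -> int:
    """Step 1: map an object name to a PG number by hashing (simplified)."""
    digest = hashlib.sha256(object_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % pg_count


def pg_to_osd_set(pg_id: int, osds_by_domain: Dict[str, List[int]]) -> List[int]:
    """Step 2: pick one OSD per failure domain for a PG.

    A stand-in for CRUSH: it only shows that every PG gets one copy
    in each failure domain, not how Ceph really selects the OSDs.
    """
    chosen: List[int] = []
    for domain in sorted(osds_by_domain):
        osds = osds_by_domain[domain]
        chosen.append(osds[pg_id % len(osds)])
    return chosen


# Hypothetical cluster layout: three failure domains with three OSDs each.
example_layout = {
    "domain-a": [0, 1, 2],
    "domain-b": [3, 4, 5],
    "domain-c": [6, 7, 8],
}
example_pg = object_to_pg("vm-disk-0001", pg_count=128)
example_osds = pg_to_osd_set(example_pg, example_layout)
```

Because every object in a PG lives on the same set of OSDs, one problematic PG can affect every OSD that holds a copy of it, which is consistent with the crashes described in this update.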

Identified

2 months ago —

EN: Restarting the OSDs has had no effect; they continue to crash. Together with external experts we are continuing the search for a cause and a solution.

Identified

2 months ago —

EN: All OSDs will be restarted in an attempt to get more OSDs available.

Identified

2 months ago —

EN: The Ceph monitoring daemons on all Ceph monitoring servers have been restarted. Next, an attempt will be made to make the OSDs available again.

Identified

2 months ago —

EN: A single OSD has become available again, but the other OSDs have not. Debugging tools are being used to investigate further what is causing the problem. An OSD is the object storage daemon for the Ceph distributed file system. It is responsible for storing objects on a local file system and providing access to them over the network.

Identified

2 months ago —

EN: Our own engineers and the external Ceph specialists are ruling out possible causes of the disruption. It is not yet clear what the actual cause is.

Identified

2 months ago —

EN: Together with the external specialists, the Ceph log files are being analysed. As soon as the conclusions of that analysis are known, the incident notice will be updated.

Identified

2 months ago —

EN: OSDs in all three failure domains have crashed and continue to crash when restarted. It is still unclear what is causing this. External expertise is now being called in.

Identified

2 months ago —

EN: All BIT services that use shared Ceph storage are affected by this incident. BIT engineers are investigating the cause of the disruption.

2 years ago —

EN: Shared Ceph storage is unavailable. Various services rely on this storage, including virtual machines and virtual datacenters. See the latest updates directly below.