NSX-T manager fails to load? It might be that the Corfu DB got corrupted

If you’re like me, and you are spinning up new nested labs left and right, you are also probably over-committing your VMFS datastore regularly.

In my case, I ran out of datastore space and it crashed my NSX-T manager. The same issue may also occur for other reasons. Either way, it shows up as being unable to log in to the NSX-T manager, which keeps reporting that the service is not ready.

[Image: NSX Manager error]

When running the command “get management-cluster status” on the NSX-T manager you may get:

Number of nodes in management cluster: UNKNOWN

Management cluster status: INITIALIZING

Number of nodes in control cluster: UNKNOWN
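If you want to check for this state from a script (for example after capturing the CLI output over SSH), here is a minimal sketch. It assumes the output has the plain "key: value" shape shown above; the function names are my own and not part of any NSX-T tooling.

```python
# Sketch only: parses text shaped like the output above.
# parse_cluster_status and looks_stuck are hypothetical helper names.

SAMPLE_STATUS = """Number of nodes in management cluster: UNKNOWN

Management cluster status: INITIALIZING

Number of nodes in control cluster: UNKNOWN
"""


def parse_cluster_status(status_text):
    """Collect 'key: value' lines into a dict, skipping blank lines."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def looks_stuck(status_text):
    """True when the cluster reports INITIALIZING with an UNKNOWN node count."""
    fields = parse_cluster_status(status_text)
    return (
        fields.get("Management cluster status") == "INITIALIZING"
        and fields.get("Number of nodes in management cluster") == "UNKNOWN"
    )
```

Calling looks_stuck(SAMPLE_STATUS) on the output above flags the manager as stuck. Other NSX-T versions may format this output differently, so treat the field names as something to verify on your own appliance.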

This problem can happen because the Corfu DB in NSX-T has failed to load. If you ran out of datastore space, the cause is almost certainly a corrupted record in the database.

So how do we identify and resolve this issue?

Follow these steps:

  1. SSH in to the NSX manager as the admin user.
  2. cd to the /config/corfu/log/ directory. Here you should see the log files, named serially (for example 280.log, 281.log, …).
  3. Recommended: take a backup of the folder using cp -R /config/corfu/log/ /config/corfu/log.backup
  4. The appliance includes a log reader tool. Use it to read the latest log, e.g. corfu_logReader display <log file name> (for example 281.log).
  5. If the DB is corrupt, the output (which might take a while to scroll by) will end with an error. It will look something like the following:

Rank: 0, UUID: 0 0

Checkpoint type: START, ID 59364361-ffa1-4feb-9b39-b70a38bade90

Record length 303 checksum 982ab0f8

Global address: 2821291

Log Entry streams (1):  05d2af53-4368-3308-b389-b8d246bb20f7

DataType: DATA

Data:

B\x0a\x02Y6Ca\xff\xa1O\xeb\x9b9\xb7\x0a8\xba\xde\x903\x7c\x2d\xd7Nf1f\x80\x9d\xe3\xba\x99O\xd08\x00\x172018\x2d01\x2d06\x2013\x3a18\x3a20\x2e435\x00\x06\x00\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e571\x03\x00\x011\x05\x00\x072821289\x04\x00\x011\x01\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e577\x02\x00\x02\x2d1\x00\x00\x00\x00\x00\x00

Rank: 0, UUID: 0 0

Checkpoint type: END, ID 59364361-ffa1-4feb-9b39-b70a38bade90

Record length 269 checksum 8b0b408f

Global address: 2821292

Log Entry streams (1):  21db9a36-dbc3-322a-b564-d229eb1acb17

DataType: DATA

Data:

B\x0a\x00\xccb\x08\x94q\xa2\x40\x85\x9f\xea\x2f\xb0\x7f\xea\xec\x1d\x0e9q\xa6\xb8\x8e11\xaa\x81\x25\x80\xcbj\xf5\xb0\x00\x172018\x2d01\x2d06\x2013\x3a18\x3a20\x2e435\x00\x03\x00\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e583\x05\x00\x072821291\x02\x00\x02\x2d1\x00\x00\x00\x00\x00\x00

Rank: 0, UUID: 0 0

Checkpoint type: START, ID cc620894-71a2-4085-9fea-2fb07feaec1d

Record length 304 checksum 0c456e5d

Global address: 2821293

Log Entry streams (1):  21db9a36-dbc3-322a-b564-d229eb1acb17

DataType: DATA

Data:

B\x0a\x02\xccb\x08\x94q\xa2\x40\x85\x9f\xea\x2f\xb0\x7f\xea\xec\x1d\x0e9q\xa6\xb8\x8e11\xaa\x81\x25\x80\xcbj\xf5\xb0\x00\x172018\x2d01\x2d06\x2013\x3a18\x3a20\x2e435\x00\x06\x00\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e583\x03\x00\x011\x05\x00\x072821291\x04\x00\x011\x01\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e589\x02\x00\x02\x2d1\x00\x00\x00\x00\x00\x00

Rank: 0, UUID: 0 0

Checkpoint type: END, ID cc620894-71a2-4085-9fea-2fb07feaec1d

Incorrect delimiter

com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).

at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:101)

at com.google.protobuf.CodedInputStream$ArrayDecoder.readTag(CodedInputStream.java:550)

at org.corfudb.format.Types$Metadata.<init>(Types.java:3703)

at org.corfudb.format.Types$Metadata.<init>(Types.java:3674)

at org.corfudb.format.Types$Metadata$1.parsePartialFrom(Types.java:4183)

at org.corfudb.format.Types$Metadata$1.parsePartialFrom(Types.java:4178)

at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:139)

at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:173)

at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:185)

at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:190)

at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)

at org.corfudb.format.Types$Metadata.parseFrom(Types.java:3878)

at org.corfudb.logReader.logReader.processRecord(logReader.java:301)

at org.corfudb.logReader.logReader.nextRecord(logReader.java:326)

at org.corfudb.logReader.logReader.processLogFile(logReader.java:358)

at org.corfudb.logReader.logReader.readAll(logReader.java:129)

at org.corfudb.logReader.logReader.run(logReader.java:62)

at org.corfudb.logReader.logReader.main(logReader.java:46)

You can see that the last global address successfully printed was 2821293, so the corruption starts at the record right after it.
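When the output is long, finding that last address by eye is tedious. A minimal sketch of how it could be pulled out of saved output follows; it assumes every good record prints a "Global address: N" line as in the output above, and the helper names are hypothetical.

```python
import re

# Matches the "Global address: N" lines that the log reader prints per record.
GLOBAL_ADDRESS = re.compile(r"Global address:\s*(\d+)")

# Shortened excerpt of the output above, kept only for illustration.
SAMPLE_OUTPUT = """Global address: 2821291
Global address: 2821292
Global address: 2821293
Checkpoint type: END, ID cc620894-71a2-4085-9fea-2fb07feaec1d
Incorrect delimiter
com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
"""


def last_global_address(reader_output):
    """Return the last global address printed, or None if there is none."""
    matches = GLOBAL_ADDRESS.findall(reader_output)
    return int(matches[-1]) if matches else None


def first_address_to_erase(reader_output):
    """The address right after the last one that was read successfully."""
    last = last_global_address(reader_output)
    if last is None:
        raise ValueError("no 'Global address' line found in the output")
    return last + 1
```

For the excerpt above, last_global_address(SAMPLE_OUTPUT) gives 2821293 and first_address_to_erase(SAMPLE_OUTPUT) gives 2821294, which is the value used in the erase step.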

6. Now we will try to erase the corrupt entry, starting from the address right after the last good one, using the following command: corfu_logReader erase 281.log --from=2821294

7. This should erase the corrupted entry. We can check this by displaying the file again. Now the tool should reach the end of the file without throwing any exceptions.

8. Now start the corfu server with /etc/init.d/corfu-server start and the web UI should come up fine.
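To keep the whole sequence in one place, here is a rough sketch that picks the highest-numbered log file and assembles the commands from the steps above as plain strings. It does not run anything. The helper names are my own, and the erase flag is written as --from on the assumption that the dash in the erase step stands for a double hyphen.

```python
LOG_DIR = "/config/corfu/log"


def latest_log(file_names):
    """Pick the highest-numbered N.log file, comparing names numerically."""
    numbered = [n for n in file_names if n.endswith(".log") and n[:-4].isdigit()]
    if not numbered:
        raise ValueError("no numbered .log files found")
    return max(numbered, key=lambda name: int(name[:-4]))


def repair_commands(log_name, last_good_address):
    """Return the commands from the steps above, in order, without running them."""
    return [
        f"cp -R {LOG_DIR}/ {LOG_DIR}.backup",       # back up the folder first
        f"corfu_logReader display {log_name}",      # find the last good address
        # Assumption: the flag is spelled --from (double hyphen).
        f"corfu_logReader erase {log_name} --from={last_good_address + 1}",
        f"corfu_logReader display {log_name}",      # verify no exception remains
        "/etc/init.d/corfu-server start",           # bring the service back
    ]
```

For example, repair_commands(latest_log(["280.log", "281.log"]), 2821293) produces the same erase command as in step 6. Comparing the file names as integers matters because a plain string sort would rank 99.log above 281.log.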

The best advice I can give any VMware admin: keep enough free space on your datastores!

 
