If you’re like me and you are spinning up new nested labs left and right, you are probably also over-committing your VMFS datastore regularly.
That is what happened to me: I ran out of datastore space and it crashed my NSX-T manager. The same symptom may well have other causes. Either way, it shows up as not being able to log in to the NSX-T manager, which keeps saying that the service is not ready.
When running the command “get management-cluster status” on the NSX-T manager you may get:
Number of nodes in management cluster: UNKNOWN
Management cluster status: INITIALIZING
Number of nodes in control cluster: UNKNOWN
This problem can happen because the Corfu DB in NSX-T has failed to load. When the cause is running out of datastore space, it is almost certainly a corrupt record in the database.
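If you want to script a check for this symptom, here is a minimal sketch. It assumes you have already saved the output of the status command to a text file; the function name cluster_looks_stuck and the file name are my own placeholders, not anything that ships with NSX-T. It only looks for the two markers shown above.

```shell
#!/bin/sh
# Hypothetical helper, not part of the NSX-T appliance.
# Reads saved "get management-cluster status" output on stdin and
# succeeds (exit 0) if it contains the markers seen in this failure.
cluster_looks_stuck() {
    grep -Eq 'UNKNOWN|INITIALIZING'
}

# Example use (status.txt is an assumed file holding the saved output):
#   cluster_looks_stuck < status.txt && echo "manager is not ready"
```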
So how do we identify and resolve this issue?
Follow these steps:
1. SSH in to the NSX manager as the admin user.
2. cd to the /config/corfu/log/ directory. Here you should see the log files, named serially (for example 280.log, 281.log, …).
3. It is recommended to take a backup of the folder first: cp -R /config/corfu/log/ /config/corfu/log.backup
4. The appliance ships with a log reader tool. Use it to read the latest log, e.g. corfu_logReader display <log file name> (for example 281.log).
5. If the DB is corrupt, the output (which might take a while to scroll past) will end with an error. The output of this command will look something like the following:
Rank: 0, UUID: 0 0
Checkpoint type: START, ID 59364361-ffa1-4feb-9b39-b70a38bade90
Record length 303 checksum 982ab0f8
Global address: 2821291
Log Entry streams (1): 05d2af53-4368-3308-b389-b8d246bb20f7
DataType: DATA
Data:
B\x0a\x02Y6Ca\xff\xa1O\xeb\x9b9\xb7\x0a8\xba\xde\x903\x7c\x2d\xd7Nf1f\x80\x9d\xe3\xba\x99O\xd08\x00\x172018\x2d01\x2d06\x2013\x3a18\x3a20\x2e435\x00\x06\x00\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e571\x03\x00\x011\x05\x00\x072821289\x04\x00\x011\x01\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e577\x02\x00\x02\x2d1\x00\x00\x00\x00\x00\x00
Rank: 0, UUID: 0 0
Checkpoint type: END, ID 59364361-ffa1-4feb-9b39-b70a38bade90
Record length 269 checksum 8b0b408f
Global address: 2821292
Log Entry streams (1): 21db9a36-dbc3-322a-b564-d229eb1acb17
DataType: DATA
Data:
B\x0a\x00\xccb\x08\x94q\xa2\x40\x85\x9f\xea\x2f\xb0\x7f\xea\xec\x1d\x0e9q\xa6\xb8\x8e11\xaa\x81\x25\x80\xcbj\xf5\xb0\x00\x172018\x2d01\x2d06\x2013\x3a18\x3a20\x2e435\x00\x03\x00\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e583\x05\x00\x072821291\x02\x00\x02\x2d1\x00\x00\x00\x00\x00\x00
Rank: 0, UUID: 0 0
Checkpoint type: START, ID cc620894-71a2-4085-9fea-2fb07feaec1d
Record length 304 checksum 0c456e5d
Global address: 2821293
Log Entry streams (1): 21db9a36-dbc3-322a-b564-d229eb1acb17
DataType: DATA
Data:
B\x0a\x02\xccb\x08\x94q\xa2\x40\x85\x9f\xea\x2f\xb0\x7f\xea\xec\x1d\x0e9q\xa6\xb8\x8e11\xaa\x81\x25\x80\xcbj\xf5\xb0\x00\x172018\x2d01\x2d06\x2013\x3a18\x3a20\x2e435\x00\x06\x00\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e583\x03\x00\x011\x05\x00\x072821291\x04\x00\x011\x01\x00\x172018\x2d01\x2d06T13\x3a18\x3a29\x2e589\x02\x00\x02\x2d1\x00\x00\x00\x00\x00\x00
Rank: 0, UUID: 0 0
Checkpoint type: END, ID cc620894-71a2-4085-9fea-2fb07feaec1d
Incorrect delimiter
com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:101)
at com.google.protobuf.CodedInputStream$ArrayDecoder.readTag(CodedInputStream.java:550)
at org.corfudb.format.Types$Metadata.<init>(Types.java:3703)
at org.corfudb.format.Types$Metadata.<init>(Types.java:3674)
at org.corfudb.format.Types$Metadata$1.parsePartialFrom(Types.java:4183)
at org.corfudb.format.Types$Metadata$1.parsePartialFrom(Types.java:4178)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:139)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:173)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:185)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:190)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.corfudb.format.Types$Metadata.parseFrom(Types.java:3878)
at org.corfudb.logReader.logReader.processRecord(logReader.java:301)
at org.corfudb.logReader.logReader.nextRecord(logReader.java:326)
at org.corfudb.logReader.logReader.processLogFile(logReader.java:358)
at org.corfudb.logReader.logReader.readAll(logReader.java:129)
at org.corfudb.logReader.logReader.run(logReader.java:62)
at org.corfudb.logReader.logReader.main(logReader.java:46)
You can see that the last global address printed successfully was 2821293, so the corrupt record is the next one, 2821294.
6. Now we will try to erase the corrupt entry using the following command: corfu_logReader erase 281.log --from=2821294
7. This should erase the corrupted entry. We can check this by displaying the file again; the tool should now reach the end of the file without throwing any exceptions.
8. Now start the Corfu server with /etc/init.d/corfu-server start, and the web UI should come up fine.
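The lookup parts of the steps above can be sketched as a few small shell helpers. This is only a sketch under assumptions: the function names are my own, it assumes the usual grep, sort, tail and awk are available in the shell you use, that the display output was saved to a file, and that the flag is spelled with two plain ASCII hyphens (--from). The last helper only prints the erase command so you can review it; it does not run anything against the database.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not appliance tools.

# Print the highest-numbered NNN.log file name in the given directory.
latest_corfu_log() {
    ls "$1" | grep -E '^[0-9]+\.log$' | sort -n | tail -n 1
}

# Read saved "corfu_logReader display" output on stdin and print the
# last "Global address" value that was printed before the error.
last_global_address() {
    awk '/^Global address: [0-9]+/ { addr = $3 } END { if (addr != "") print addr }'
}

# Print (do not execute) the erase command, starting at the record
# right after the last good address.
suggest_erase() {
    echo "corfu_logReader erase $1 --from=$(( $2 + 1 ))"
}

# Example use (display.out is an assumed file with the saved output):
#   log=$(latest_corfu_log /config/corfu/log)
#   addr=$(last_global_address < display.out)
#   suggest_erase "$log" "$addr"
```

Printing the command instead of running it is deliberate: you get a chance to compare the address against what you saw on screen, and to confirm the backup exists, before erasing anything.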
The best advice I can give any VMware admin: keep enough free space on your datastores!