Read Protobuf Serialization of StanfordNLP Output in Python

Question

I would like to output the StanfordNLP results in protobuf (since its size is much smaller) and read the results back in Python. How should I do that?

I followed the instructions here to output the results serialized with ProtobufAnnotationSerializer, like this:

java -cp "stanford-corenlp-full-2015-12-09/*" \
edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit \
-file input.txt \
-outputFormat serialized \
-outputSerializer \
edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer

Then I used protoc to compile CoreNLP.proto, which comes with the StanfordNLP source code, into a Python module like this:

protoc --python_out=. CoreNLP.proto

Then in Python I read the files back like this:

import CoreNLP_pb2
doc = CoreNLP_pb2.Document()
doc.ParseFromString(open('input.txt.ser.gz', 'rb').read())

The parsing fails with the following error message:

---------------------------------------------------------------------------
DecodeError                               Traceback (most recent call last)
<ipython-input-213-d8eaeb9c2048> in <module>()
      1 doc = CoreNLP_pb2.Document()
----> 2 doc.ParseFromString(open('imed/s5_tokenized/conv-00000.ser.gz', 'rb').read())

/usr/local/lib/python2.7/dist-packages/google/protobuf/message.pyc in ParseFromString(self, serialized)
    183     """
    184     self.Clear()
--> 185     self.MergeFromString(serialized)
    186 
    187   def SerializeToString(self):

/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/python_message.pyc in MergeFromString(self, serialized)
   1092         # The only reason _InternalParse would return early is if it
   1093         # encountered an end-group tag.
-> 1094         raise message_mod.DecodeError('Unexpected end-group tag.')
   1095     except (IndexError, TypeError):
   1096       # Now ord(buf[p:p+1]) == ord('') gets TypeError.

DecodeError: Unexpected end-group tag.

UPDATE:

I asked the author of the serializer, Gabor Angeli, and got the answer. The protobuf objects are written to the files with writeDelimitedTo in this line. Changing it to writeTo makes the output files readable in Python.
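To illustrate the difference between the two methods: writeDelimitedTo prefixes each message with its byte length encoded as a base-128 varint, while writeTo emits only the bare message bytes. The sketch below shows that framing in pure Python 3. It is an illustration only, not CoreNLP or protobuf library code, and the helper names encode_varint, decode_varint, and split_delimited are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at data[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]  # indexing bytes yields an int on Python 3
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_delimited(data):
    """Split a length-delimited stream into raw message payloads."""
    payloads = []
    pos = 0
    while pos < len(data):
        size, pos = decode_varint(data, pos)
        if pos + size > len(data):
            raise ValueError("truncated message")
        payloads.append(data[pos:pos + size])
        pos += size
    return payloads


# Stand-in payloads; real ones would be serialized Document messages.
framed = b"".join(encode_varint(len(p)) + p for p in (b"abc", b"x" * 300))
payloads = split_delimited(framed)
```

Each element of payloads could then be handed to ParseFromString on a fresh CoreNLP_pb2.Document(), which expects the message bytes without the length prefix.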

That may also be an issue (not sure which version was used to generate the .java files), but see my answer first because that is a problem too.
– sberry, Commented Sep 11, 2016 at 6:09
@sberry: The proto file does not include a proto version specification, and the compiler compiled it as proto2, which is correct because the proto file has the "optional" keyword.
– shaoyl85, Commented Sep 11, 2016 at 6:10
@sberry: I got the answer from asking Gabor and I put it in the post. Thank you still for helping me :)
– shaoyl85, Commented Sep 11, 2016 at 18:59

Gabor Angeli · Accepted Answer · 2016-12-04 22:12:55Z

This question seems to have come up again, so I figured I'd write up a proper answer. The root of the issue is that the proto is written using Java's writeDelimitedTo method, which Google has not implemented for Python. A workaround is to read the proto file with the following function (assuming the file is not gzipped; if it is, replace f.read() with the appropriate code to decompress it):

from google.protobuf.internal.decoder import _DecodeVarint
import CoreNLP_pb2

def readCoreNLPProtoFile(protoFile):
  protos = []
  with open(protoFile, 'rb') as f:
    # -- Read the file --
    data = f.read()
    # -- Parse the file --
    # In Java, there's a parseDelimitedFrom() method that makes this easier
    pos = 0
    while (pos < len(data)):
      # (read the proto)
      (size, pos) = _DecodeVarint(data, pos)
      proto = CoreNLP_pb2.Document()
      proto.ParseFromString(data[pos:(pos+size)])
      pos += size
      # (add the proto to the list; or, `yield proto`)
      protos.append(proto)
  return protos
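For the gzipped case mentioned above, one possible replacement for the plain f.read() step is sketched below. It checks for the two-byte gzip magic number and decompresses only when it is present. The function name read_maybe_gzipped is hypothetical, and whether a given CoreNLP output file is actually compressed is an assumption to verify against your own files.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_maybe_gzipped(path):
    """Return the file's bytes, decompressing if it looks gzip-compressed."""
    with open(path, "rb") as f:
        head = f.read(2)
    if head == GZIP_MAGIC:
        with gzip.open(path, "rb") as f:
            return f.read()
    with open(path, "rb") as f:
        return f.read()
```

The returned bytes can be used in place of data in the function above, so the length-prefix loop stays unchanged.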

The file CoreNLP_pb2 is compiled from the CoreNLP.proto file in the repo with the command:

protoc --python_out /path/to/output/ /path/to/CoreNLP.proto

Note that as of writing this (version 3.7.0) the format is proto2, not proto3.

Zhen Tian · Answer · 2021-12-29 02:50:52Z

There is a simple solution in Go, assuming the raw data is in "data" and is parsed into "msg" (note that this reads only the first length-delimited message in the data):

import (
   "google.golang.org/protobuf/proto"
   "google.golang.org/protobuf/reflect/protoreflect"
   "google.golang.org/protobuf/encoding/protowire"
)

func CoreNLPUnmarshal(data []byte, msg protoreflect.ProtoMessage) error {
    bs, n := protowire.ConsumeBytes(data)
    if n < 0 {
        return protowire.ParseError(n)
    }
    return proto.Unmarshal(bs, msg)
}

R. Marolahy Over a year ago

The question requires an answer in Python
