Add new StreamReader.readuntil() method. #297

socketpair · 2015-11-29T19:53:07Z

No description provided.

gvanrossum · 2015-11-29T23:51:49Z

asyncio/streams.py

The opening """ should be on the same line as the first word of the docstring.

gvanrossum · 2015-12-01T18:05:26Z

Ping? Any updates?

socketpair · 2015-12-01T21:12:19Z

@gvanrossum pong. There is nothing to change in that PR as I think. If not - please say what should be done.

gvanrossum · 2015-12-03T22:27:18Z

Here's what I recommend: instead of adding an optional argument to readline(), leave readline() alone and add a new method readrecord() with a mandatory trailing separator. You can then change the policies of readrecord() to be more strict when it encounters premature EOF or a record that's too long (or you could ignore the limit -- after all read() and readexactly() ignore the limit too).

socketpair · 2015-12-08T20:25:06Z

This is pleriminary fix. I don't know if I should change readline code ever. Now, all test passed and no code duplication. Also, I use offset instead of merging/slicing buffers - this is faster than previous implementation.

So, I'm asking your to:

Decide, should I leave original readline implementation or not
make a quick code-review + English-related fixes.

After that I will change commit message, squash and so on.

I'm happy to change anything.

socketpair · 2015-12-09T21:12:05Z

@gvanrossum ping

gvanrossum · 2015-12-09T22:07:38Z

Sorry, I've been alternatingly busy and sick. But I'll get to this
eventually!

On Wed, Dec 9, 2015 at 1:12 PM, Коренберг Марк [email protected]
wrote:

@gvanrossum https://github.com/gvanrossum ping

—
Reply to this email directly or view it on GitHub
#297 (comment).

--Guido van Rossum (python.org/~guido)

1st1 · 2015-12-11T15:10:28Z

Why readrecord? When I think about records I assume that they have constant sizes, or are encoded in [header+body-len][body] format.

Maybe readuntil would be a better name?

socketpair · 2015-12-11T17:24:11Z

@1st1 agree. Tornado call it read_until. I wIll rename to same function, but without underscore, as you said.

1st1 · 2015-12-11T17:31:16Z

BTW, do we consume the separator? Should we add a bool flag to either leave it in the buffer or consume it?

socketpair · 2015-12-11T17:39:02Z

@1st1 could you provide case when we should preserve separator ?

I mean two different things - preserving it in internal buffer and prserving at the end of returned value.

1st1 · 2015-12-11T17:48:45Z

Does readrecord in your current patch consume the separator (i.e. will it be left for future consumption in the internal StreamReader buffer)? The docstrting doesn't mention that.

@1st1 could you provide case when we should preserve separator ?

I don't know any concrete use case for that, but I can imagine that being useful for separators more complex than \r\n. But I think this is something we can always add later.

socketpair · 2015-12-11T18:02:32Z

Yes, if complete record is read successfully, separator is removed both from buffer and from returned value.

In any case, it is guaranteed, that removed separator is the same as argument of the function, so data is not lost.

gvanrossum · 2015-12-11T18:07:42Z

It should consume (and return) the separator, similar to readline(). (The
one difference is at EOF -- IIUC if readuntil() sees EOF before seeing a
full separator it should raise EOFError (and also if it sees EOF
immediately -- similar to readexactly()).

On Fri, Dec 11, 2015 at 9:48 AM, Yury Selivanov [email protected]
wrote:

Does readrecord in your current patch consume the separator (i.e. will it
be left for future consumption in the internal StreamReader buffer)? The
docstrting doesn't mention that.

@1st1 https://github.com/1st1 could you provide case when we should
preserve separator ?

I don't know any concrete use case for that, but I can imagine that being
useful for separators more complex than \r\n. But I think this is
something we can always add later.

—
Reply to this email directly or view it on GitHub
#297 (comment).

--Guido van Rossum (python.org/~guido)

gvanrossum · 2015-12-11T18:08:19Z

Oh, I see, that's different from readline(). Do you have a reason for that?

On Fri, Dec 11, 2015 at 10:02 AM, Коренберг Марк [email protected]
wrote:

Yes, if complete record is read successfully, separator is removed both
from buffer and from returned value.

In any case, it is guaranteed, that removed separator is the same as
argument of the function, so data is not lost.

—
Reply to this email directly or view it on GitHub
#297 (comment).

--Guido van Rossum (python.org/~guido)

socketpair · 2015-12-12T11:23:44Z

@gvanrossum yes, I have reason. First, this function will be used to read message of unknown size from a stream. I.e multipart form data or end of http header or even end of http headers(\r\n\r\n). In all cases, separator is a some protocol defined delimiter, it have no meaning in data it service for.

Next, storing separator in returned value have meaning for readline api, since presense of newline at end of string is only way to detect if complete line is read. In our case, incomplete line reading is acknowledged by exception. the same for overlimited line. In other words, for my function, if something is returned, it is guaranteed that this is complete data.

Next, after reading of such chunk, users will validate, parse it or pass to struct.unpack. In both cases, user does not need separator and he will cut it. Even for logging it is required to cut newline.

So, my function consume separator, but not return it. Seems this is very suitable for all tasks. If not, please provide example. Yes, it is not traditional (tornado returns separator), but much more convenient. In extra case, user may manually append separator to returned value before next processing of data, since separator was passed as argument. But I'm consider this is very rare case or even impossible in real situations. I have experience in writing protocol implementations.

1st1 · 2015-12-12T15:46:15Z

@socketpair All good points.

OTOH, I still don't like this subtle difference between readline and readuntil. Which one of them returns the delimiter is something that I will constantly forget. Removing the delimiter is a matter of doing a simple slice operation. Tornado, by the way, doesn't strip the delimiter: http://www.tornadoweb.org/en/stable/iostream.html#tornado.iostream.BaseIOStream.read_until

socketpair · 2015-12-12T18:37:09Z

@1st1 I have renamed function

socketpair · 2015-12-12T18:45:37Z

@1st1 and @gvanrossum. Will you accept this new function with current separator behaviour?

socketpair · 2015-12-12T18:46:43Z

asyncio/streams.py

@1st1 and @gvanrossum Is this exception's name good enough? I have problems with naming and with my english.

gvanrossum · 2015-12-12T19:12:24Z

I am almost happy with this. Left a few comments. Thanks for persevering (and sorry for being so slow -- being alternatingly sick and busy is no fun).

socketpair · 2015-12-12T19:54:02Z

@gvanrossum you should not apologize, everyone understands your workload. :) I'm happy that you personally review my code.

1st1 · 2015-12-14T20:39:58Z

asyncio/streams.py

@gvanrossum @socketpair Why do we leave the data in the buffer? Suppose one implements a MIME parser, and tries to read until they see a boundary. Would it make sense for them to handle the LimitOverrunError, and instead of closing the connection, trying to recover with just reading the data? Is there a realistic use case for this?

The reason I'm asking this, is because I don't like that readline and readuntil will behave differently.

In previous implementation there was configurable limit. now it has gone. So, user may make a warning, and try to read with bigger limit, or, read by portions to skip that bad data...or, maybe read until another separator, trying to recover from error.

One of the most unwanted feature of readline - is to leave stream in unpredicatable state when such error occur.

I can remove that odd (as you think) behaviour, but in that case I should add closing of connection right before raising such exception. (at least, disable any readings and simulate EOF).

[..] and try to read with bigger limit, or, read by portions to skip that bad data...or, maybe read until another separator, trying to recover from error.

Unfortunately I haven't seen such an advanced code yet. Usually the code just crashes, and the connection gets closed :)

One of the most unwanted feature of readline - is to leave stream in unpredicatable state when such error occur.

Agree. But... The readline method before your PR did exactly that? Sorry, I'm a bit confused now.

Yes, readline before me just clears buffer (sometimes, depending on fact it found newline or not), and was not closing connection. You may see original, not so obvious code.

I think this is fine, even though it is an obscure border case. There is actually an alternative way to read that buffered data -- using plain read().

1st1 · 2015-12-14T20:45:33Z

@socketpair there are a bunch of questions to be resolved before merging this. Please read my comments. Sorry it takes so long for this PR, but it's a sizeable rewrite of the existing code + new methods. We have to be careful here. Thanks!

socketpair · 2015-12-14T21:13:55Z

OK, I understand you in your requirements, this is OK. My PR in nodejs tooks about 4 months :) and succeeded after that (so, I have discovered one other megabug while writing test on bug I found (assertion failed in C code))

1st1 · 2015-12-14T21:16:29Z

asyncio/streams.py

Please add a comment explaining what's going on here.

There was a comment, saying that this restore original behaviour of readline (see - I even did not touch original tests). Since "original behaviour" is unknown for people seeing that comment in future, I decide to remove comment.

I think the comment(s) should be restored. The only reason we have these two except clauses is for backwards compatibility, so we should state that. Otherwise in the future people may waste time trying to understand the reason or try to fix it and accidentally break code that depends on the old way. (Actually, the return e.partial is for compatibility with readline() methods on synchronous I/O classes; the backwards compatibility concern applies to deleting the consumed portion of the buffer only.)

(Alternatively, in the future people may legitimately decide to leave the excess data in the buffer for the LimitOverrunError case -- it helps that the only reason to not do that was backwards compatibility, not anything deeper.)

socketpair · 2015-12-20T16:10:17Z

Seems, I have fixed everything requested, except long comment and offset = buflen - seplen. Will it be merged in current state, or it is required to fix something ?

I speak about REQUIREMENTS, i.e. changes, without of which you will not merge.

1st1 · 2015-12-21T17:44:08Z

Seems, I have fixed everything requested, except long comment and offset = buflen - seplen. Will it be merged in current state, or it is required to fix something ?

I left you a comment re offset = buflen - seplen.

This function overcome ReadStream.readline() limitation on separator value. Also update docstrings of read*() functions and do some refactoring on them.

socketpair · 2015-12-21T18:50:16Z

Well, only question leaft is a big comment. Will you merge with that? Also, when I remove that +1 tests hang (!). I now discovering why.

1st1 · 2015-12-21T19:04:33Z

Well, only question leaft is a big comment. Will you merge with that?

We'll merge the PR, it's only a matter of time to fix the remaining nits!

Also, when I remove that +1 tests hang (!). I now discovering why.

That might be a good reason to keep that + 1. Or, maybe, it's a hint that there is another bug. Please investigate.

gvanrossum · 2016-01-04T22:45:41Z

Happy New Year everybody! There seems to be one open issue here (the investigation into why that +1 is needed). I will try to conduct a separate review. Then we can hopefully merge this. (Personally I have no problem with the big comment.)

gvanrossum · 2016-01-04T23:34:38Z

I looked into the reason for the + 1 and it is correct. The reason the test hangs is simple: It sets the limit to 7, fills the buffer with 8 bytes (none of which are the separator, which has size 1), and then expects LimitOverrunError to be raised. Using offset = bufsize + 1 - seplen, offset becomes 8, which is > 7, so the exception is raised. But without + 1, offset is only 7, which doesn't cause the exception. But the test neither sets EOF not adds more data, so then readuntil() simply hangs (there is no higher-level timeout in the test or in the stream code).

gvanrossum · 2016-01-04T23:46:13Z

asyncio/streams.py

Maybe we should not clear the buffer here? The same argument applies -- there's another way to read the rejected data if an app really wants it. That may be better than stuffing it in the exception.

@gvanrossum

Maybe we should not clear the buffer here?

It is compatibility with readexactly.

If LimitOverrun occurs, this may not mean some protocol violation (say, in HTTP there are no limits on length in various places). This exception is not "fatal", and we should preserve data in order to continue. For example, users of library may issue warning in logs and try to re-read with bigger limit. UPD: Specifying limit in that function was removed (but I do not agree with that, but remove it in order to be merged).

If IncompleteReadError occurs, this mean protocol violation (or sudden disconnection). That mean requested operation can never complete. Yes, someone may want to re-read data in different way, but I can not provide real use case (as opposed to LimitOverrun case). Partially fetched record can not be used in a general, so this exception is "fatal". If someone wants to inspect what data still have been fetched - it have ability to. If I remove buffer cleanup, and user wants to get that partial data, he will required to call .read() in a loop in order to fetch everything unread, since we can not get amount of data in buffer (and call readexactly). Moreover, it is common to use this function to read data line-by-line, and so users should distinguish between empty and non-empty data after this exception happen. In other words, having data inside THIS exception is very suitable.

gvanrossum · 2016-01-08T20:19:26Z

OK, you've convinced me. LGTM. Maybe Yury has some final remarks? Yury, do you want to merge?

1st1 · 2016-01-08T20:33:40Z

@gvanrossum Yes, I like this PR, and the new readuntil method. Please let me to experiment with the patch over the weekend and I can merge it on Monday.

@socketpair Would you be able to create an issue on bugs.python.org and draft a patch for Python documentation (update asyncio docs, whatsnew, etc)?

socketpair · 2016-01-08T20:38:50Z

@1st1

Thanks to spend time with my code :)
Yes, I able to. http://bugs.python.org/issue26050
Please point me which files I should update

asvetlov · 2016-01-08T20:47:41Z

I believe it's https://github.com/python/cpython/blob/master/Doc/library/asyncio-stream.rst for doc update

1st1 · 2016-01-11T17:16:56Z

Mark, I've committed your changes in 47fafbb.

Thank you very much for your hard work on this PR. Please upload a draft patch with documentation for readuntil and LimitOverrunError to http://bugs.python.org/issue26050.

socketpair · 2016-01-14T15:36:26Z

@1st1 Patch added http://bugs.python.org/file41595/asyncio-stream-doc-preliminary.patch

gvanrossum reviewed Nov 29, 2015
View reviewed changes

1st1 changed the title ~~ReadStream.readline() now supports custom separators~~ Add new StreamReader.readuntil() method. Dec 12, 2015

socketpair reviewed Dec 12, 2015
View reviewed changes

1st1 reviewed Dec 14, 2015
View reviewed changes

1st1 mentioned this pull request Dec 14, 2015

Consider adding limit parameters to StreamReader.readline and .readuntil. #304

Open

1st1 reviewed Dec 14, 2015
View reviewed changes

1st1 closed this Dec 21, 2015

1st1 reopened this Dec 21, 2015

Introduce ReadStream.readuntil()

9926391

This function overcome ReadStream.readline() limitation on separator value. Also update docstrings of read*() functions and do some refactoring on them.

gvanrossum reviewed Jan 4, 2016
View reviewed changes

1st1 closed this Jan 11, 2016

socketpair deleted the separ branch January 14, 2016 15:38

Uh oh!

Add new StreamReader.readuntil() method. #297

Add new StreamReader.readuntil() method. #297

Uh oh!

Conversation

socketpair commented Nov 29, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gvanrossum commented Dec 1, 2015

Uh oh!

socketpair commented Dec 1, 2015

Uh oh!

gvanrossum commented Dec 3, 2015

Uh oh!

socketpair commented Dec 8, 2015

Uh oh!

socketpair commented Dec 9, 2015

Uh oh!

gvanrossum commented Dec 9, 2015

Uh oh!

1st1 commented Dec 11, 2015

Uh oh!

socketpair commented Dec 11, 2015

Uh oh!

1st1 commented Dec 11, 2015

Uh oh!

socketpair commented Dec 11, 2015

Uh oh!

1st1 commented Dec 11, 2015

Uh oh!

socketpair commented Dec 11, 2015

Uh oh!

gvanrossum commented Dec 11, 2015

Uh oh!

gvanrossum commented Dec 11, 2015

Uh oh!

socketpair commented Dec 12, 2015

Uh oh!

1st1 commented Dec 12, 2015

Uh oh!

socketpair commented Dec 12, 2015

Uh oh!

socketpair commented Dec 12, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gvanrossum commented Dec 12, 2015

Uh oh!

socketpair commented Dec 12, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

1st1 commented Dec 14, 2015

Uh oh!

socketpair commented Dec 14, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

socketpair commented Dec 20, 2015

Uh oh!

1st1 commented Dec 21, 2015

Uh oh!

socketpair commented Dec 21, 2015