Why pipes sometimes get "stuck": buffering (jvns.ca)
124 points by tanelpoder 3 hours ago | 32 comments

The solution is that buffered accesses should almost always flush after a threshold number of bytes or after a period of time if there is at least one byte, “threshold or timeout”. This is pretty common in hardware interfaces to solve similar problems.

In this case, the library that buffers in userspace should set an appropriate timer when it first buffers data. Good choices for the timeout parameter are: passed in as an argument; slightly below human scale (e.g. 1-100 ms); proportional to {threshold / bandwidth} (i.e. some multiple of the time it would take to reach the threshold at a certain access rate); or proportional to the target flushing overhead (e.g. spend no more than 0.1% of the time in syscalls).

Also note this applies to both writes and reads. If you do batched/coalesced reads then you likely want to do something similar, though this usually depends more on your data channel: you need some way to efficiently query or be notified of “pending data”, which your channel may not have if it was not designed for this use case. Again, it is pretty common in hardware to do interrupt coalescing and the like.
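For what it's worth, here is a minimal Python sketch of the "threshold or timeout" policy for writes. Everything in it is an assumption made for illustration: the class name `ThresholdOrTimeoutWriter`, the `sink` callable standing in for the real write syscall, and the default threshold and timeout values. It uses a `threading.Timer` helper thread rather than a signal-based alarm, and it arms the timer only when the buffer goes from empty to non-empty.

```python
import threading


class ThresholdOrTimeoutWriter:
    """Buffer writes; flush at `threshold` bytes, or `timeout` seconds
    after the first unflushed byte was buffered, whichever comes first."""

    def __init__(self, sink, threshold=4096, timeout=0.05):
        self._sink = sink              # hypothetical: any callable taking bytes
        self._threshold = threshold
        self._timeout = timeout
        self._buf = bytearray()
        self._timer = None
        self._lock = threading.Lock()

    def write(self, data):
        with self._lock:
            was_empty = not self._buf
            self._buf += data
            if len(self._buf) >= self._threshold:
                self._flush_locked()   # threshold flush also retires the timer
            elif was_empty and self._buf:
                # Empty -> non-empty transition: arm exactly one timer.
                self._timer = threading.Timer(self._timeout, self.flush)
                self._timer.daemon = True
                self._timer.start()

    def flush(self):
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        if self._timer is not None:
            self._timer.cancel()       # harmless if the timer already fired
            self._timer = None
        if self._buf:
            self._sink(bytes(self._buf))
            self._buf.clear()
```

With `threshold=8`, writing `b"abc"` and then `b"defgh"` produces one flush of eight bytes; a lone `b"x"` written afterwards goes out when the timer fires. A late timer can at worst cause an early flush, never a lost one.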

I think this is the right approach, but any libc setting automatic timers would lead to a lot of tricky problems because it would change expectations.

I/O errors could occur at any point, instead of only when you write. Syscalls everywhere could be interrupted by a timer, instead of only where the program set timers, or when a signal arrives. There's also a reasonable chance of confusion when the application and libc both set timers, depending on how the timers are set (although maybe this isn't relevant anymore... kernel timer APIs look better than I remember). If the application specifically pauses signals for critical sections, that impacts the I/O timers, etc.

There's a need to be more careful in accessing i/o structures because of when and how signals get handled.

Typical Linux alarms are based on signals and are very difficult to manage, and rescheduling them may have a performance impact since it requires thunking into the kernel. If you use io_uring with userspace timers, things can scale much better, but you still need tricks if you want to support a lot of fast small writes (e.g. above ~1 million writes per second, timer management starts to show up more and more, and you have to do some crazy tricks I figured out to get up to 100M writes per second).

You do not schedule a timeout on each buffered write. You only schedule one timeout on the transition from empty to non-empty that is retired either when the timeout occurs or when you threshold flush (you may choose to not clear on threshold flush if timeout management is expensive). So, you program at most one timeout per timeout duration/threshold flush.

The point is to guarantee data gets flushed promptly which only fails when not enough data gets buffered. The timeout is a fallback to bound the flush latency.
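The "at most one timeout per flush" bookkeeping can be checked with a small simulation. This is only a sketch in Python with made-up names (`simulate`, integer millisecond timestamps); it counts events and does no real I/O.

```python
def simulate(write_times_ms, sizes, threshold, timeout_ms):
    """Replay writes and return (flushes, timers_armed).

    flushes is a list of (time_ms, reason, bytes_flushed).
    """
    flushes = []
    timers_armed = 0
    buffered = 0
    deadline = None
    for t, n in zip(write_times_ms, sizes):
        # If the pending timeout expired before this write, it flushed first.
        if deadline is not None and deadline <= t:
            flushes.append((deadline, "timeout", buffered))
            buffered, deadline = 0, None
        if buffered == 0 and n > 0:
            deadline = t + timeout_ms      # empty -> non-empty: arm the one timer
            timers_armed += 1
        buffered += n
        if buffered >= threshold:
            flushes.append((t, "threshold", buffered))
            buffered, deadline = 0, None   # the flush retires the timer
    if deadline is not None:               # trailing data leaves via the timeout
        flushes.append((deadline, "timeout", buffered))
    return flushes, timers_armed
```

For example, four 4-byte writes at 0, 1, 2 and 3 ms followed by a 1-byte write at 1000 ms, with a 16-byte threshold and a 10 ms timeout, give two flushes (threshold at 3 ms, timeout at 1010 ms) and two armed timers for five writes.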

I think doing those timeouts transparently would be tricky under the constraints of POSIX and ISO C. It would need some cooperation from the application layer.

> Some things I didn’t talk about in this post since these posts have been getting pretty long recently and seriously does anyone REALLY want to read 3000 words about buffering?

I personally would.

It depends on the writing.

I've read that sometimes wordy articles are mostly fluff for SEO.

In case of this particular author, those 3000 words would be dense, unbuffered wisdom.

This is one of those things where, despite some 20+ years of dealing with *NIX systems, I know it happens, but always forget about it until I've sat puzzled over why I've got no output for several moments.

Buffers are there for good reason: it's extremely slow (relatively speaking) to print output on a screen compared to just writing it to a buffer. Printing something character-by-character is incredibly inefficient.

This is an old problem, I encounter it often when working with UART, and there's a variety of possible solutions:

Use a special character, like a new line, to signal the end of output (line-based).

Use a length-based approach, such as waiting for 8KB of data.

Use a time-based approach, and print the output every X milliseconds.

Each approach has its own strengths and weaknesses; which one works best depends on the application. I believe the article is incorrect when it says certain programs don't use buffering; they just don't use an obvious length-based approach.
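To make the line-based versus length-based distinction concrete, here is a small Python sketch. The `Capture` class and `make_stream` helper are hypothetical names; `Capture` stands in for the device or pipe and records only what actually reaches it. The `line_buffering` flag of `io.TextIOWrapper` gives the line-based policy, and the `buffer_size` of `io.BufferedWriter` gives the length-based one.

```python
import io


class Capture(io.RawIOBase):
    """Hypothetical stand-in for the device: records each chunk written to it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def write(self, b):
        self.chunks.append(bytes(b))
        return len(b)


def make_stream(line_based):
    """Return (device, text stream) with either line-based or length-based flushing."""
    device = Capture()
    buffered = io.BufferedWriter(device, buffer_size=8192)   # length-based layer
    stream = io.TextIOWrapper(buffered, encoding="utf-8", newline="\n",
                              line_buffering=line_based)     # optional line-based layer
    return device, stream
```

With `line_based=True`, writing `"partial"` sends nothing to the device and writing `" line\n"` then pushes the whole line out. With `line_based=False`, both writes stay in the 8 KB buffer until it fills or `flush()` is called.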

Having a layer or two above the interface aware of the constraint works the best (when possible). The line-based approach does this but requires agreement on the character (new line).

Also, it's not just the work needed to actually handle the write on the backend - even just making that many syscalls to /dev/null can kill your performance.
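A rough way to see the syscall count difference without touching a real file descriptor is sketched below in Python. `CountingSink` and `count_raw_writes` are invented names; each call to the sink's `write()` is assumed to correspond to one `write(2)` syscall in a real program.

```python
import io


class CountingSink(io.RawIOBase):
    """Hypothetical stand-in for a file descriptor: counts write() calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.nbytes = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.nbytes += len(b)
        return len(b)


def count_raw_writes(buffered, n=10_000):
    """Write n single bytes and return (raw write calls, total bytes)."""
    sink = CountingSink()
    out = io.BufferedWriter(sink, buffer_size=8192) if buffered else sink
    for _ in range(n):
        out.write(b"y")
    out.flush()
    return sink.calls, sink.nbytes
```

Unbuffered, 10,000 one-byte writes mean 10,000 calls into the sink; with an 8 KB buffer in front, the same bytes arrive in a handful of calls.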

I've run into this before, and I've always wondered why programs don't just do this: when data gets added to a previously-empty output buffer, make the input non-blocking, and whenever a read comes back with EWOULDBLOCK, flush the output buffer and make the input blocking again. (Or in other words, always make sure the output buffer is flushed before waiting/going to sleep.) Wouldn't this fix the problem? Would it have any negative side effects?
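A sketch of the "flush before going to sleep" idea in Python, for a simple filter that copies input to output. The function name `relay` is made up, and instead of toggling `O_NONBLOCK` it polls with a zero-timeout `select()`, which answers the same "would this read block?" question. One reason to prefer that here: `O_NONBLOCK` lives on the open file description, so flipping it on stdin can also affect other processes that share the same pipe or terminal. This assumes a Unix-like system where `select()` works on pipes.

```python
import os
import select


def relay(in_fd, out):
    """Copy in_fd to out, flushing out whenever no input is immediately available."""
    while True:
        readable, _, _ = select.select([in_fd], [], [], 0)   # poll, do not wait
        if not readable:
            out.flush()                       # about to sleep: push out what we have
            select.select([in_fd], [], [])    # now block until input or EOF arrives
        chunk = os.read(in_fd, 65536)
        if not chunk:                         # EOF
            break
        out.write(chunk)
    out.flush()
```

While input keeps arriving, output is still batched through `out`'s buffer; the extra flush only happens when the loop would otherwise block with data pending. The cost is one extra `select()` call per read.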

In our CI we used to have some ruby commands that were piped to prepend "HH:MM:SS" to each line to track progress (because GitLab still doesn't support this out of the box, though it's supposed to land in 17.0), but it would sometimes lead to some logs being flushed with a large delay.

I knew it had something to do with buffers and it drove me nuts, but I couldn't find a fix; none of the solutions I tried really worked.

(Problem got solved when we got rid of ruby in CI - it was legacy).
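For comparison, a timestamp-prefixing filter that avoids the delayed-log problem could look like the Python sketch below. It is not the Ruby command from the comment above, and `stamp_lines` and its `clock` parameter are invented for illustration. The important part is the explicit flush after every line, since stdout is typically block-buffered when it is a pipe or a file.

```python
import time


def stamp_lines(src, dst, clock=time.localtime):
    """Prefix each line read from src with HH:MM:SS and flush dst after every line."""
    for line in iter(src.readline, ""):
        dst.write(time.strftime("%H:%M:%S", clock()) + " " + line)
        dst.flush()   # without this, lines sit in the buffer until it fills
```

As a filter it would be called as `stamp_lines(sys.stdin, sys.stdout)`. The producer upstream still has to flush its own output promptly, otherwise the timestamps only record when the lines finally arrived.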

TTY, console, shell, stdin/out, buffer, pipe: I wish there was a clear explanation somewhere of how all of those are glued/work together.

Here's a resource for at least the first one: https://www.linusakesson.net/programming/tty/

Feels like a missed opportunity for a frozen pipes joke.

Then again...

Frozen pipes are no joke.

> when you press Ctrl-C on a pipe, the contents of the buffer are lost

I think most programs will flush their buffers on SIGINT... But for that to work from a shell, you'd need to deliver SIGINT to only the first program in the pipeline, and I guess that's not how that works.

The last process gets sigint and everything else gets sigpipe iirc

No, INTR "generates a SIGINT signal which is sent to all processes in the foreground process group for which the terminal is the controlling terminal" (termios(4) on OpenBSD, other what passes for unix these days are similar), as complicated by what exactly is in the foreground process group (use tcgetpgrp(3) to determine that) and what signal masking or handlers those processes have (which can vary over the lifetime of a process, especially for a shell that does job control), or whether some process has disabled ISIG—the terminal being shared "global" state between one or more processes—in which case none of the prior may apply.

  $ make pa re ci
  cc -O2 -pipe    -o pa pa.c
  cc -O2 -pipe    -o re re.c
  cc -O2 -pipe    -o ci ci.c
  $ ./pa | ./re | ./ci > /dev/null
  ^Cci (2) 66241 55611 55611
  pa (2) 55611 55611 55611
  re (2) 63366 55611 55611

So with a "pa" program that prints "y" to stdout, and "re" and "ci" programs that are basically cat(1), except that all three print some diagnostic information and then exit when a SIGPIPE or SIGINT is received, this shows that (on OpenBSD, with ksh, at least) a SIGINT is sent to each process in the foreground process group (55611; the getpgrp value, also logged, is 55611 as well).

  $ kill -l | grep INT
   2    INT Interrupt                     18   TSTP Suspended
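The same effect can be reproduced without a terminal. The Python sketch below puts a few children into one process group and sends a single SIGINT to the group with `os.killpg()`, which approximates what the terminal driver does for the foreground process group on Ctrl-C. The helper names (`spawn`, `interrupt_group`) and the child script are invented for the example, and it assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys

# Child: restore the default SIGINT action, report readiness, then sleep.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGINT, signal.SIG_DFL)\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)


def spawn(pgid):
    """Start a child in process group `pgid` (0 means: become leader of a new group)."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        preexec_fn=lambda: os.setpgid(0, pgid),
    )
    child.stdout.readline()   # wait until the child has set up its signal handling
    return child


def interrupt_group(n=3):
    """Spawn n children in one process group, SIGINT the group, return exit statuses."""
    leader = spawn(0)
    members = [leader] + [spawn(leader.pid) for _ in range(n - 1)]
    os.killpg(leader.pid, signal.SIGINT)   # one signal, addressed to the whole group
    codes = []
    for member in members:
        codes.append(member.wait(timeout=10))
        member.stdout.close()
    return codes
```

Every child reports a return code of `-signal.SIGINT`, i.e. each one was terminated by the signal itself rather than by a SIGPIPE from a dying neighbour.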

That makes sense to me, but the article implied everything got a SIGINT, with the last program getting it first. Either way, you'd need a different way to ask the shell to do it the other way...

Otoh, do programs routinely flush if they get SIGINFO? dd(1) on FreeBSD will output progress if you hit it with SIGINFO and continue its work, which you can trigger with ctrl+T if you haven't set it differently. But that probably goes to the foreground process, so probably doesn't help. And, there's the whole thing where SIGINFO isn't POSIX and isn't really in Linux, so it's hard to use there...

This article [1] says tcpdump will output the packet counts, so it might also flush buffers, I'll try to check and report a little later today.

[1] https://freebsdfoundation.org/wp-content/uploads/2017/10/SIG...

Love it.

> this post is only about buffering that happens inside the program, your operating system’s TTY driver also does a little bit of buffering sometimes

and if the TTY is remote, so do the network switches! it's buffering all the way down.

> I think this problem is probably unavoidable – I spent a little time with strace to see how this works and grep receives the SIGINT before tcpdump anyway so even if tcpdump tried to flush its buffer grep would already be dead.

I believe quite a few utilities actually do try to flush their stdout on receiving SIGINT... but as you've said, the other side of the pipe may also very well have received a SIGINT, and nobody does a short-timed wait on stdin on SIGINT: after all, the whole reason you've been sent SIGINT is because the user wants your program to stop working now.

Maybe that's why my mbp sometimes appears not to see my keyboard input for a whole second even though nothing much is running.

AFAIK, signal order generally propagates backwards, so the last command run will always receive the signal first, provided it is a foreground command.

But also, the example is not a great one; grepping tcpdump output doesn't make sense given its extensive and well-documented expression syntax. It's obviously just used as an example here to demonstrate buffering.

> grepping tcpdump output doesn't make sense given its extensive and well-documented expression syntax

Well. Personally, every time I've tried to learn its expression syntax from its extensive documentation my eyes would start to glaze over after about 60 seconds; so I just stick with grep — at worst, I have to put the forgotten "-E" in front of the pattern and re-run the command.

By the way, and slightly off-tangent: if anyone ever wanted grep to output only some part of the captured pattern, like -o but only for the part inside the parentheses, then one way to do it is to use a wrapper like this:

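    #!/bin/sh -e

    # Assumed fix: take the pattern from the first argument; the rest go to grep.
    GREP_PATTERN="$1"; shift

    # Escape slashes so the pattern fits inside sed's s/.../.../ command.
    SED_PATTERN="$(printf '%s\n' "$GREP_PATTERN" | sed 's;/;\\/;g')"

    grep -E "$GREP_PATTERN" --line-buffered "$@" | sed -r 's/^.*'"$SED_PATTERN"'.*$/\1/g'

Not the most efficient way, I imagine, but it works fine for my use cases (in which I never need more than one capturing group anyway). Example invocation: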

    $ xgrep '(^[^:]+):.*:/nonexistent:' /etc/passwd

> grepping tcpdump output doesn't make sense given its extensive and well-documented expression syntax.

I dunno. It doesn't make sense in a world where everyone makes the most efficient pipelines for what they want; but in that world, they also always remember to use --line-buffered on grep when needed, and the line-buffered output option for tcpdump.

In reality, for a short term thing, grepping on the grepable parts of the output can be easier than reviewing the docs to get the right filter to do what you really want. Ex, if you're dumping http requests and you want to see only lines that match some url, you can use grep. Might not catch everything, but usually I don't need to see everything.

Learned two things: `unbuffer` exists, and “unnecessary” cats are just fine :-)

I like unnecessary cat because it makes the rest of the pipe reusable across other commands

Eg if I want to test out my greps on a static file and then switch to grepping based on a tail -f command

Nice article. See also: https://www.pixelbeat.org/programming/stdio_buffering/

It's also worth mentioning a recent improvement we made (in coreutils 8.28) to the operation of the `tail | grep` example in the article. tail now notices if the pipe goes away, so one could wait for something to appear in a log, like:

    tail -f /log/file | grep -q match

There are lots of gotchas to pipe handling really. See also: https://www.pixelbeat.org/programming/sigpipe_handling.html

Wow I had no idea this behavior existed. Now I’m wondering how much time I’ve wasted trying to figure out why my pipelined greps don’t show correct output

one of the reasons why i hate computers :D

Also made a post some time ago about the issue: https://world-playground-deceit.net/blog/2024/09/bourne_shel...

About the commands that don't buffer, this is either implementation dependent or even wrong in the case of cat (cf https://pubs.opengroup.org/onlinepubs/9799919799/utilities/c... and `-u`). Massive pain that POSIX never included an official way to manage this.

Not mentioned is input buffering, which gives you this strange result:

  $ seq 5 | { v1=$(head -1); v2=$(head -1); printf '%s=%s\n' v1 "$v1" v2 "$v2"; }

The fix is to use `stdbuf -i0 head -1`, in this case.
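The read-ahead behind this can be shown with a pipe in Python. This is a sketch with a made-up helper name (`first_line_and_leftover`), not a reimplementation of head: a buffered reader asked for one line pulls everything available out of the pipe, so a second reader of the same pipe finds nothing, while an unbuffered reader takes one byte at a time and leaves the rest.

```python
import os


def first_line_and_leftover(buffering):
    """Read one line through a file object, then see what is left in the pipe."""
    r, w = os.pipe()
    os.write(w, b"1\n2\n3\n4\n5\n")    # what `seq 5` would produce
    os.close(w)
    reader = open(r, "rb", buffering=buffering, closefd=False)
    first = reader.readline()
    leftover = os.read(r, 100)         # what a second reader of the pipe would get
    reader.close()
    os.close(r)
    return first, leftover
```

With the default buffering (`-1`) this returns `(b"1\n", b"")`; with `buffering=0` it returns `(b"1\n", b"2\n3\n4\n5\n")`, which is the behaviour `stdbuf -i0` is asking for.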

