29

Per kristinalustig's request, I'm posting a standalone MSE report for all the different languages being guessed incorrectly via the new code block labeling feature that rolled out as part of the copy-with-attribution feature this week:

Could you please create a separate meta post bug report where these misidentified language issues can be collected so that we can triage separately? It's probably a larger issue than we can tackle immediately and I don't want it to get lost in the shuffle.

Here is a list of all the languages recorded so far as being guessed incorrectly, with one example of each:

Please feel free to edit this table to include examples of languages not yet listed here (only one entry per language, please)

27
  • Note--I'm told ColdFusion uses 'default' so it will guess wildly for every code block in a tag by that title. It may not be a good example to use for this report, but I'm not a ColdFusion expert so I'll defer to people who know that language. Commented Nov 6 at 17:32
  • 7
    I don't think building a table is worthwhile. It's a long-standing issue with the syntax highlighter when it is forced to guess the code. It's (mostly) not down to the language but the tags used. If more than one tag defines a syntax hint, then any code block without an explicit hint for it, will fall back to the guess mode. And when guessing, it doesn't matter what the language in the code block is. JS code can be detected as Java, or as Lisp, or as other stuff, depending on what the code present is. Commented Nov 6 at 17:34
  • And yes - for tags where the type hint is "default" the same applies. ColdFusion and .NET seem to do that. Commented Nov 6 at 17:34
  • 1
    The powershell tag defines the syntax hint as lang-bash. But I don't think there is specific PowerShell highlighting. Using lang-powershell seems to just resolve as lang-none. Commented Nov 6 at 17:37
  • 1
    @VLAZ I admit the table will get exhaustively long eventually but I'm hopeful given the "in-your-face" nature of the bug now that this new copy feature exists that the table existing and getting bigger and bigger will either push staff to remove the labels or actually fix the underlying problem. Commented Nov 6 at 17:39
  • 1
    There are numerous Python-related questions that don't have a Python tag, neither generic nor version-specific. Eg, Pandas 23,494, Django 142,155, Numpy 7,571. Though with Django, that's kind of understandable: meta.stackoverflow.com/q/320277 Commented Nov 6 at 18:06
  • 1
    Quite an abundance of SCSS and others here: stackoverflow.com/questions/26648227/… (I think it is supposed to be Java, however, the tags/question does not specify) Commented Nov 7 at 21:10
  • Question about Language tagged in the first column: is this the 'dominant' tag (in case of more than one language, for example javascript+html)? Commented Nov 9 at 11:22
  • 1
    The issue is wider than incorrect recognition of tags from the question. The syntax highlighter may simply not have the "correct" language. It will use the wrong one even when you tag your blocks explicitly, which I always do. Commented Nov 9 at 15:27
  • 1
    @GSerg To clarify, it will only use the wrong language when you tag your code blocks with a language highlight.js doesn't have a language highlighting rule set for. If you explicitly tag your blocks with, e.g. JS, then they will correctly display as JS even if the question has a c++ tag only. Commented Nov 10 at 14:09
  • 2
    Reproducing this comment: "@kristinalustig when will the fix be fixed? It's literally wrong ~80% of the time in my >9k answers. I cannot afford to go back and fix. See for e.g. this mis-identifies c++ as ini, cpp, scss, rust, rust, less, php, php, ruby, cpp, cpp, scss, less, dart,lisp, cpp, rust, php .... shell, haskell. That's all in a single answer tagged c++" ¯\_(ツ)_/¯ Commented Nov 12 at 15:20
  • 1
    @Wolf Yes, it would be important to handle that, but I think that's a broader discussion, and a secondary one; not only would it be better served by something like a 'parent language/technology' tag (aka tag system redesign) but also I think just getting it right for a single tag first is important, as most questions aren't about multiple languages... once they have it working for Qs with one language tag, then they can look at how to handle multiple. Commented Nov 12 at 15:23
  • 1
    @AJM I rolled back your edit because that's not an example of the problem here; in fact it's actually working correctly there. The problem here is when a question is tagged with a particular language tag, or a tag that sets a certain language for syntax highlighting, and then the code blocks within that question automatically use completely different languages for syntax highlighting. In your example the blocks are using C++ syntax highlighting for the Visual C++ tag, which is correct/expected. If those code blocks aren't in C++ they should use lang-none or whatever lang is appropriate. Commented Nov 21 at 20:44
  • 1
    [This question](stackoverflow.com/q/44836551/5320906) has a (correct) Python tag but the code in the question (and one answer) is marked as scss; another answer marks the code as kotlin; I added a Python fence to the accepted answer while fixing another issue, but it was originally marked as perl. Commented Nov 22 at 12:17
  • 1
    @AJM It's really not the same thing, I promise you. Putting a language tag on a question which contains code for said language, and then the system assigning a language other than the language/language tag used, is not the same thing as putting a language tag on a question, and then intentionally including code blocks in a language other than the language that was tagged. The latter cannot ever be handled by the system because it's a human intentionally 'tricking' the system into being wrong, whereas the former can and should always be handled correctly by the system. Commented Nov 24 at 18:52
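
The fallback behaviour described in the comments above can be sketched roughly as follows. This is only an illustration of what the commenters report, not the site's actual code: the function name, the `GUESS` marker, and the treatment of "default" hints are assumptions made for the example.

```python
from typing import Optional, Sequence

# Hypothetical marker meaning "let the highlighter auto-detect the language".
GUESS = "guess"


def resolve_highlight_language(block_hint: Optional[str],
                               tag_hints: Sequence[str]) -> str:
    """Pick the language for one code block, per the behaviour described above."""
    # An explicit hint on the code block itself always wins.
    if block_hint:
        return block_hint
    # Tags whose hint is "default" carry no usable information (assumption
    # based on the ColdFusion and .NET remarks in the comments).
    usable = [hint for hint in tag_hints if hint != "default"]
    # Exactly one tag defines a hint: use it for the unhinted block.
    if len(usable) == 1:
        return usable[0]
    # No usable hint, or more than one tag defining a hint: fall back to guessing.
    return GUESS
```

Under this sketch, a question tagged with both javascript and html leaves every unhinted block to the guesser, which is where the mislabels come from.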

1 Answer

7

I strongly request that the language name be removed from the new copy block header, and that the attribution lines be given as plain text, not comments.

We now have a vast number of code blocks on Stack Overflow that are labelled with the wrong language. This can be confusing, especially to new coders, and it looks bad to the experts. It's almost like someone's vandalised these posts, adding random language names...

Adding the language name to the copy block and using it to choose the comment style is a nice idea, in theory. However, the error rate caused by incorrectly guessing the language is just too high. It was ok when the language guessing was only used to choose the code highlighting style. Incorrect guesses were annoying, but they didn't actually cause any significant problems.

On a related note, many SO answers include command lines related in some way to the main answer code. Answers may also include sample output in code blocks. Many authors didn't bother giving a language hint to these supplementary code blocks, but now there's a good chance that they have an incorrect language attached to them.

It's simply not necessary to put the attribution lines into comments. Just ensure that they're clearly separated from the actual code in the code block. Even new coders know how to turn plain text into comments, since comment syntax is one of the first things you learn in any language. ;) (Incidentally, in Python, attribution should normally go into the code's docstring, not a comment).
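
To illustrate, here is a minimal sketch of language-independent attribution. The function name, the separator, and the "Source"/"Author" fields are all hypothetical; the point is only that nothing here needs to know the language of the code block.

```python
def append_plain_attribution(code: str, source_url: str, author: str) -> str:
    """Return the code followed by a clearly separated plain-text attribution."""
    separator = "-" * 40  # visual break between the code and the attribution
    attribution = [
        separator,
        f"Source: {source_url}",
        f"Author: {author}",
    ]
    # Normalise trailing newlines so there is exactly one blank line
    # between the code and the separator.
    return code.rstrip("\n") + "\n\n" + "\n".join(attribution) + "\n"
```

Because no comment syntax is involved, an approach like this would work for every code block, including ones whose language can't be determined.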

In the current copy button implementation, if it can't determine the language (and hence the comment style) the code block doesn't get a copy button. That's sub-optimal.

On a related note, there's a big problem with code blocks containing JSON. Although JSON syntax is a subset of JavaScript, comments are not permitted in JSON.
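
A quick way to see the problem, using Python's standard `json` module. The attribution line and URL below are made up for the example; the assumption is that a JavaScript-style comment gets prepended to a JSON block on copy.

```python
import json

payload = '{"name": "example", "count": 3}'

# Hypothetical attribution line in JavaScript comment style, prepended on copy.
copied = "// Source: https://example.com/a/1\n" + payload


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

Here `is_valid_json(payload)` is true, while `is_valid_json(copied)` is false: the comment line alone is enough to make a strict JSON parser reject the copied text.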
