From automatic code recognition to fully automated code snippet pasting; no more missing PRE tags!
category 'tool', language C#, created 11-Dec-2009, version V2.2 (24-Feb-2010), by Luc Pattyn
This article also appeared on the CodeProject.
License: The author hereby grants you a worldwide, non-exclusive license to use and redistribute the files and the source code in the article in any way you see fit, provided you keep the copyright notice in place; when code modifications are applied, the notice must reflect that. The author retains copyright to the article, you may not republish or otherwise make available the article, in whole or in part, without the prior written consent of the author.
Disclaimer: This work is provided
Wouldn't it be nice if you could just copy and paste any code snippet you like into a CodeProject edit box without worries, and get your PRE tags for free, including the required language code? And get your < and > signs properly HTML-encoded? And, while we're at it, get your tabs reduced so the indentation is minimal? And of course get a literal paste when the clipboard holds regular text rather than code.
This article presents a way to identify the main programming language, if any, in a text snippet. It could be used in all CodeProject editor boxes, whether editing an article, a tip&trick, a quick answer question/answer, or a forum message. The article also provides a test bench so one can experience the ease of use, and judge the quality of language recognition.
Programmers don't have a difficult time telling a C# method apart from a VB.NET method or from a chapter of plain text; even a smaller code snippet is often sufficient to tell what language it is or is not. Compilers can easily parse source code in the language they support, and judge how well the source adheres to the required syntax, provided they are offered a compilation unit, i.e. an entire file, with proper include/using/Imports statements and everything, and without too many errors. Our goal is somewhat different though: we want to recognize snippets, not files.
Here are some of the issues that could arise, especially with the buggy code that we encounter daily:
After a little research on this topic, I decided not to go for one or several full parsers. Instead I read the input once and tokenize it lightly, perform some statistical analysis, and use several criteria to determine the match or mismatch between the content of a snippet and the rules of each of a set of programming languages. Each criterion is awarded a score, which is normalized to be proportional to the number of lines in the snippet; all partial scores are then summed to get the final score for that language. These are the criteria I have chosen:
_ - $ #).
As soon as some other character is encountered (such as parenthesis, equal sign, ...) the scan is suspended for that line.
All identifiers up to that point typically should be keywords, except for the last one which could be the name of a variable
or method.
Each such identifier that matches a keyword is awarded some positive points, which makes this score proportional to the size
of the snippet.
Example: in private int count=a*b; we would discover three identifiers and expect the first two to be keywords.
BTW: some languages, in particular VB.NET, have a different ordering, as in Dim count As Integer = 4, so we have
to allow for that too.
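The start-of-line scan described above can be sketched as follows. This is a minimal illustration, not the article's actual code: the class and method names are hypothetical, the keyword sets are tiny hand-picked subsets, and treating `_ - $ #` as the extra identifier characters is an assumption based on the text. It also simply counts keyword matches rather than awarding weighted points.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the "identifiers at the start of a line" criterion.
public static class LeadingKeywordScan
{
    // Assumption: letters, digits and _ - $ # may appear in an identifier.
    static bool IsIdentifierChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_' || c == '-' || c == '$' || c == '#';
    }

    // Collects the identifiers at the start of a line; the scan is suspended
    // at the first character that is neither whitespace nor an identifier
    // character (parenthesis, equal sign, ...).
    public static List<string> LeadingIdentifiers(string line)
    {
        List<string> result = new List<string>();
        int i = 0;
        while (i < line.Length)
        {
            char c = line[i];
            if (c == ' ' || c == '\t') { i++; continue; }
            if (!IsIdentifierChar(c)) break;
            int start = i;
            while (i < line.Length && IsIdentifierChar(line[i])) i++;
            result.Add(line.Substring(start, i - start));
        }
        return result;
    }

    // Counts the leading identifiers that appear in the given keyword set.
    // Non-keywords (such as a variable name) are skipped, whatever their position,
    // which also tolerates the VB.NET ordering.
    public static int CountLeadingKeywords(string line, ICollection<string> keywords)
    {
        int count = 0;
        foreach (string id in LeadingIdentifiers(line))
            if (keywords.Contains(id)) count++;
        return count;
    }
}
```

With a keyword set containing `private` and `int`, the line `private int count=a*b;` yields three leading identifiers and two keyword matches. With a case-insensitive set containing `Dim`, `As` and `Integer`, the line `Dim count As Integer = 4` yields four identifiers and three matches.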
Once all the languages have been scored, the highest score determines the result; and the distance between the topmost
two scores is converted into a trust value (0=low trust, 99=high trust).
If the highest score is too low, or the trust is too low, a recognition failure
is signaled by returning null, which means "consider it text, not code";
otherwise the best matching language is returned.
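The final decision step might look roughly like the sketch below. The thresholds (`MinScore`, `MinTrust`) and the way the gap between the two best scores is mapped onto 0..99 are assumptions made for illustration; the article does not state the actual values or formula, and the names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: pick the best-scoring language and derive a trust value.
public static class ScoreDecision
{
    const double MinScore = 10.0;  // assumed: below this, nothing looks like code
    const int MinTrust = 20;       // assumed: below this, the winner is not clear enough

    // Returns the name of the best-scoring language,
    // or null meaning "consider it text, not code".
    public static string Decide(IDictionary<string, double> scores, out int trust)
    {
        string best = null;
        double top = double.NegativeInfinity;
        double second = double.NegativeInfinity;
        foreach (KeyValuePair<string, double> pair in scores)
        {
            if (pair.Value > top) { second = top; top = pair.Value; best = pair.Key; }
            else if (pair.Value > second) second = pair.Value;
        }

        trust = 0;
        if (best == null || top < MinScore) return null;

        // Assumed mapping: the gap between the two best scores, relative to the
        // best score, scaled to 0 (low trust) .. 99 (high trust).
        double runnerUp = Math.Max(second, 0.0);
        trust = (int)Math.Min(99.0, 99.0 * (top - runnerUp) / top);
        return trust < MinTrust ? null : best;
    }
}
```

Under these assumptions, scores of 80 for "cs", 20 for "vb" and 5 for "xml" give "cs" with a trust of 74, whereas 40 for "cs" against 38 for "vb" gives a trust of 4 and therefore null.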
Code | Language | Comments |
cs | C/C#/C++/C++.CLI/Java/JavaScript | syntax is similar |
vb | VB/VB.NET/VBscript | syntax is similar |
xml | XML/HTML/ASP.NET | syntax is similar |
sql | SQL | All keywords in T-SQL |
midl | MIDL | |
php | PHP | All keywords in PHP4+PHP5 |
css | CSS | A few keywords only (no coloring on CP) |
msil | .NET IL | |
asm | x86 assembly | CP list (which is not complete) |
Some code snippets cannot be pinned down to one exact language; e.g. the similarity between C, C++, Java and C# is such that lots
of code snippets would look identical in some or all of those languages. That is why we use some "language groups",
which implies the keyword lists of those languages should be merged for syntax coloring; that is no big deal as those languages
have very similar keyword sets anyway, so the worst that could happen is some identifier getting colored although it is not
really a keyword in the actual language (e.g. int public = 4; in a C snippet).
The selected criteria seem to be sufficient for identifying the language or language group for most practical snippets. However, as we are using tokens and statistics rather than full parse trees, there may be ways to occasionally get a skewed result; to name a few:
int multiply(int a, int b) {
    /*
    If Me.InvokeRequired Then
        Me.Invoke(logHandler, New Object() {threadID, s})
    Else
        listBox.Add("multiply")
    End If
    */
    return a*b;
}

[System.FlagsAttribute()]
public enum DiffSectionType : byte
{
    Copy = 0
    ,
    Delete = 1
    ,
    Insert = 2
    ,
    Replace = 3
}
In the rare situation where code in one language gets recognized as code in some other language, the worst that can happen is that the syntax coloring will not be adequate, as it would be based on the wrong list of keywords. However when that happens, the author can always edit the language code in the opening PRE tag to fix things. Remember, the logic presented here should be applied only while pasting; once pasted, the text can still be edited, e.g. HTML tags could be added to apply bold or italics to some parts of the pasted code.
In some cases, snippets have to be handled as text, even when they look like code, or vice versa. So it is probably wise to provide a three-way choice (auto, code, text) to enforce a specific behavior; obviously I would recommend "auto" as the default. "text" could be used to copy-paste an existing code block, with PRE tags, and possibly bold and italics tags, already present. And "code" could be used for anything that does not get recognized as code, but deserves the background color and the non-proportional font that come with PRE tags.
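The three-way choice could be wired into the paste action along the lines of the sketch below. All names are hypothetical, and the `lang` attribute on the opening PRE tag is an assumption about how the language code is written; the recognition result is passed in as a plain string (null when nothing was recognized) so that the example stays self-contained.

```csharp
using System;

// Hypothetical paste modes matching the auto/code/text choice.
public enum PasteMode { Auto, Code, Text }

public static class PasteFormatter
{
    // Encodes the characters that would otherwise be read as HTML markup.
    // The ampersand must be replaced first, or it would re-encode the other entities.
    public static string HtmlEncode(string text)
    {
        return text.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // languageCode: the recognized language code such as "cs", or null when unknown.
    public static string Format(string clipboardText, PasteMode mode, string languageCode)
    {
        // "text" always pastes literally; "auto" does so when no language was recognized.
        if (mode == PasteMode.Text) return clipboardText;
        if (mode == PasteMode.Auto && languageCode == null) return clipboardText;

        string encoded = HtmlEncode(clipboardText);
        // "code" with an unknown language still gets PRE tags, just without a language code.
        if (languageCode == null) return "<pre>" + encoded + "</pre>";
        return "<pre lang=\"" + languageCode + "\">" + encoded + "</pre>";
    }
}
```

For example, `PasteFormatter.Format("if (a<b) x=1;", PasteMode.Auto, "cs")` returns `<pre lang="cs">if (a&lt;b) x=1;</pre>`, while the same call with a null language code returns the text unchanged.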
The environment used is the Microsoft .NET Framework (version 2.0 or above) and the C# programming language. The test bench consists of a Form with:
Note: the major Controls can easily be resized as they are anchored to a couple of SplitContainers.
The major parts of the code are:
LanguageXxx classes that inherit from the base class Language:
each holds the characteristics of a single programming language or a group of languages with similar syntax;
a static List<Language> AllLanguages keeps track of all instances,
and a public CreateAllLanguages() method creates and initializes those instances.
From a user's perspective, and assuming all required languages have been provided,
all that is relevant is the Name property, which
returns a string with the language name, such as "cs" for C#.
LPCodeRecognizer: a static class with one important method:
// Gets the Language used in some text.
// (input) string text: the text or code snippet, with line breaks (any of \r, \n, \r\n), not HTML-encoded.
// (input) List<Language> languages: the language descriptions.
// (output) int trust: a number in the range [0,100] indicating how trustworthy the result is
// (return) Language: the Language describing the main programming language, or null when unknown/undecided
public static Language Recognize(string text, List<Language> languages, out int trust);
Most of the code is simple character and string manipulation;
regular expressions are not used as I needed a lot of state, and wouldn't know
how to handle that efficiently using the Regex class.
The Untabber.ReduceLeftTabs() method is optionally called before text gets pasted into the TextBox. It checks
the text line by line, and determines how many tabs could be removed at the left hand side, then removes them.
It ignores empty lines, and does not remove any spaces, as it does not know how many spaces correspond to a tab in the
original source.
The net result is that typical code snippets get moved as far to the left as possible,
reducing the overall width when pasted inside PRE tags. The test bench has a checkbox to optionally disable this feature.
The HtmlConverter.ToHtmlDocument() method converts the TextBox content to an HTML document for display
in the WebBrowser and optional dumping with the "Dump HTML" button. The converter mimics to some extent what a
CodeProject editor window does; most importantly it turns newline characters into break tags. It does not
filter unacceptable words or HTML tags, nor does it deal with smileys or hyperlinks.
Its only purpose is to give a simple preview in the WebBrowser.
A few simple rules suffice to give an extremely good text/code decision and a good estimate as to what the main programming language is in a code snippet. Applying LPCodeRecognizer to the paste action in a message editor can yield fully automated code pasting: adding PRE tags, recognizing the language, HTML encoding, etc. are all automated. And whenever the author isn't satisfied he can still edit the text or code afterwards.
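As an illustration of the tab reduction performed before pasting, a minimal sketch could look like this. It is not the article's Untabber class: the class name is hypothetical, whitespace-only lines are treated as empty (an assumption), and all line breaks are normalized to \n in the result.

```csharp
using System;

// Hypothetical sketch of removing the leading tabs that all non-empty lines share.
public static class UntabSketch
{
    public static string ReduceLeftTabs(string text)
    {
        // Accept any of \r, \n, \r\n as line break.
        string[] lines = text.Replace("\r\n", "\n").Replace('\r', '\n').Split('\n');

        // Find the smallest number of leading tabs over all non-empty lines.
        int common = int.MaxValue;
        foreach (string line in lines)
        {
            if (line.Trim().Length == 0) continue;   // empty lines do not count
            int tabs = 0;
            while (tabs < line.Length && line[tabs] == '\t') tabs++;
            common = Math.Min(common, tabs);
        }
        if (common == int.MaxValue || common == 0) return text;   // nothing to remove

        // Remove that many leading tabs from every line; spaces are never touched.
        for (int i = 0; i < lines.Length; i++)
        {
            int tabs = 0;
            while (tabs < common && tabs < lines[i].Length && lines[i][tabs] == '\t') tabs++;
            lines[i] = lines[i].Substring(tabs);
        }
        return string.Join("\n", lines);
    }
}
```

A snippet whose lines start with two and three tabs ends up with zero and one tab respectively; a snippet in which any non-empty line starts without a tab (for example one indented with spaces) is returned unchanged.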
Let us hope something like this gets built into all the CodeProject editors (starting with the forum editor), and missing PRE tags soon become a thing of the past!
Perceler
Copyright © 2012, Luc Pattyn
Last Modified 21-May-2025