I remember seeing a Github bot a couple weeks ago that strips out whitespace and adds a .gitignore file to a repo (I also remember this really rubbing some people the wrong way). This search indicates that it would probably be useful to have a linter bot running on Github for all the popular languages. It would find syntax errors, common misspellings, and compilation issues, and then submit pull requests to fix the issues.
I have no time to work on something like this myself, but I'm sure a lot of people would find it useful, especially if it acted as a "first defense" before deployment. Curious what other HN'ers think about this.
Hey, some of us actually use trailing whitespace! :-)
I use it to create useful indentation guides in Komodo. If the whitespace is stripped away, the indentation guides have gaps where there's a blank line in an indented block.
Maybe Komodo could use a different method to decide where to draw the lines that didn't depend on trailing whitespace. It looks like Sublime Text 2 has a different approach for its indentation guides - maybe the Komodo guys should look at that. But in the meantime I'm using Komodo as it actually works today, so the whitespace on blank lines is important. Let me keep it, please? :-)
I wouldn't mind stripping out trailing whitespace on nonblank lines - that wouldn't affect my precious indentation guides.
But wait a minute, what about Markdown? Two spaces at the end of a line to get a <br>, right? Does the bot skip Markdown files?
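To make the two exceptions above concrete, here is a rough sketch of a stripper that leaves whitespace-only lines and Markdown files alone. The function name, the extension check, and the sample input are my own assumptions for illustration; this is not how the bot in question works. It also assumes LF line endings.

```javascript
// Hypothetical helper: strip trailing whitespace only where it is safe to do so.
function stripTrailingWhitespace(text, filename) {
  // Markdown uses two trailing spaces as a hard line break, so leave those files alone.
  if (/\.(md|markdown)$/i.test(filename)) {
    return text;
  }
  return text
    .split("\n")
    .map((line) => {
      // A whitespace-only line may be feeding an editor's indentation guides; keep it.
      if (line.trim() === "") {
        return line;
      }
      // Otherwise drop spaces and tabs at the end of the line.
      return line.replace(/[ \t]+$/, "");
    })
    .join("\n");
}

const source = "function f() {  \n    \n    return 1;\t\n}";
console.log(JSON.stringify(stripTrailingWhitespace(source, "example.js")));
// "function f() {\n    \n    return 1;\n}"
```

The indented blank line survives, the trailing spaces and tab on the code lines do not, and anything named like a Markdown file is returned untouched.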
Finally, for the folks who have automatic whitespace removal in their editor settings... Please be careful: With this setting, you'll be very likely to make a commit that includes both significant code changes and a mass of whitespace changes in the same commit.
Those kinds of changes should be separated: one commit for the code itself, and a separate commit for the whitespace with a comment like "Whitespace cleanup, no code changes."
This allows people who diff the revision history to diff with whitespace significant most of the time, the only exception being when reviewing a whitespace-only change.
If you create a JS macro that is triggered on file open, with the contents "komodo.view.scimoz.indentationGuides = komodo.view.scimoz.SC_IV_LOOKBOTH", it should give you the indentation guides without the whitespace actually there. http://www.scintilla.org/ScintillaDoc.html#SCI_SETINDENTATIO... has some explanation of the possible values.
Is there any equivalent of the robots.txt standard for public code repositories? Being able to opt-in to certain bots might be helpful (opt-out being the default, of course).
Ah, aggressive trailing whitespace removal. That I can completely get behind. I've already got command-s bound to a custom macro that strips trailing whitespace in TextMate for myself and my co-workers; but this would be an even more inclusive solution.
This plugin is better than using the builtin vim 'list' command because it doesn't show an annoying highlight while you are typing in insert mode at the end of a line.
If you use Vim from the terminal to edit text (which I do), trailing whitespace shows up as big white blocks. It's slightly visually distracting, but it's really just an OCD thing.
Perhaps it would be more productive to advocate the use of local pre-commit hooks. Git makes it very easy to configure validation locally long before anything gets sent to Github.
Would be nice if Github provided better documentation and a selection of validation templates to include in new projects. This would better leverage the power of Git and its distributed nature than a bot running on Github.
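For anyone who wants to try the local route, here is a minimal sketch of what such a hook could check. The function names are made up, and the git runner is passed in as a parameter purely so the logic can be tried without a repository. It only flags trailing whitespace after visible text, so whitespace-only lines pass, and it assumes ordinary file names that git prints unquoted.

```javascript
// Return the 1-based numbers of lines that end in spaces or tabs after visible text.
function linesWithTrailingWhitespace(text) {
  const hits = [];
  text.split("\n").forEach((line, index) => {
    if (/\S[ \t]+$/.test(line)) {
      hits.push(index + 1);
    }
  });
  return hits;
}

// runGit is injected: it takes an array of git arguments and returns stdout as a string.
function checkStagedFiles(runGit) {
  // List files that were added, copied, or modified in the index.
  const names = runGit(["diff", "--cached", "--name-only", "--diff-filter=ACM"])
    .split("\n")
    .filter(Boolean);
  const problems = [];
  for (const name of names) {
    // ":path" asks git for the staged copy rather than the working-tree copy.
    const staged = runGit(["show", `:${name}`]);
    for (const lineNumber of linesWithTrailingWhitespace(staged)) {
      problems.push(`${name}:${lineNumber}: trailing whitespace`);
    }
  }
  return problems;
}

// Possible wiring for a real hook (.git/hooks/pre-commit, marked executable):
//   const { execFileSync } = require("child_process");
//   const problems = checkStagedFiles((args) => execFileSync("git", args, { encoding: "utf8" }));
//   problems.forEach((p) => console.error(p));
//   process.exit(problems.length > 0 ? 1 : 0);
```

A non-zero exit status from the hook makes Git abort the commit. If all you want is the stock whitespace check, `git diff --cached --check` already covers it without any custom code.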
I actually just replied with a suggestion that Github should implement some built-in functionality for simple error-checking. I think it's a great idea. It would be really helpful if you got a simple little list of notifications for a commit indicating possible error points.
I noticed another problem is that the highlighter will select the language name as well as the term which means <?php and (c) Copyright are shown instead of the actual mistake.
C# I can understand being so low, since it's almost always written in Visual Studio or MonoDevelop (both of which provide autocompletion). But how is JavaScript the next lowest?
Hmm, I guess it's because length is a commonly used function in Javascript, but not in other languages. In Python you do len(list) instead, so the word length is more likely to appear in comments and therefore less likely to be corrected.
Because the built-in .length property is frequently used, and will fail if misspelled.
Whereas C and C++ programs tend to have a lenght operator implemented by the programmer, and from there the error gets snowballed by IDEs and debuggers.
C# is also a compiled language, so you wouldn't be able to get anything to run with a typo like that hanging around. I find it surprising that there are so many commits making that mistake!
Being a compiled language isn't sufficient to prevent "no method" errors. It's completely possible for a compiled language to define methods at runtime or use duck typing.
Odd that compiled languages (C, C++, Java) are higher than some interpreted languages (PHP, Javascript). Of course, the search will match comments as well as code, so it may just mean they have better comments.
I can't think of anything in C or C++ that uses length off the top of my head. Size and Len, sure, but nothing that's length.
I would guess that most of the spelling errors get propagated through autocomplete. That's how most of the spelling errors in my code get there anyway.
For the record, Github's search index is wayyy out of date sometimes. The second user here is me and I deleted that user like two years ago: http://cl.ly/0y271f0T3G0X2J1L022E
Same here, I contacted support two times in two years and they said "we're working on it". Obviously, that isn't true and they just don't care about the outdated search index.
In JavaScript, if you check for a non-existent property on a variable (e.g. aVar.lenght vs aVar.length) it will return "undefined". So people often rely on this behaviour to check if something is an array or not (no comment on whether this is good or bad), with:
if (somethingThatMightBeAnArray.length) {
    // do things with array
}
So misspelling of length can be making a lot of code out there behave in an unexpected way.
The same pattern is widely used to test whether an array-like object is empty. Since a length of 0 is also "falsey", it evaluates as false when the array has no elements. A typo in this case would result in the tested array always being "empty".
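A quick way to see the silent failure described above is to run both spellings side by side. The variable names are just for illustration, and the stricter helper at the end is one possible alternative, not something the thread prescribes.

```javascript
const items = ["a", "b", "c"];

// Correct spelling: length is 3, which is truthy, so the branch would run.
const seenWithLength = Boolean(items.length);

// Misspelled: items.lenght is undefined, which is falsy, and nothing is thrown.
const seenWithTypo = Boolean(items.lenght);

console.log(items.lenght);   // undefined
console.log(seenWithLength); // true
console.log(seenWithTypo);   // false

// An explicit check that separates "is an array" from "has elements".
function isNonEmptyArray(value) {
  return Array.isArray(value) && value.length > 0;
}

console.log(isNonEmptyArray(items)); // true
console.log(isNonEmptyArray([]));    // false
```

Note that the explicit check is no protection against the typo itself: `value.lenght > 0` would still quietly evaluate to false.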
In a static language this would be flagged as an error. I assume something less than ideal happens in languages such as Ruby.
I once worked at a company where a very early piece of code had a typo "properites" instead of "properties". This misspelling became institutionalized, and was used throughout the codebase because it was deemed too expensive to fix. And this was with a static language (with good IDE refactoring support)!
In Ruby, and I think most dynamic languages, this type of typo is likely to raise an exception. It could hurt, but a simple test run is likely to discover it.
The way JavaScript (which is what is linked) handles this, as amirhhz described it, leads to silent errors which could turn out a lot worse.
There's actually no exception, it just returns "undefined" and the if statement fails. That's why it's such a deadly bug -- no exception, and path-dependent. Combine that with the async nature of JS and it's going to be a long night tracking that one down.
The whole software infrastructure was a scary house of cards. They were afraid that there was unknown code that might be depending on it. For example RESTful services in other departments that were not under our immediate control.
Yes, it was exactly like that: lots of code had grown around the "bug", and it was not immediately obvious what other software had come to depend upon it. "Little hairs", as Joel might say.
Obviously you have never worked on a code base with 1000s of developers. If you edit almost every file then basically everyone needs to stop writing new code while the change is made. Otherwise the merges others have to do are going to be a disaster.
Well, the problems with such a process are pretty well documented[1]. For comparison, at Google global refactorings are pretty common and usually painless, and there are even custom tools to support such changes (push them through code review, ensure no tests are broken, etc.).
I know the Microsoft process all too painfully. RI,FI,RI,FI,RI,FI,RI,RC,RTM,Ship It,Repeat.
But as to the "usually painless" at Google. So when does that pain happen?
Can you take me through the following scenarios: change a variable name, change a base class name that lots of people extend from, file renames?
How do you go about refactoring? Do everything at once? Breaking it into pieces? Do file rename then variable and base class renames? Or smallest piece at a time?
Once the refactoring is complete, how do you communicate the changes to others so that when they merge the code in they don't get too messed up? Or worse, undo something in the refactoring? (Also, a follow-up: is it better to do the big refactoring, so there is one big merge, or a bunch of little refactorings and lots of little merges across the spectrum?)
I guess the code change isn't the problem. Making a big change and getting people on the same page is much harder, especially when there are varying degrees of skill and experience on a project. And it's this stuff that is painful and leads to not wanting to do big refactorings at a lot of shops.
Hacker News has a short attention span and this probably won't be seen by many, but I'll try to answer your question nevertheless. There are several factors I'd like to mention.
1. Most importantly, the version control head is always the point of reference, and the burden of merging is on people who keep long-lived pending changes. This means that conflicts are resolved as soon as possible by a person who actually knows the context, instead of being postponed until a dreaded merge window. Ultimately, a programmer pursuing refactoring is only responsible for making sure it works on the head, and should announce the change so that others are prepared for merging.
2. There are some huge code bases at Google, but nowhere near the size of Windows. On the other hand, I'm sure that even Windows has to be separated into more or less decoupled components. When I doubted that you work on the same code with thousands of other programmers I was thinking in terms of components, not final products.
3. The cultural aspect shouldn't be disregarded. Code hygiene is encouraged at Google, and some people volunteer their 20% time to help with that. Moreover, there are some custom tools that make global refactorings much easier and safer.
EDIT: Eh, apologies if this sounded like chastising -- I didn't mean to. As a developer who's been trapped in "Windows-mindset" for many years, I wanted to try to inspire other Windows devs to try to use *nix-based solutions even if their only option is Windows development. Cygwin is in a very good spot right now -- it's achieved so much acceptance that even the most hardened institutions now allow it to be installed.
I had this problem as a junior dev when my English was weaker. The problem stems from the fact that 'height' is spelled with 'ht', but 'width' with 'th'. Since one often writes those words in conjunction, it is easy to mix the endings up. If you're then a non-native speaker and don't run spellcheck on your code, you might end up writing 'lenght' and 'heigth' quite a few times. I know I did :)
My experience is more with languages that are typically compiled and would report this error as an error fairly early on, so the coder would correct it long before checking the code in.
What's the trade-off of having "undefined" returned instead of having an error reported as soon as the code is loaded?
It prevents you from later defining a 'lenght' method and using it at runtime without a recompile.
For core methods like 'length', it seems silly to think that you'd want to redefine it. And indeed, it's usually counterproductive - that's why any experienced JavaScript dev will have coding conventions like "Don't muck with the prototypes of built-in objects."
But at the application layer, this can be really useful. Imagine you're adding a new field to a message deep in the storage system, and then you want to pass that along to a template in the rendered HTML. It's really useful to be able to do this without recompiling & restarting each individual server between the backend and the frontend, and just edit a few template files and have them automatically pick up any changes to backend data formats.
Ditto adding a new database column, if you're using an RDBMS - it's pretty handy to have your model objects instantly reflect the new field, instead of needing to manually add accessors to each of your model classes. Rails and Django are built on this principle.
Also, you have a versioning problem with statically-compiled code in a distributed system. Imagine that you add this new 'lenght' field to a backend message, and add it to the frontend, and they both compile & deploy. Now imagine that a message from an old backend hits a new frontend (it's not possible to upgrade a whole distributed system at once without downtime). What does the new frontend do with it? It needs a piece of data, but the backend had no idea that it had to provide that piece of data. The only thing it can do is return the equivalent of 'undefined'.
In C++/Java code, you usually deal with these by inventing frameworks. Google code, for example, is littered with
if (msg.has_new_field()) {
    run_long_complicated_ui_display_routine(msg.new_field());
} else {
    fall_back_to_old_behavior(msg.old_field());
}
checks. If you use a more dynamic language like Python, you can use language mechanisms to represent undefined values or fields that are defined at runtime. If you use a static language, you're stuck mimicking them with hashmaps and null.
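For comparison, here is roughly how the same old-backend/new-frontend case might look in a dynamic language, sketched in JavaScript. The message shapes and the render function are invented for the example; the point is only that a missing field reads as undefined, so no has_new_field() style helper is needed.

```javascript
// Hypothetical messages: an old backend omits newField, a new backend includes it.
const fromOldBackend = { oldField: "legacy value" };
const fromNewBackend = { oldField: "legacy value", newField: "new value" };

function render(msg) {
  // A missing property reads as undefined, which stands in for has_new_field().
  if (msg.newField !== undefined) {
    return `new UI: ${msg.newField}`;
  }
  // Fall back to the old behavior when the sender did not know about the field.
  return `old UI: ${msg.oldField}`;
}

console.log(render(fromOldBackend)); // old UI: legacy value
console.log(render(fromNewBackend)); // new UI: new value
```

The flip side is the one discussed elsewhere in this thread: a misspelled field name takes the fallback branch just as quietly as a genuinely missing one.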
Whether your language is compiled is not the issue; it's how you model objects and calling methods on them. In Smalltalk and other languages that take a message passing approach, doing a.b() sends a message "b" to object a, and the object can do anything it likes with that.
Now the normal (and optimized) route is to find the method on a’s method table and then call that, but if a doesn't have that method then a second method may be called to allow this to be handled. Once you have that sort of mechanism you can make ORM libraries that dynamically examine a schema at run time and generate accessor methods only as they are needed. Decorators, proxies, and many other patterns become wonderfully simple, and there are often many more opportunities for meta-programming at run time.
The downside is of course that it becomes harder to find errors when writing or compiling, but tight integration of your development environment with your runtime can help with this.
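JavaScript can approximate that fallback mechanism with a Proxy, so here is a small sketch of accessors being generated only when they are asked for. The schema list, makeRecord, and the getXxx naming convention are all assumptions made up for this example, not any particular ORM's API.

```javascript
// Hypothetical schema, standing in for columns discovered at run time.
const schemaColumns = ["id", "title", "length"];

function makeRecord(row) {
  return new Proxy(row, {
    // The get trap plays the role of the "second method" that handles a miss.
    get(target, prop, receiver) {
      if (prop in target) {
        return Reflect.get(target, prop, receiver);
      }
      // Synthesize getTitle()-style accessors on demand for known columns.
      const match = typeof prop === "string" ? /^get([A-Z]\w*)$/.exec(prop) : null;
      if (match) {
        const column = match[1][0].toLowerCase() + match[1].slice(1);
        if (schemaColumns.includes(column)) {
          return () => target[column];
        }
      }
      // Unlike a plain object, fail loudly for names the schema does not know.
      throw new TypeError(`No such field or method: ${String(prop)}`);
    },
  });
}

const record = makeRecord({ id: 7, title: "Typos", length: 3 });
console.log(record.getTitle()); // Typos
console.log(record.length);     // 3
```

With this wrapper, `record.lenght` throws a TypeError instead of returning undefined, which is one way to get some of the early error reporting back at run time.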
This reminds me of a US company I worked with that outsourced some of their service layer work to a company with heavy European influence. As a result, API methods also used the British spelling of certain words, e.g. getColour() or getFavourites(). Good times.
In the LaTeX editor that I'm using (WinEdt), I have a custom color highlighting that marks \rigth and \heigth in red+bold+strikeout, so I don't have to wait to compile and see a strange error to spot the mistake.
It'd be great if Github would scan your code for errors like these and just let you know they exist (in case you didn't want them to, which I would assume you wouldn't for the most part).
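A report-only scan like that is not much code. Here is a minimal sketch, assuming a hand-maintained word list (the entries are just the misspellings mentioned in this thread) and a plain substring match per line. It only reports and never rewrites anything.

```javascript
// Hypothetical word list; the misspellings are the ones mentioned in this thread.
const knownMisspellings = {
  lenght: "length",
  heigth: "height",
  rigth: "right",
  properites: "properties",
};

function findMisspellings(text) {
  const findings = [];
  text.split("\n").forEach((line, index) => {
    for (const [wrong, right] of Object.entries(knownMisspellings)) {
      // Case-insensitive substring match, so "Lenght" and "getLenght" are caught too.
      if (new RegExp(wrong, "i").test(line)) {
        findings.push({ line: index + 1, found: wrong, suggestion: right });
      }
    }
  });
  return findings;
}

const sample = "if (list.lenght) {\n  box.heigth = 10;\n}";
console.log(findMisspellings(sample));
```

Because it matches comments and identifiers alike, it will also flag the rare code base where "lenght" is used consistently on purpose, which is a good reason to notify rather than auto-fix.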
Well it is basically a list of bugs -- and a rather long one too.
Of course there are rare cases where "lenght" is a variable and that name is used in every instance but mostly, these are bugs in code that we all use.