$70,000,000 (USD) – seventy million dollars: an absolutely meaningless estimate of the cost of an average mid-sized in-house enterprise business application. I say meaningless for two primary reasons. The first is that this number has almost no relation to the actual cost an average company would incur to create the software, and the second is that it has absolutely no relation to what an average business would buy the software for, or could sell it for. So where does this perfectly meaningless number come from? It is an estimate based on one of the most ill-conceived metrics in use in the software industry today: Lines of Code (LoC). There is an interesting insidiousness to the measurement: everyone seems to recognize it as inherently flawed, and yet its popularity never seems to waver. To really address this issue, we first need to understand why Lines of Code is such an ineffective metric for measuring a codebase. Once we understand why it is so bad, we need to ask why we keep using it: what information do we hope to convey with this metric? And finally, what other terms and metrics might convey that information without the pitfalls of the Lines of Code measurement?
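Where do figures like that come from? A common source is an effort model driven by line counts, such as Barry Boehm's Basic COCOMO. The sketch below uses the published organic-mode formula (effort in person-months = 2.4 × KLOC^1.05); the line count and per-person-month cost are hypothetical inputs, chosen only to show how mechanically a LoC figure turns into a dollar figure:

```python
# Basic COCOMO, organic mode: effort (person-months) = 2.4 * KLOC^1.05.
# The KLOC figure and monthly cost below are hypothetical illustration values.

def cocomo_effort_pm(kloc: float) -> float:
    """Estimated effort in person-months under Basic COCOMO, organic mode."""
    return 2.4 * kloc ** 1.05

kloc = 500.0             # hypothetical: a "mid-sized" half-million-line app
monthly_cost = 20_000.0  # hypothetical fully-loaded cost per person-month

effort = cocomo_effort_pm(kloc)
estimate = effort * monthly_cost
print(f"{effort:.0f} person-months -> ${estimate:,.0f}")
```

Change either hypothetical input and the "estimate" swings by tens of millions, which is precisely why such numbers deserve skepticism.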
The LoC metric is flawed in numerous ways, both subtle and obvious. To begin, I wish to address some of the most problematic aspects of using LoC as a code metric. The first and most obvious is that comparing the lines of code of two applications becomes completely meaningless if they are written in different languages, or even in the same language for different platforms. Languages vary in their verbosity: an algorithm that could be implemented in 20 lines of Perl might require 50 lines of C, 15 lines of Lisp, or a couple of hundred lines of assembly. Even comparing C with C, an application written for a desktop Linux system with gcc may be able to use libraries to write in a dozen lines what could take hundreds of lines when written for an embedded system without those libraries available.
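The same variance shows up within a single language, since LoC is also a function of coding style. A small hypothetical Python example: two functionally identical routines, one idiomatic and terse, one written in a deliberately explicit style several times its length:

```python
# Two functionally identical implementations of the same task: summing the
# squares of the even numbers in a list. The first is one line of logic; the
# second, in a more explicit step-by-step style, is several times longer.

def sum_even_squares_terse(xs):
    return sum(x * x for x in xs if x % 2 == 0)

def sum_even_squares_verbose(xs):
    total = 0
    for x in xs:
        is_even = (x % 2 == 0)
        if is_even:
            square = x * x
            total = total + square
    return total

data = [1, 2, 3, 4, 5, 6]
assert sum_even_squares_terse(data) == sum_even_squares_verbose(data)  # == 56
```

By a raw line count, the verbose version looks like four times as much "software", even though the two are interchangeable.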
The mention of libraries brings up an entirely different flaw in the LoC metric: the fundamental question of which lines you count. Linux is easily over 5 million lines of code, but how many people count the kernel in their measurements? What about all of the code in glibc? How many Java shops count the JVM in their LoC estimates? What about some obscure library? In general, most people wouldn't include the kernel, VM, or third-party libraries in their LoC estimates, but what about internal libraries? At what point does that code become close enough to count in the Lines of Code metric?
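Even once you settle on which files to count, the number still depends on which lines within them count. A toy Python sketch (the sample file and the three rule sets are hypothetical) gives three different answers for the same code, depending on whether blank lines and comments are included:

```python
# A toy demonstration of how the counting rules change the answer.
# The sample file below is a hypothetical seven-line module.

SAMPLE = '''\
# utility module
import os


def path_exists(p):
    # delegate to the standard library
    return os.path.exists(p)
'''

lines = SAMPLE.splitlines()

physical = len(lines)                               # every physical line
nonblank = len([l for l in lines if l.strip()])     # drop blank lines
logical = len([l for l in lines if l.strip()
               and not l.strip().startswith('#')])  # drop comments too

print(physical, nonblank, logical)  # prints 7 5 3
```

Three rule sets, three answers, more than a factor of two apart; scaled up to a real codebase, the choice of rule alone can swing the metric enormously.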
Let’s take a detour for a minute and assume that we have a clear-cut rule for deciding what goes into our “Lines of Code” count, and that we have agreed on some common reference platform, or at least that we have established what our platform is and our audience is technically savvy enough to understand the ramifications of that platform for the LoC metric. In this case, what information do we hope to convey with the measurement?
In general, I would posit that there are two primary pieces of information that we are trying to convey when we say how many Lines of Code we have. First, giving lines of code is an attempt at putting a capital value on the codebase; the implicit assumption here seems to be that there is some ill-defined but definite dollar figure that grows proportionally to the size of the codebase. The second piece of information that we try to communicate with the LoC measurement is the complexity of the application. The assumption here is that a program that has 1 million lines of code must have been harder to write, solve a more difficult problem, or have more features than some smaller program. Once again there is also an implicit assumption that it takes a better development team to write and manage a bigger codebase.
So, if we have achieved these feats of standardization that let Lines of Code be an apples-to-apples comparison, how well does it communicate the complexity and value of the code? The answer, unfortunately, is poorly. The reasons can be seen quite easily on closer examination.
So let us first look at value, and the assumption that the number of lines of code in an application is related in any way to the value of the codebase. The problem here is with implicit assumptions. In a business setting the value of code can be understood as being derived from A: the cost of creating that code, and B: the value the code has to the business. Although it would be easy to conclude that both of these are obviously related to the number of lines of code, all other things being equal, this is a deceptive and incorrect conclusion. The reason for this is related to the second piece of information we are hoping to convey with the lines of code measurement in the first place. Complexity.
A rather famous quote attributed to Albert Einstein is “Make everything as simple as possible, but not simpler”. Complexity is often an unavoidable consequence of the needs of software, but complexity in software is fundamentally neither valuable nor desirable. Complex software suffers from the unfortunate combination of having more bugs than simple software and being more difficult to fix when bugs are found. It is also more difficult to change complex software, and as complexity increases, the time it takes new developers to get up to speed vastly increases as well. Once software reaches a certain level of complexity, if a business ever finds itself in the unenviable position of needing to turn the project over to a new developer who was not part of the evolving codebase, it may find that the system has become so tangled that any developer without historical perspective on the codebase will find it nearly impossible to begin work without reimplementing large swaths of the code.
In many cases, however, complexity is more than simply a hurdle to overcome in order to create software; it is often a symptom of a lack of understanding of the problem. Even very complex problems often have elegant solutions, or at least relatively more elegant solutions, once they are fully understood.
The consequence of these facts is that using the Lines of Code metric to measure the value of software is almost impossible without first evaluating the complexity of the software. When you look at the number of lines of code, consider the complexity of the codebase, and further factor in the necessity of that complexity, you may begin to approach a useful metric for measuring a codebase. At that point, however, what you have really done is ask, “what other metrics can we use to measure the code, so that we can explain what Lines of Code means?” Knowing the number of lines in your codebase adds nothing once you know the features and can describe the level and quality of the code’s complexity. In short, for software, quantity means nothing compared to quality; LoC can tell you the number of lines in the software, but not the quality of the code.
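If complexity is the thing we actually care about, there are metrics that target it directly. One long-standing example is McCabe’s cyclomatic complexity. The sketch below is a simplified approximation using Python’s ast module (the set of decision nodes counted is an assumption of this sketch, not McCabe’s full definition); it shows a short, branchy function scoring higher than a long, branch-free one:

```python
import ast

# Simplified cyclomatic-complexity estimate: 1 plus the number of decision
# points (branches, loops, boolean operators, exception handlers). This is
# an approximation of McCabe's metric, not a full implementation.

DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISION_NODES)
                   for node in ast.walk(tree))

short_but_tangled = '''
def classify(x, y):
    if x > 0 and y > 0:
        return "both"
    if x > 0 or y > 0:
        return "one"
    return "neither"
'''

long_but_straight = "\n".join(f"v{i} = {i}" for i in range(50))

print(cyclomatic_complexity(short_but_tangled))  # higher despite fewer lines
print(cyclomatic_complexity(long_but_straight))  # 1: no branches at all
```

By lines, the straight-line snippet is roughly six times “bigger”; by this complexity estimate, the seven-line function is the harder code to test and maintain, which is much closer to what we were trying to say all along.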