CS-5630/6630 | Visualization | Fall 2014
Week 5: Codebases Millions of Lines of Code - Printable Version

Week 5: Codebases Millions of Lines of Code - rahduro - 09-25-2014 07:31 PM

[Image: 1276_lines_of_code5.png]
Source: http://www.informationisbeautiful.net/visualizations/million-lines-of-code/
I came across this interesting visualization where the author compares the lines of code in various codebases over time. I am sure you will encounter some very unexpected results.

One particular thing that is very interesting about this visualization is how the author has effectively captured more than one dimension on the y-axis and also used it to compare the evolution of the same software over time (see the grey arcs with percentage labels). We can see that color has been used to visually encode different categories. One subtle thing that caught my attention is the way the author has used fine lines inside each interval to mark millions, which makes it quite easy to interpret the actual value.

One of the really surprising stats that this visualization reveals is that an HD video player on the Xbox has more lines of code than Photoshop CS6. Not only that, the software in a modern high-end car has approximately 5 times the LOC of an F-35 fighter jet, or 2 times as much as the Large Hadron Collider. And as always, it is hard to believe how much evolution operating systems have gone through in the last 40 years: from about 10,000 LOC in Unix in 1971, we have reached close to 85 million in Mac OS X Tiger.

There are two things that I find very unconventional here. First, the author has used the same x-axis to represent thousands and later several millions without any scaling. Second is the use of multiple y-axis bars for one entry (see mouse genome and healthcare.gov). These are not confusing for me, but I would like to know what others think.

RE: Week 5: Codebases Millions of Lines of Code - u0788158 - 09-27-2014 06:43 PM

I also like the faint lines marking millions within the bars. I think the gray arcs are confusing. I saw this visualization several times before I realized that the arcs marked growth in percent. The use of multiple bars for Mouse Genome and healthcare.gov seems to be an attempt to emphasize the massiveness of healthcare.gov.

Some things in this visualization are very unclear. What does "APPARENT Size" mean for healthcare.gov? Does this mean it is an estimate? How exactly do you compare genome pairs to lines of code? Is one pair equivalent to one line of code?

As rahduro pointed out, it might seem strange that a consumer car has more code than the F-35. This raises the question of how lines of code are being counted. It also helps the visualization's purpose in showing that a lot of functionality does not necessarily mean a lot of code.
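To illustrate why the counting method matters, here is a minimal sketch that counts the same snippet three ways. The three definitions (physical lines, non-blank lines, non-blank non-comment lines) and the `count_lines` helper are assumptions made for illustration; nothing in the graphic says which definition, if any, its sources used.

```python
def count_lines(source: str, comment_prefix: str = "#") -> dict:
    """Count lines of code under three illustrative (assumed) definitions."""
    lines = source.splitlines()
    # Every line in the file, including blanks and comments.
    physical = len(lines)
    # Lines that contain at least one non-whitespace character.
    non_blank = sum(1 for line in lines if line.strip())
    # Non-blank lines that are not whole-line comments.
    code_only = sum(
        1
        for line in lines
        if line.strip() and not line.strip().startswith(comment_prefix)
    )
    return {"physical": physical, "non_blank": non_blank, "code_only": code_only}


sample = """# compute a total

total = 0
for n in range(10):
    # add each number
    total += n

print(total)
"""

print(count_lines(sample))
# {'physical': 8, 'non_blank': 6, 'code_only': 4}
```

Even on this tiny snippet the "size" doubles depending on the definition, so totals gathered from different sources may not be directly comparable.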

Ultimately, the visualization is very visually appealing, but leaves too many questions to be a reliable source of meaningful information.

RE: Week 5: Codebases Millions of Lines of Code - mtbkapp - 09-27-2014 08:40 PM

I assume how to measure lines of code has been argued about by programmers for decades now. I really like this vis. The numbers on the gray arcs are hard to read, but I think that's the trade-off of not having them occlude other things. There are 6 bins for color hue, which is good, not too many. The x-axis scale change from the first section to the rest is kind of odd because I expected the scale to shift again in the next section, but it doesn't. But once I discovered that, it didn't matter to me. Also, most of the data uses the same scale anyway; it's just the top section that is different. The faint gray lines to denote millions, along with the gaps to denote 10 million, are helpful. The idea of using lines of code to compare systems is fun, but the meaning is probably nuanced. Good thing the goal of these critiques is more about critiquing the vis than the data it expresses. My favorite thing to compare is the difference between Windows Vista and Windows 7. Less code in the latter, and it's a better product in my opinion. Thanks for posting this.

RE: Week 5: Codebases Millions of Lines of Code - u0862992 - 09-28-2014 08:57 AM

A bar chart was a great choice for showing the massive differences between the projects. This definitely highlights some very unexpected results. I liked how the visualization started from the smallest project and evolved from there. However, I find the arcs quite distracting. Also, some of the projects list a year of release(?) under the title of the project, but I don't see why some projects have that information while others do not.

RE: Week 5: Codebases Millions of Lines of Code - u0923385 - 09-28-2014 10:08 AM

I agree with others that the bar chart was a great choice. As u0788158 mentioned, the manner in which the bar charts were constructed works well for portraying subtotals (e.g. millions and tens of millions).

The hues selected work well also and were easily distinguishable, especially from the background.

My favorite aspect is the linkage between newer revisions of software. I notice that they differ in width. Is this used to encode the magnitude of the change?

I feel one of the greatest downsides of this visualization is its height. This makes it difficult to draw comparisons between the smallest and largest codebases without zooming out a good deal.

RE: Week 5: Codebases Millions of Lines of Code - Matthew Turner - 09-28-2014 12:44 PM

I think this is a fun, interesting, and informative visualization, but at the same time I agree with the previous post that pointed out that this graphic raises more questions than it answers and can't really be considered a reliable source.

One of the things I don't like is the combination of changing scales (from hundreds of thousands of lines to millions of lines) and how the final few marks are repeated on the y-axis. It works as far as emphasizing the insane amount of estimated lines of code for healthcare.gov, but I think that using a logarithmic scale would be a better choice for more fine-grained comparisons, or even enabling some sort of interaction to dynamically adjust the scale, e.g. a semantic zoom that animates the items scaled at hundreds of thousands of lines collapsing or expanding based on the current zoom level.
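As a rough sketch of what the logarithmic-scale suggestion would do, the snippet below compares bar lengths on a linear and a log10 scale for two figures quoted earlier in the thread (about 10,000 LOC for Unix in 1971 and about 85 million for Mac OS X Tiger). The `bar_length` helper, the 100-unit chart width, and the choice to start the log axis at 1 LOC are all assumptions for illustration, not anything taken from the original graphic.

```python
import math


def bar_length(loc: int, max_loc: int, width: float = 100.0, log: bool = False) -> float:
    """Map a line count to a bar length in the range 0..width.

    With log=True the axis is log10 and starts at 1 LOC (log10(1) == 0);
    otherwise the length is directly proportional to the count.
    """
    if log:
        return width * math.log10(loc) / math.log10(max_loc)
    return width * loc / max_loc


# Approximate figures as quoted in the first post of this thread.
codebases = {"Unix (1971)": 10_000, "Mac OS X Tiger": 85_000_000}
max_loc = max(codebases.values())

for name, loc in codebases.items():
    linear = bar_length(loc, max_loc)
    logged = bar_length(loc, max_loc, log=True)
    print(f"{name}: linear {linear:.2f}, log {logged:.2f}")
```

On the linear scale the Unix bar is a tiny fraction of one unit and effectively invisible, while on the log scale it spans roughly half the width. The cost is that bar lengths are no longer proportional to the counts, so the "twice as long means twice as much code" reading is lost.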

Additionally, I'm not fond of the fact that there is no legend for the hues. The creator chose to introduce the categories as they appear instead, but this adds clutter within the bar chart and can be confusing. The arcs connecting version upgrades immediately caught my eye, and the fact that the 'organism' label appears next to the first arc (between editions of Photoshop) took a second to register as a separate category rather than something related to the arc.

The user u0788158 pointed out that we don't know the procedure behind how "lines of code" are actually calculated, which is especially important considering the categories that aren't even software. I think this is a vital point to consider and is perhaps the core reason why I would consider this visualization as serving the purpose of educational entertainment rather than as a way to draw informed conclusions about the sizes of various codebases.

RE: Week 5: Codebases Millions of Lines of Code - Shravanthi - 09-28-2014 02:00 PM

It is very fascinating to visualize the codebases of different systems. The color choice seems right.
The use of the y-axis to represent both the different software and the millions of lines of code is brilliant.
The fine lines separating the millions are very legible and easy to understand.

However, the use of arcs to compare the percentages is not neat, and I thought it was not necessary either.
I feel the comparison with the mouse genome and the size of the healthcare.gov website does not fit in this visualization, and the multiple bars were also not clear.

RE: Week 5: Codebases Millions of Lines of Code - zhiminl - 09-28-2014 04:59 PM

This is a very popular visualization, and I think most people have seen it many times.
I think the purpose of this visualization is to show people the relative sizes of different applications.
And this visualization does tell us many fun stories about the sizes of different apps.

The first time I saw the visualization, my feeling was that it is too big. I can't see all the information at one time. I think many people want to compare different apps' sizes, but they have to scroll the page back and forth to compare the numbers and sizes. Another big problem is that when people try to compare the different sizes they may want to use the area as a standard, but not all of the bars are on the same scale. The author uses the arcs with percentages to fix this drawback, but the size of the arcs is also a problem.

I agree with Matt's idea that this visualization is a good picture for educational entertainment rather than a way to draw conclusions about the various codebases.

RE: Week 5: Codebases Millions of Lines of Code - Jeff Webb - 09-28-2014 08:30 PM

I found it hard to figure out what was going on in this visualization, at least in terms of the details. The broad outlines are clear: a comparison of lines of code. The y-axis was confusing at first because the scale--very long--was not immediately clear. My attention was first drawn to the x-axis, but that turned out to be just a way of quantifying changes in the y-axis. At any rate, I think this is an interesting visualization. I like the comparisons with the semi-circles because they provide a quasi-time series. And I also like the game-machine-organism comparisons.

RE: Week 5: Codebases Millions of Lines of Code - mmath - 09-28-2014 09:05 PM

I think I have seen this particular visualization a couple of times before. Overall I like how the data is visualized. The semi-translucent arcs do a good job of showing relations between bars without occluding any of the data, and the bars do a good job of showing the sheer amount of code. Overall, though, there isn't a lot to criticize.
A couple of changes that might be interesting to see, though:
  1. The data represented on a log scale, so that the same scale could be used along the entire graphic.
  2. An interactive info box that shows relevant information about the project code base, such as the languages, the source of the information, what the actual uncompressed size in bytes would be, ...