The Thesis

All PhD candidates around the world know about the thesis. You always knew about it: it marks the beginning of the end of your career as a PhD student, and if you actually finish it, you get that cool “Dr.” title you always wanted on your business card. What is the problem then? Why does it seem so frustrating when you sit down to do it? The following is based on a true story, actually my story: how I managed to write my thesis and track my progress.

Problem Definition

A typical PhD follows a simple process: read, think, propose, publish, and finally write the thesis. It is straightforward, and one can imagine that if you are already done with the rest, the write-up would be rather easy. But it is not.

The problem lies mostly in the fact that writing the thesis is a lengthy and lonely act. You have to do it yourself; nobody will come to your aid, except maybe your advisor.

In my case, I faced the following problem: for quite some time, I could not motivate myself to write. I began writing and, half a page later, I always stopped. I tried everything, but nothing seemed to work. My advisor got uncomfortable, and we began talking about a method to track my progress that would motivate me.

The Idea

Then I saw it: Georgios Gousios’s Thesis-o-meter (see link below). This was a couple of scripts that posted, every day, the progress of the PhD in each chapter. I decided to build one myself, introducing some alterations that would work better for me.

First, I had to find a tangible way to measure progress. I thought that was easy: the number of pages. The page count of a document is nice if you want to measure the size of the text, but it cannot act as a day-to-day key performance indicator (KPI). Why is that? Simply because if you bootstrap your thesis in LaTeX and put in all the standard chapters, bibliography, etc., you will find yourself with at least 15 pages. So, on that day I would show enormous progress. The next day, I would write only text, maybe one or two pages. The day after that, text plus some charts, which would count as three or four pages. Better, huh? That is the problem.

If you are a person like me, you could add one or two figures and say: “Ok, I am good for today, I added two pages!”. This is a nice excuse if you want to procrastinate. I needed something that would present the naked truth, something that would make me sit there and make some serious progress.

So, the number of pages was out of the question as a daily metric, but I thought we could still use it. The number of pages would be the end goal, with a minimum and a maximum. In Greece, a PhD thesis is usually 150 to 200 pages long (in my discipline, of course, computer science). So, I thought, this is the goal: a large block of text within those limits.

Then I decided that my metric should be the number of words in the text instead of the number of pages. Since I wrote my thesis in LaTeX, I could count the words of each file with standard UNIX tools, for example with the command wc -w myfile.tex. So, the algorithm has the following steps:

  • The goal is set to 150-200 pages in total
  • Each day,
    • Count the words for all files
    • Count the pages of the actual thesis file, for example the output PDF
    • Find the word contribution for that day by subtracting the previous day’s word count
    • Find an average of words per number of pages
    • Finally, provide an estimation for the completion of the thesis
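The steps above can be sketched in a few lines of Python. This is a minimal illustration, not my actual scripts: the function names are made up, and the average words-per-day figure is taken as an input, since the report below prints it (184) without showing how it is smoothed.

```python
import math

def daily_report(words, pages, prev_words, avg_words_per_day,
                 page_goal=(150, 200)):
    """One day's progress report, following the steps listed above."""
    effort = words - prev_words                       # today's word contribution
    words_per_page = words // pages                   # running average density
    # Translate the page goal into a word goal using the average density
    word_goal = (page_goal[0] * words_per_page,
                 page_goal[1] * words_per_page)
    # Estimated days until the minimum and maximum word goals are reached
    # (negative means the goal has already been passed)
    days_left = tuple(math.floor((goal - words) / avg_words_per_day)
                      for goal in word_goal)
    return effort, words_per_page, word_goal, days_left

# Feeding in the figures from the last report reproduces its numbers:
print(daily_report(words=55747, pages=179, prev_words=55605,
                   avg_words_per_day=184))
# (142, 311, (46650, 62200), (-50, 35))
```

The negative minimum simply means the lower page bound was already reached some fifty days earlier.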

Experience Report

I implemented this in Python and shell script. The process worked: each day a report was generated and sent to my advisor, but the best thing was that each day I saw the estimate trimmed down a little. This is the last report I produced:

     1899 build/2-meta-programming.tex
     1164 build/3-requirements.tex
    14058 build/thesis.bib
    55747 total

---- Progress ----
Worked for 167 day(s) ... 
Last submission: 20121025
Word Count (last version): 55747
Page Count (last version): 179
Avg Words per Page (last version): 311
Last submission effort: 142

---- Estimations ----
Page Count Range (final version): (min, max) = (150, 200)
Word Count Range (final version): (min, max) = (46650, 62200)
Avg Effort (Words per Day): 184
Estimated Completion in: (min, max) = (-50, 35) days, (-2.50, 1.75) months
Estimated Completion Date: (best, worst) = (2012-08-11, 2012-12-16)

The average words per page was 311, and I wrote about 184 words each day on average.


I wrote my thesis, but I have not submitted it yet (I hope to soon), for a number of practical reasons. Still, the process succeeded: I found my KPIs, and they actually led me to finishing the work. This is a fact, and now I have to find another motivation-driven method to do the rest of the required stuff. C’est la vie.

Related Links and Availability

I plan to release an open source version of my thesis-o-meter on my GitHub profile soon. I also found various alternative thesis-o-meters online.

The original post can be found on the XRDS blog.

The Zen of Multiplexing (Revisited)

Technologies related to software engineering occasionally produce the following phenomenon: they re-invent the wheel. Not only that, but these technologies always present themselves as mind-blowing, we-are-going-to-save-the-world solutions.

First Blood, Part One: Responsive Design

There is a bunch of programmers who always designed websites that did not fit quite well on smart phones and tablets. Then they decided to make them work properly and produced tons of libraries and related technologies focused on presenting content correctly on all screen sizes. Ok, but they think that this is a new piece of technology and not a bug fix. They have this great proposition: “Hey, now you can see this website on all devices, and we call it responsive design”, and we charge accordingly.

First Blood, Part Two: The Zen of Multiplexing

Not so long ago, IP networking was invented and people thought it would be cool to have several services multiplexed on the newly created network data channel. They conceived ports at the Transport Layer. Ports are numbers assigned to specific services; thus we have port 80 for HTTP, 22 for secure shell (SSH), 20 and 21 for FTP, etc.

So far, so good, but it seemed that our fellow software engineers were not happy with that approach and decided to do something else instead. First, the idea of permitting only the HTTP port (80) appeared. All other ports were designated as insecure (!) and, one by one, network administrators closed them with firewalls. But the need to multiplex was still there, and the re-invention pattern became a reality. The concept was simple: all protocols (or at least many of them) and services were rewritten to work over HTTP.


What Goes Around, Comes Around

The insecure-port mechanism gave its place to a new set of vulnerabilities, like SQL injections, etc. In addition, modern approaches appeared, and many services were rewritten from scratch, utilising the new paradigm.

But in reality the gain was minimal: firewalls were replaced by intrusion detection systems that perform deep packet inspection, to identify whether the packets transferred over HTTP are indeed legitimate. All protocols were replaced by HTTP, thus inheriting its good and bad characteristics.

The Final Frontier

It seems that there will never be an end to this. Software engineers seem to feel the need to reinvent the wheel (or parts of it) every now and then, forcing their clients to pay for costly software rewrites.

Of course, many of those alterations contain improvements, sometimes significant ones, but I think that all this is a matter of control. Maybe programmers think it is better to control all aspects of the software, and service multiplexing is a significant one that cannot be left to the operating system or the network to handle. But the question is: who pays the bill?


Simple Project Code Analysis with JSLoCCount

A few years back, when we were deep into the development of SQO-OSS, we built a prototype that calculated simple size metrics for a project, based on sloccount. Back then (I was much into software quality at the time), I thought it would be interesting to have a Java implementation of this tool, to better integrate with the SQO-OSS architecture, which was built on the JVM platform.

So I built a simple utility named JSLoCCount, which calculates SLoC (Source Lines of Code) and CLoC (Comment Lines of Code) for many programming languages. In addition, it also provides a simple report with counters for each file type, recognizing the most popular file types based on their extensions.

I recently decided to revive this old project and make it available on GitHub. It is used simply by executing the following command:

java -jar jsloccount.jar <directory>

The utility prints the SLoC and CLoC for the project, categorised by language, together with a file-popularity report. Both reports are also saved in two CSV files in the current working directory. For example, the reports for JSLoCCount itself look like:

Number of Files:

Java Compiled Class File, 14 / 30
Java, 11 / 30
JAR, 1 / 30
ANT Build File, 1 / 30
Other, 3 / 30

Number of Lines (comments):

Java, 544 (89)
ANT Build File, 24 (2)

and the two CSV files:

Resource Type,Source Lines of Code,Comments Lines of Code
ANT Build File,24,2


Resource Type,File Count,Total File Count
Java Compiled Class File,14,30
ANT Build File,1,30

The Scientific Retribution (I did my part) … Part One

Summer is slow, even when I am writing my thesis, so I decided to get the Apache logs from the web server that is hosting my website (aka gaijin) and see how many people have downloaded my publications so far. I recovered logs going one year back and voila … the results:

Starting Date: Tue Aug 04 05:21:07 EEST 2009
Ending Date: Mon Aug 23 22:52:25 EEST 2010

  1. Evaluating the Quality of Open Source Software, 578 hits, (Journal)
  2. Compiling regular expressions into Java bytecodes, 376 hits, (MSc thesis)
  3. Python tutorial (Part IV) – Simple GUI construction with wxPython, 358 hits, (Magazine)
  4. Building an e-business platform: An experience report, 328 hits, (Conference)
  5. Python tutorial (Part I) – Introduction, 305 hits, (Magazine)
  6. FIRE/J: Optimizing regular expression searches with generative programming, 278 hits, (Journal)
  7. Applying MDA in enterprise application interoperability: The PRAXIS project, 260 hits, (Conference)
  8. Enabling B2B transactions over the internet through application interconnection: The PRAXIS project, 239 hits, (Conference)
  9. Python tutorial (Part III) – Handling Bibliography with Python, 224 hits, (Magazine)
  10. PEGASUS: Competitive load balancing using inetd, 181 hits, (Conference)
  11. A Software Development Metaphor for Developing Semi-dynamic Web Sites through Declarative Specifications, 171 hits, (Technical Report)
  12. Software Quality Assessment of Open Source Software, 156 hits, (Conference)
  13. Python tutorial (Part II) – Blogging with Python, 153 hits, (Magazine)
  14. Performing peer-to-peer e-business transactions: A requirements analysis and preliminary design proposal, 151 hits, (Conference)
  15. Introducing Pergamos: A Fedora-based DL System Utilizing Digital Object Prototypes, 142 hits, (Conference)
  16. Tuning java’s memory manager for high performance server applications, 141 hits, (Conference)
  17. Enabling B2B transactions over the Internet through Application Interconnection: The PRAXIS Project., 130 hits, (Poster, Conference)
  18. Fortifying applications against XPath injection attacks, 60 hits, (Conference)
  19. J%: Integrating Domain-Specific Languages with Java, 46 hits, (Conference)
  20. Blueprints for a Large-Scale Early Warning System, 38 hits, (Conference)

In addition, I used a geolocation database for the IP addresses and listed the hits per country:

  1. United States 2299
  2. Greece 1036
  3. Russian Federation 196
  4. China 121
  5. Czech Republic 77
  6. India 63
  7. United Kingdom 53
  8. Netherlands 52
  9. Germany 44
  10. Brazil 34
  11. Thailand 32
  12. Taiwan 31
  13. Korea, Republic of 26
  14. Japan 17
  15. Egypt 15
  16. South Africa 14
  17. Cyprus 11
  18. Spain 10
  19. Ukraine 8
  20. Canada 7
  21. Italy 6
  22. Turkey 5
  23. Sri Lanka 4
  24. Indonesia 3
  25. United Arab Emirates 2
  26. Belarus 1

The total number of downloads was 4872.
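The log crunching itself boils down to counting successful PDF requests per path. The following is a hypothetical sketch of that step, assuming Apache's combined log format; the regex, function name, and sample line are mine, not the actual script.

```python
import re
from collections import Counter

# Matches the request portion of an Apache combined-format log line with a
# 200 response, e.g.:
# 1.2.3.4 - - [04/Aug/2009:05:21:07 +0300] "GET /papers/foo.pdf HTTP/1.1" 200 1024
REQUEST = re.compile(r'"GET (\S+\.pdf) HTTP/[\d.]+" 200 ')

def count_downloads(lines):
    """Tally successful PDF downloads per path from access-log lines."""
    hits = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            hits[match.group(1)] += 1  # one completed PDF download
    return hits

log = ['1.2.3.4 - - [04/Aug/2009:05:21:07 +0300] '
       '"GET /papers/foo.pdf HTTP/1.1" 200 1024']
print(count_downloads(log))
```

Mapping each client IP to a country is then a lookup against a geolocation database, as described above.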

My Favorite 2008 Quote

In this publication, Dr. Schlangemann indicates that:

“In this work we better understand how digital-to-analog converters can be applied to the development of e-commerce.”

I really wonder how this paper got accepted. Is the peer-review process so problematic? Or are there so many conferences that scientists are literally bombarded with paper reviews?


Having been doing my PhD for the last 4 years, I have a big collection of articles (I have not read them all :-p … but lots of them):

Macintosh:monitor bkarak$ count
Processed [...]/reading/2008/july-2008.bib ... found 22 entries
Processed [...]/reading/papers/toplas-survey.bib ... found 12 entries
Processed [...]/reading/papers/fire.bib ... found 61 entries
Processed [...]/reading/2008/august-2008.bib ... found 7 entries
Processed [...]/reading/papers/bkarak-publications.bib ... found 17 entries
Processed [...]/reading/2008/october-2008.bib ... found 25 entries
Processed [...]/reading/2008/september-2008.bib ... found 11 entries
Processed [...]/reading/papers/dsl-biblio.bib ... found 81 entries
Processed [...]/reading/3rdparty/yannis.bib ... found 61 entries
Processed [...]/reading/2008/may-2008.bib ... found 5 entries
Processed [...]/reading/papers/dsl-biblio2.bib ... found 19 entries
Processed [...]/reading/papers/full.bib ... found 507 entries
Processed [...]/reading/papers/ecoop08.bib ... found 40 entries
Processed [...]/reading/2008/april-2008.bib ... found 4 entries
Processed [...]/reading/papers/dds.bib ... found 277 entries
Processed [...]/reading/2008/june-2008.bib ... found 10 entries
Processed [...]/reading/2008/march-2008.bib ... found 13 entries

Total: 1172 entries in 17 files
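The `count` command above is my own small script; a rough equivalent can be sketched in a few lines of Python by counting BibTeX entry markers in each file. The function name and the regex are assumptions, and the pattern is deliberately crude (it would also count `@comment` or `@preamble` blocks as entries).

```python
import glob
import re

# A BibTeX entry starts a line with "@type{", e.g. "@article{key,".
ENTRY = re.compile(r"^@\w+\s*\{", re.MULTILINE)

def count_bib(pattern):
    """Count BibTeX entries in every file matching a glob pattern."""
    total = 0
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path, errors="ignore") as f:
            entries = len(ENTRY.findall(f.read()))
        print(f"Processed {path} ... found {entries} entries")
        total += entries
    return total
```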