Adventure in programming languages and simple statistics

January 03, 2020

This is a story about how I found myself trying out different programming languages.

I’ve been running an OLTP-type database test (DBT-2, if you’ve heard of it) and noticed that the post-processing scripts for analyzing the test results were running longer than I expected. I was running tests on 16 systems at the same time, and one particular script was taking just over 60 seconds per system. Although the tests themselves ran simultaneously, I was processing the data from each system sequentially, so the processing time added up. While the lack of parallelism is one aspect that could be improved, it got me thinking about whether the script itself could be improved, or whether this is simply the norm for analyzing a 13 million row CSV file that is about 500 MB in size. Here is a sample of the data:


1572490811,d,C,0.009459,140665731475200
1572491011,START,,,
1572490813,n,C,0.004410,140677646186240
1572491601,TERMINATED,,,-99952896

The post-processing script is currently written in R. It’s less than 50 lines of code, where half of that processes the data and the other half displays it. The processing basically consists of identifying the phases of the test (ramp-up, test execution, etc.), aggregating counts of successful and failed transactions, and calculating the mean and a quantile of the response times per transaction type. At the time, I thought it would be fun to learn how to do basic statistics with a language like R.

Previously, the script was written in Python, but it didn’t take advantage of any advanced data types. The Python script reads the data file line by line and counts transactions as it reads the file. I dusted off the old script from the repository and found that it took about 25 seconds to process the same file.
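
Just to give a rough idea of that line-by-line approach, here is a simplified, untested sketch in Python rather than the actual script. The column meanings (timestamp, transaction type, status, response time, thread id), the idea that "C" marks a committed transaction, and the file name mix.log are assumptions on my part, and the real script also separates the ramp-up phase from the measured phase, which I skip here:

import csv
import statistics
from collections import defaultdict

# tallies of committed vs. failed transactions and response times per type
counts = defaultdict(lambda: {"ok": 0, "failed": 0})
times = defaultdict(list)

with open("mix.log", newline="") as f:            # file name is assumed
    for ts, ttype, status, response, thread in csv.reader(f):
        if ttype in ("START", "TERMINATED"):      # phase markers, not transactions
            continue
        if status == "C":                         # assuming "C" means committed
            counts[ttype]["ok"] += 1
            times[ttype].append(float(response))
        else:
            counts[ttype]["failed"] += 1

for ttype in sorted(counts):
    rt = times[ttype]
    mean = statistics.mean(rt) if rt else 0.0
    # 90th percentile: the last of the nine cut points splitting the data into ten groups
    q90 = statistics.quantiles(rt, n=10)[-1] if len(rt) > 1 else mean
    print(f"{ttype}: {counts[ttype]['ok']} ok, {counts[ttype]['failed']} failed, "
          f"mean {mean:.6f}s, 90th percentile {q90:.6f}s")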

That got me thinking… Do I really need data frames to do this? The Python script has about 3 times as many lines of code, which I attribute to the difference between using data frames and not using them, rather than to Python vs. R. 150 lines of code doesn’t seem like a lot, right? Should I have kept using the Python version of the script? Maybe I should try Pandas?
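
I didn’t actually write or benchmark one, but just to picture the difference, the core aggregation in a Pandas version might look roughly like this (an untested sketch using the same assumed column meanings and file name as above):

import pandas as pd

# assumed column names; the real log has no header row
cols = ["timestamp", "txn", "status", "response", "thread"]
df = pd.read_csv("mix.log", names=cols)           # file name is assumed

# drop the phase-marker rows and keep only actual transactions
txns = df[~df["txn"].isin(["START", "TERMINATED"])].copy()
txns["response"] = txns["response"].astype(float)

summary = txns.groupby("txn").agg(
    ok=("status", lambda s: (s == "C").sum()),    # assuming "C" means committed
    failed=("status", lambda s: (s != "C").sum()),
    mean_response=("response", "mean"),
    q90_response=("response", lambda r: r.quantile(0.9)),
)
print(summary)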

Ultimately, I didn’t try to answer any of those questions.

Recently, Julia caught my eye because I saw something mentioning that Julia had joined the “Petaflop Club”, which includes C, C++, and Fortran. For the moment, I became more interested in seeing what Julia is like before revisiting Python or R.

After building a Julia script that uses data frames like the R script does, and another similar in spirit to the old Python script, it turned out that the data frame version took about 45 seconds while the version without data frames ran in about 20 seconds.

After all of that, I think I created more questions than I answered. Is reducing code size worth the difference in execution time? How does Pandas compare? Should I try R without data frames? Perhaps I just need some help using each of these languages better to perform these simple statistics more efficiently? Maybe I should try more programming languages?

