
Friday, May 26, 2017

Introducing Kotlin Statistics




Last fall I wrote an article, Kotlin for Data Science, where I proposed Kotlin as a general programming language for data science and analytics. To my surprise, I later found out a lot of folks read the article and quietly started to investigate the idea. In Spring 2017 the conversation began to grow in the Kotlin community, but nothing set fire to the prospect quite like Google's announcement that it would support Kotlin. With that announcement, as well as the fact that Kotlin compiles to JavaScript and soon LLVM, it is clear that Kotlin is poised to gain adoption in multiple domains.

Since my previous article, the topic of integrating data science workflows with software engineering has gained traction across data science communities. O'Reilly posted an interesting article about using Go for data science, and a quick Google search for "DevOps for Data Science" reveals the same theme: closing the gap between data science and software engineering is a logical next step in progressing data science as a discipline.

I believe that Python and R will continue to be tools used for analytics, but they are not sustainable for workflows that require continuous delivery into production. Of course, libraries like Apache Spark and others are striving to support multiple language APIs, but Kotlin has a lot of potential to usher in a bigger picture. That is what I hope to show with the Kotlin Statistics library I released this week. It is not a silver bullet to all problems in data science, nor does it have advanced features like ML at the moment. Rather, I want Kotlin Statistics to show how Kotlin's inferred static typing and abstraction can make data science code simpler and more tactical, but also resilient and refactorable. Not to mention, the tooling for Kotlin is fantastic with IntelliJ IDEA.


A Quick Tour


I released the Kotlin Statistics library this week. It is not yet at a 1.0 version, but it should give you a good set of tools to start doing fundamental statistical analysis.

Take for example this Kotlin code below where I declare a Patient type, and I include the first name, last name, birthday, and white blood cell count. I also have an enum called Gender reflecting a MALE/FEMALE category. Of course, I could import this data from a text file, a database, or another source, but for now I am going to declare them in literal Kotlin code:

import java.time.LocalDate

data class Patient(val firstName: String,
                   val lastName: String,
                   val gender: Gender,
                   val birthday: LocalDate,
                   val whiteBloodCellCount: Int)


val patients = listOf(
        Patient("John", "Simone", Gender.MALE, LocalDate.of(1989, 1, 7), 4500),
        Patient("Sarah", "Marley", Gender.FEMALE, LocalDate.of(1970, 2, 5), 6700),
        Patient("Jessica", "Arnold", Gender.FEMALE, LocalDate.of(1980, 3, 9), 3400),
        Patient("Sam", "Beasley", Gender.MALE, LocalDate.of(1981, 4, 17), 8800),
        Patient("Dan", "Forney", Gender.MALE, LocalDate.of(1985, 9, 13), 5400),
        Patient("Lauren", "Michaels", Gender.FEMALE, LocalDate.of(1975, 8, 21), 5000),
        Patient("Michael", "Erlich", Gender.MALE, LocalDate.of(1985, 12, 17), 4100),
        Patient("Jason", "Miles", Gender.MALE, LocalDate.of(1991, 11, 1), 3900),
        Patient("Rebekah", "Earley", Gender.FEMALE, LocalDate.of(1985, 2, 18), 4600),
        Patient("James", "Larson", Gender.MALE, LocalDate.of(1974, 4, 10), 5100),
        Patient("Dan", "Ulrech", Gender.MALE, LocalDate.of(1991, 7, 11), 6000),
        Patient("Heather", "Eisner", Gender.FEMALE, LocalDate.of(1994, 3, 6), 6000),
        Patient("Jasper", "Martin", Gender.MALE, LocalDate.of(1971, 7, 1), 6000)
)

enum class Gender {
    MALE,
    FEMALE
}
If you find the LocalDate.of() or other parts of the declaration to be redundant and wordy, you can easily create functions or type aliases to make things more concise, but I am not going to digress into that right now.
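
As a minimal sketch of what such shortcuts could look like, the snippet below declares a small helper function and a type alias. The names date, Birthday, and Person are hypothetical and are not part of Kotlin Statistics or the Patient example; type aliases also assume Kotlin 1.1 or later.

```kotlin
import java.time.LocalDate

// Hypothetical helper: a shorter way to build a LocalDate.
fun date(year: Int, month: Int, day: Int): LocalDate = LocalDate.of(year, month, day)

// Hypothetical alias that names the intent of the field.
typealias Birthday = LocalDate

// Hypothetical, trimmed-down record used only for this illustration.
data class Person(val name: String, val birthday: Birthday)

val john = Person("John", date(1989, 1, 7))
```

With a helper like this, each row of the patients list would read date(1989, 1, 7) instead of LocalDate.of(1989, 1, 7).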
Let's start with some basic analysis: what is the average and standard deviation of whiteBloodCellCount across all the patients? We can leverage some extension functions in Kotlin Statistics to find this quickly:

fun main(args: Array<String>) {

    val averageWbcc =
            patients.map { it.whiteBloodCellCount }.average()

    val standardDevWbcc =
            patients.map { it.whiteBloodCellCount }.standardDeviation()

    println("Average WBCC: $averageWbcc, Std Dev WBCC: $standardDevWbcc")

}
We should get this output:
Average WBCC: 5346.153846153846, Std Dev WBCC: 1412.2177503341948
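
To make those two numbers less of a black box, here is a plain-Kotlin sketch of the same calculations. The printed standard deviation is consistent with the sample formula (dividing by n - 1); that the library computes it this way internally is my inference from the output, so treat the function names mean and sampleStandardDeviation as hypothetical stand-ins.

```kotlin
// White blood cell counts copied from the patients list above.
val wbcc = listOf(4500, 6700, 3400, 8800, 5400, 5000, 4100, 3900, 4600, 5100, 6000, 6000, 6000)

// Arithmetic mean of the values.
fun mean(values: List<Int>): Double = values.sum().toDouble() / values.size

// Sample standard deviation: sum of squared deviations divided by (n - 1), then square-rooted.
fun sampleStandardDeviation(values: List<Int>): Double {
    val m = mean(values)
    val sumOfSquares = values.map { (it - m) * (it - m) }.sum()
    return Math.sqrt(sumOfSquares / (values.size - 1))
}
```

Calling mean(wbcc) and sampleStandardDeviation(wbcc) should agree with the output above to within floating-point rounding.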
However, we sometimes need to slice our data not only for more detailed insight but also to judge our sample. For example, did we get a representative sample with our patients for both male and female? We can use the countBy() operator in Kotlin Statistics to count a Collection or Sequence of items by a keySelector as shown here:

fun main(args: Array<String>) {

    val genderCounts = patients.countBy(
            keySelector = { it.gender }
    )

    println(genderCounts)
}

This returns a Map<Gender,Int>, reflecting the patient count by gender. Here is what it looks like in the output from our code above:
{MALE=8, FEMALE=5}
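
For comparison, the same kind of count can be sketched with only the Kotlin standard library, using groupingBy() and eachCount() (available since Kotlin 1.1). The function name countByKey is hypothetical, and the genders are written as plain strings to keep the snippet self-contained.

```kotlin
// Hypothetical stand-in for countBy(), built only from the Kotlin standard library.
fun <T, K> Iterable<T>.countByKey(keySelector: (T) -> K): Map<K, Int> =
        groupingBy(keySelector).eachCount()

// Genders in the same order as the patients list above.
val genders = listOf("MALE", "FEMALE", "FEMALE", "MALE", "MALE", "FEMALE", "MALE",
        "MALE", "FEMALE", "MALE", "MALE", "FEMALE", "MALE")

val genderTally = genders.countByKey { it }
```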
Okay, so our sample is a bit MALE-heavy, but let's move on. We can also find the average white blood cell count by gender using averageBy(). This accepts not only a keySelector lambda but also an intMapper to select an integer off each Patient (we could also use doubleMapper, bigDecimalMapper, etc). In this case, we are selecting the whiteBloodCellCount off each Patient and averaging it by Gender, as shown next:

fun main(args: Array<String>) {

    val averageWbccByGender = patients.averageBy(
            keySelector = { it.gender },
            intMapper = { it.whiteBloodCellCount }
    )

    println(averageWbccByGender)
}

{MALE=5475.0, FEMALE=5140.0}

So the average WBCC for MALE is 5475, and FEMALE is 5140.
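
If it helps to see what a grouped average boils down to, this is a rough standard-library sketch: group the items by key, then average the mapped values in each group. The name averageByKey and the pair-based data are assumptions for illustration, not the library's implementation.

```kotlin
// Hypothetical stand-in for averageBy(): group by key, then average each group.
fun <T, K> Iterable<T>.averageByKey(keySelector: (T) -> K, intMapper: (T) -> Int): Map<K, Double> =
        groupBy(keySelector).mapValues { entry -> entry.value.map(intMapper).average() }

// (gender, whiteBloodCellCount) pairs copied from the patients list above.
val readings = listOf(
        "MALE" to 4500, "FEMALE" to 6700, "FEMALE" to 3400, "MALE" to 8800,
        "MALE" to 5400, "FEMALE" to 5000, "MALE" to 4100, "MALE" to 3900,
        "FEMALE" to 4600, "MALE" to 5100, "MALE" to 6000, "FEMALE" to 6000,
        "MALE" to 6000
)

val averageByGender = readings.averageByKey(
        keySelector = { it.first },
        intMapper = { it.second }
)
```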

What about age? Did we get a good sampling of younger and older patients? If you look at our Patient class, we only have a birthday to work with which is a Java 8 LocalDate. But using Java 8's date and time utilities, we can derive the age in years in the keySelector like this:

import java.time.temporal.ChronoUnit

fun main(args: Array<String>) {

    val patientCountByAge = patients.countBy(
            keySelector = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()) }
    )

    println(patientCountByAge)
}
And here is the output (the ages reflect the date this was written, since the code calls LocalDate.now()):

{28=1, 47=1, 37=1, 36=1, 31=2, 41=1, 25=2, 32=1, 43=1, 23=1, 45=1}
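
Because LocalDate.now() changes every day, one way to make the ages reproducible is to compute them against a fixed reference date. The sketch below assumes the article's publication date as that reference; ageOn and articleDate are hypothetical names.

```kotlin
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical helper: age in whole years on a given reference date.
fun ageOn(birthday: LocalDate, asOf: LocalDate): Int =
        ChronoUnit.YEARS.between(birthday, asOf).toInt()

// The publication date of this article, used as a fixed "today".
val articleDate: LocalDate = LocalDate.of(2017, 5, 26)

// John Simone's birthday from the patients list above.
val johnsAge = ageOn(LocalDate.of(1989, 1, 7), articleDate)
```

Swapping LocalDate.now() for a fixed date like this in the keySelector keeps the counts stable no matter when the code is run.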

If you look at our output for the code, it is not very meaningful to get a count by age. It would be better if we could count by age ranges, like 20-29, 30-39, and 40-49. We can do this using the binByXXX() operators. If we want to bin by an Int value such as age, we can build a BinModel whose bins start at 20 (rangeStart) and each span 10 years (binSize). We also provide the value we are binning using binMapper, which is the patient's age as shown below:

fun main(args: Array<String>) {

    val binnedPatients = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20
    )

    binnedPatients.forEach {
        println(it)
    }
}

And here is the output showing all our Patient items binned up in a BinModel, by these age ranges:

Bin(range=20..29, value=[Patient(firstName=John, lastName=Simone, gender=MALE, birthday=1989-01-07, whiteBloodCellCount=4500), Patient(firstName=Jason, lastName=Miles, gender=MALE, birthday=1991-11-01, whiteBloodCellCount=3900), Patient(firstName=Dan, lastName=Ulrech, gender=MALE, birthday=1991-07-11, whiteBloodCellCount=6000), Patient(firstName=Heather, lastName=Eisner, gender=FEMALE, birthday=1994-03-06, whiteBloodCellCount=6000)])
Bin(range=30..39, value=[Patient(firstName=Jessica, lastName=Arnold, gender=FEMALE, birthday=1980-03-09, whiteBloodCellCount=3400), Patient(firstName=Sam, lastName=Beasley, gender=MALE, birthday=1981-04-17, whiteBloodCellCount=8800), Patient(firstName=Dan, lastName=Forney, gender=MALE, birthday=1985-09-13, whiteBloodCellCount=5400), Patient(firstName=Michael, lastName=Erlich, gender=MALE, birthday=1985-12-17, whiteBloodCellCount=4100), Patient(firstName=Rebekah, lastName=Earley, gender=FEMALE, birthday=1985-02-18, whiteBloodCellCount=4600)])
Bin(range=40..49, value=[Patient(firstName=Sarah, lastName=Marley, gender=FEMALE, birthday=1970-02-05, whiteBloodCellCount=6700), Patient(firstName=Lauren, lastName=Michaels, gender=FEMALE, birthday=1975-08-21, whiteBloodCellCount=5000), Patient(firstName=James, lastName=Larson, gender=MALE, birthday=1974-04-10, whiteBloodCellCount=5100), Patient(firstName=Jasper, lastName=Martin, gender=MALE, birthday=1971-07-01, whiteBloodCellCount=6000)])
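
The arithmetic behind fixed-width binning is simple enough to sketch by hand: subtract the range start, integer-divide by the bin size, and rebuild the range from that offset. The function below is a hypothetical stand-in that returns a plain Map<IntRange, List<T>> rather than the library's BinModel, and it will happily create bins below rangeStart for smaller values.

```kotlin
// Hypothetical stand-in for binByInt(): groups items by the IntRange bin containing their value.
fun <T> Iterable<T>.binByIntSketch(binSize: Int, rangeStart: Int, binMapper: (T) -> Int): Map<IntRange, List<T>> =
        groupBy { item ->
            // How many whole bins the value sits away from rangeStart (floorDiv handles values below it).
            val offset = Math.floorDiv(binMapper(item) - rangeStart, binSize)
            val start = rangeStart + offset * binSize
            start..(start + binSize - 1)
        }

// Ages of the thirteen patients as of the output shown above.
val ages = listOf(28, 47, 37, 36, 31, 41, 31, 25, 32, 43, 25, 23, 45)

val ageBins = ages.binByIntSketch(binSize = 10, rangeStart = 20) { it }
```

Here ageBins[20..29] holds the four ages in their twenties, mirroring the first Bin in the output above.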

We can look up the bin for a given age using an accessor syntax. For example, we can retrieve the Bin for the age 25 like this, and it will return the 20-29 bin:

fun main(args: Array<String>) {

    val binnedPatients = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20
    )

    println(binnedPatients[25])
}

If we want to perform an aggregation on each bin rather than collect the items into it, we can also provide a groupOp argument. This is a lambda specifying how to reduce the List<Patient> in each Bin. Below is the average white blood cell count by age range:

fun main(args: Array<String>) {

    val avgWbccByAgeRange = patients.binByInt(
            binMapper = { ChronoUnit.YEARS.between(it.birthday, LocalDate.now()).toInt() },
            binSize = 10,
            rangeStart = 20,
            groupOp = { it.map { it.whiteBloodCellCount }.average() }
    )

    println(avgWbccByAgeRange)
}

Here is the output, showing that the average white blood cell count for each age range is within the 5000's:

BinModel(bins=[Bin(range=20..29, value=5100.0), Bin(range=30..39, value=5260.0), Bin(range=40..49, value=5700.0)])

Using let() for Multiple Calculations


There may be times you want to perform multiple aggregations to create reports of various metrics. This is usually achievable using Kotlin's let() operator. Say you wanted to find the 1st, 25th, 50th, 75th, and 100th percentiles by gender. We can tactically declare a Kotlin extension function called wbccPercentileByGender() which will take a set of patients and perform a percentile calculation separately for each gender. Then we can invoke it for the five desired percentiles and package them in a Map<Double,Map<Gender,Double>>, as shown below:


fun main(args: Array<String>) {

    fun Collection<Patient>.wbccPercentileByGender(percentile: Double) =
            percentileBy(
                    percentile = percentile,
                    keySelector = { it.gender },
                    doubleMapper = { it.whiteBloodCellCount.toDouble() }
            )

    val percentileQuadrantsByGender = patients.let {
        mapOf(1.0 to it.wbccPercentileByGender(1.0),
                25.0 to it.wbccPercentileByGender(25.0),
                50.0 to it.wbccPercentileByGender(50.0),
                75.0 to it.wbccPercentileByGender(75.0),
                100.0 to it.wbccPercentileByGender(100.0)
        )
    }

    percentileQuadrantsByGender.forEach(::println)
}

OUTPUT:

1.0={MALE=3900.0, FEMALE=3400.0}
25.0={MALE=4200.0, FEMALE=4000.0}
50.0={MALE=5250.0, FEMALE=5000.0}
75.0={MALE=6000.0, FEMALE=6350.0}
100.0={MALE=8800.0, FEMALE=6700.0}
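
Percentiles can be estimated in several ways, so it is worth seeing one concretely. The values printed above are consistent with placing the percentile at position p * (n + 1) / 100 in the sorted data and interpolating linearly between neighbours. Whether the library uses exactly this rule is my assumption based on the output; the percentile function below is a hypothetical sketch, not the library's code.

```kotlin
// Percentile estimate using position p * (n + 1) / 100 with linear interpolation.
fun percentile(values: List<Double>, p: Double): Double {
    require(values.isNotEmpty()) { "values must not be empty" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = values.sorted()
    val n = sorted.size
    val pos = p * (n + 1) / 100.0              // 1-based, possibly fractional rank
    if (pos < 1.0) return sorted.first()       // below the first rank: use the minimum
    if (pos >= n) return sorted.last()         // at or past the last rank: use the maximum
    val lowerRank = Math.floor(pos).toInt()    // 1-based rank of the lower neighbour
    val fraction = pos - lowerRank
    val lower = sorted[lowerRank - 1]
    val upper = sorted[lowerRank]
    return lower + fraction * (upper - lower)
}

// White blood cell counts of the male and female patients from the list above.
val maleWbcc = listOf(4500.0, 8800.0, 5400.0, 4100.0, 3900.0, 5100.0, 6000.0, 6000.0)
val femaleWbcc = listOf(6700.0, 3400.0, 5000.0, 4600.0, 6000.0)
```

For example, the 25th percentile of maleWbcc lands at rank 2.25, a quarter of the way from 4100 to 4500, which gives the 4200 shown above.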

Summary

This was a somewhat simple introduction to Kotlin Statistics and the functionality I have built so far. Be sure to read the project's README to see a more comprehensive set of operators available in the library. Over time, I plan on improving with linear regression, charting, and other features. I am also thinking of putting in Bayesian model support after I finish scoping it out.

But more importantly, I hope this demonstrates Kotlin's efficacy in being tactical but robust. Kotlin is capable of rapid turnaround for quick ad hoc analysis, but you can take that statically-typed code and put it in production if you need to. While I am seeking to add more functionality to this, it would be awesome to see others contribute to the idea of using Kotlin for these kinds of purposes.

Saturday, May 20, 2017

System76 Galago Pro Review

I have owned a System76 Kudu since Fall of 2016 (which I reviewed here) and it has helped me be much more productive. However, I bought that as a desktop replacement laptop which it excels at, and I needed to buy something more mobile. Thankfully, I pre-ordered the System76 Galago Pro not long after it was announced, and I finally received it a week ago.

Although I have been ridiculously busy wrapping up my second book Learning RxJava, some folks asked if I could write a review. So here it goes:

The Galago Pro is thin, light, sturdy, and beautiful.


Design
  


The System76 Galago Pro has a discreet and ergonomic profile, and is only 0.56" in height. It is light and thin, and perfect for mobile use even if you have to walk and type with one hand. The aluminum casing is a nice touch and helps it feel sturdy.

Aluminum casing helps this laptop feel sturdy, and it looks cool

The keyboard has great response. The resistance on the keys feels just right. The placement and spacing between them does not feel cramped and it feels even better than my 17" System76 Kudu, so typing is pretty fluid. The trackpad is smooth and recognizes gestures without issue as well.

The keyboard is not cramped and its design feels optimized.


Hardware

I did not upgrade a lot of the hardware when I bought my Galago Pro. I kept it pretty modest as shown below. As a developer working on open-source projects and writing books, this configuration is plenty.

Ubuntu 17.04 (64-bit)    
1× 13.3" Anti-glare 3K HiDPI Display    
Intel® HD Graphics 620    
3.1 GHz i5-7200U (2.5 up to 3.1 GHz – 3MB Cache – 2 Cores – 4 Threads)    
4 GB DDR4 at 2133MHz (1× 4 GB)    
250 GB M.2 SSD     $59.00
No 2nd Drive    
United States Keyboard    
WiFi up to 433 Mbps + Bluetooth    

Although it does not affect me, it is too bad international layouts are not available for the keyboard. I know a few folks in Europe who would like to order a System76 but are not satisfied using stickers on their keyboard. I understand System76 is working on this though.

Finishing the final chapter of Learning RxJava on my Galago Pro.

I upgraded to a 250 GB hard drive, but went with the default M.2 SSD instead of the much more expensive PCIe. For my purposes, I found this to boot quickly and perform fast enough. I do not do a lot of video or picture editing where an ultra-fast hard drive can make a difference.

I wish the battery was more ambitious than 4-5 hours. You could probably squeeze more out of it by using airplane mode and lowering screen brightness. But I've found doing word processing with Internet gives me about 4-5 hours. If I'm using an intensive IDE like IntelliJ IDEA and writing Kotlin code (with Internet), it gravitates towards 3-4. I understand this is about the same performance as the current MacBook Pro, so this is not bad. But it would be awesome to see the boundaries of battery life pushed farther with an ambitious machine like this.

Of course, the big selling point with the Galago Pro is the ports. It has plenty of them!

Lock, Ethernet, SD/MMC, HDMI, Mini DisplayPort, USB, USB-C (w/ Thunderbolt 3)


Power, SIM, USB, microphone, and headphone

Unlike the recent MacBook, you will likely not need any dongles here. It is impressive how many ports have been packed into such a thin device. What I found most intriguing is how System76 fit the Ethernet jack, which has a door that flips down to hold the Ethernet cable as shown below:


The Ethernet port has a clever collapsing door

I am glad System76 was not quick to slash the Ethernet port but rather found an innovative way to include it into the design of the laptop. While I would not deliberately test this, the door feels pretty sturdy against my everyday abuse of pulling a cable in-and-out. It is also level with the table top when a cable is inserted.

Having an Ethernet port is especially life-saving when you encounter WiFi driver issues, and you need to connect to the Internet to get them.

Setup

System76 ships its computers with Ubuntu, but I prefer to use the Linux Mint distro. While Linux Mint is based on Ubuntu, I find Linux Mint to provide a much more fluid experience and "just works" when it comes to usability (although it is promising what System76 is doing with the GTK "Pop" theme). Normally, putting your own Linux distro on a System76 machine is a problem-free experience. Just make sure to install the System76 drivers.




The Galago Pro worked smoothly with the default Ubuntu installation. However, I ran into a driver problem when I installed Linux Mint 18.1 (the latest version at the time of writing). It has an older Linux Kernel version that does not include drivers for the Galago Pro's new hardware, including the Intel wireless chip. This meant I had no wireless Internet to solve the problem, and thankfully the Ethernet port came in to save the day. I updated the Linux Kernel and then everything worked.
 
System76's customer service is always stellar, and unlike many companies, they are helpful towards tinkerers and hackers. I did send a message to them and suggested their System76 driver should check the Linux kernel version, and they were immediately responsive and forwarded that to their engineering team. They apologized that I had any difficulties in the first place, as they strive to have everything work even if you use a different Linux distro.

Summary

The System76 Galago Pro is a beautiful machine that feels highly productive for a 13" ultrabook. The keyboard and trackpad feel phenomenal, and the HiDPI screen is beautiful. But what really stands out are the many physical ports to get plugged in, including a clever Ethernet port for those of us that like to be wired.

The only place I wish the Galago Pro pushed the boundaries a bit more is battery life. I get about 4-5 hours with moderate screen brightness and doing everyday work. However, this sounds to be on par with the current MacBook Pro, so it is unfair to cite this as a downside. But in a perfect world, 8 hours would be nice.

If you are looking for a high-quality, mobile alternative to a MacBook, Surface, or other mobile productivity device, the Galago Pro is great. It truly excels at the intersection between mobility and not cutting corners, and it just looks and feels cool.




Sunday, December 11, 2016

System76 Kudu Review

About a week ago, I ordered and received the System76 Kudu laptop. Here is my review after six days of use.

The System76 Kudu, a 17.3 inch laptop

System76 is a relatively young computer brand that differentiates itself by making high-quality laptops installed with Ubuntu, the most popular Linux distro available. Since Apple's underwhelming 2016 MacBook announcements, System76 announced their website crashed due to the surge in traffic and orders. While Linux is still relatively obscure to the consumer public, System76 has taken advantage of the sudden interest from Apple users.

I have never been a MacBook person. I used a MacBook to prototype an iOS app in Objective-C back in 2011. I was not terribly thrilled with the developer tools at the time, so I never felt compelled to switch from my Windows/Linux dual boot setup. I switched to Windows almost exclusively for a time after I bought a Surface Pro 3. While it is a great piece of hardware, Windows 10 was slow and updated at the most inopportune times. There are few things more annoying than getting ready to project my Surface for a business meeting, only to sit there hoping the update will finish in time. After this happened on a business trip a few weeks ago, I made up my mind to switch back to Linux.

As I became proficient as a developer, I was already gravitating back to Linux on my desktop computer, and became particularly fond of Linux Mint which is based off Ubuntu. From my experience, Linux Mint is snappier, faster, less buggy, and prettier than Ubuntu. I tried to put Linux Mint on my Surface Pro 3, but the UEFI made it an absolute pain to use. Not to mention, the drivers hacked together in the linux-surface package felt fine at first, but broke down and froze my machine over time.

I decided to jump ship on Windows devices and wanted a UEFI/SecureBoot-free laptop. Although not fashionable, I wanted a 17.3" screen because I needed a large amount of screen real estate for the projects I am doing. Although this would make me less mobile, I did not plan on using it on the plane. Initially I looked at Dell Ubuntu machines, but then a colleague from an open-source project suggested I check out System76, and I immediately became drawn to the Kudu.

I use my laptops for work, and primarily do coding, technical writing, and instructional videos for O'Reilly. Therefore I do not need high-end graphics capability or a 4K display, or else I would have gone with the popular System76 Oryx Pro. But the Kudu seemed to be what I needed. When I bought it, it featured a 6th-gen Intel i7 and a 1080p display. I customized it to have 8GB of RAM and a 256 GB PCIe M.2 SSD. Out the door, I paid around $1150. I'm glad to say it was worth every penny.


The System76 feels solid but sleek for a 17.3" laptop. It does not have a "hollow" cheap feel that many PC laptops have in this category



Installing Linux Mint

I tried to use the stock installation of Ubuntu, but after a few hours I wiped it and installed Linux Mint, which runs phenomenally well on it. What is great about System76 is you can take a vanilla Ubuntu, Linux Mint, and most other Linux distro images and install them with no drama. System76 makes all of their drivers available as a PPA so you can optimize Ubuntu or Linux Mint easily. It is nice to not have to use a proprietary OS with customizations (or even worse a Windows license key). You can nuke and re-install the Linux OS with no hassles at any time. This is a huge feature especially as Microsoft is now locking down Windows devices with UEFI and SecureBoot.

I installed Linux Mint 18 Cinnamon Edition on my Kudu, which I prefer over Ubuntu.


Appearance and Mobility

When I started using the Kudu, the first thing I noticed is it does not have that cheap "hollow" feel that many large-screen work laptops have. It feels fairly solid but is not too heavy at 6.8 lbs. Granted you will not want to walk long distances with it in one arm, but it is fairly mobile for a 17.3 inch laptop. I am able to work with it comfortably on my lap, but it probably would be difficult to use on a plane or other tight areas.

It has a fairly low profile as well. It is not an ultrabook for sure, but when folded closed does not feel like a brick.

The Kudu has a relatively compact profile


The Keyboard and Track Pad

The keyboard and track pad are comfortable and reliable. The keys are backlit and slightly concave, which feels great to type on. The track pad is responsive but not over-sensitive, and it supports gestures nicely. I have used many Windows laptops (including the Surface Pro 4 Touch Keyboard) and none of them approach this level of quality in a track pad. I have not used a MacBook in a while so I cannot compare to its track pad, which I understand is the best in terms of standards.

I have heard some people have issues activating the track pad accidentally while typing. I have not had this issue but System76 provides an easy means to turn the track pad on/off.

Notice the Ubuntu key!


Functionality and Performance

This laptop is fast. Of course I paid an extra $190 or so for the PCIe M.2 SSD, but it was well worth it. Everything from IntelliJ IDEA to Atom Editor opens almost instantly. When I booted Linux Mint off a USB stick, I recall the OS loaded in a few seconds.

I never was a big fan of Ubuntu after discovering Linux Mint. Although Linux Mint is based off Ubuntu, it does a much better job of "just working" from my experience. It is faster and more intuitive to navigate. The Cinnamon version of Linux Mint feels like a modernized Windows XP with better aesthetics and lean resource usage. I even put my parents' desktop on Linux Mint after Windows crashed, and they have used it daily without complaints.

That being said, you can use the stock Ubuntu installation or put Linux Mint on easily. Both are compatible with the same Debian/Ubuntu-based software which I'll discuss later.

The screen is beautiful and big, with no backlight bleeds or dead pixels. It is easy to multitask and have multiple windows open. Being able to review and edit code with the large workspace is also a plus. One task I especially am happy with is doing Markdown editing for books. Having enough screen real estate to have the editor on the left and the rendering on the right makes a huge difference to productivity. I also had some annoying lag on my Surface Pro 3 working with Atom Editor in Markdown Preview, but it is pretty snappy on the Kudu.


Writing books with Atom Editor on the Kudu's large HD screen is an absolute joy


Overall, Linux Mint feels like it is made for the System76 Kudu. Everything is snappy, fast, and instantaneous. The Ubuntu key brings up the home menu and everything works optimally out-of-the-box.

The Software

I have been using Ubuntu and Linux Mint off-and-on for about 4 years now. Since Linux Mint is based off Ubuntu (which is based off Debian), you can easily install software built for any of those distros. LibreOffice comes pre-installed for both Ubuntu and Linux Mint, not that it matters since you can download it for free at any time. For a Microsoft Office alternative, LibreOffice works pretty well. However, I have had many issues moving presentations from Impress to PowerPoint and vice versa. The cross-compatibility is somewhat exaggerated as slide content can be misaligned and scattered. I learned to stick with PowerPoint if I am going to give my presentations to Office users.

Speaking of Microsoft Office, you can run Windows 10 inside VirtualBox for free. VirtualBox is an open-source virtual machine software that allows you to run an operating system inside an operating system. In other words, you can run Windows 10 inside Ubuntu or Linux Mint.

Running Windows 10 inside Linux Mint using VirtualBox

Although Microsoft does not advertise it, Windows 10 is effectively free now and you do not need a license key. This means it costs nothing to set up a Windows 10 virtual machine inside your Ubuntu or Linux Mint installation. I have used VirtualBox for a couple of years and never had any issues with it, other than it uses more battery.

You can also dual-boot or even exclusively use Windows 10 instead of Linux.

Conclusion

The System76 Kudu is probably the most productive laptop I ever had, outperforming my Surface Pro 3 and every other device I have owned. The hardware feels great and runs snappy. The experience of using a System76 has that "premium" quality lacking in most PCs now. If you are interested in graphics-intensive gaming and multimedia applications, you might want to consider getting the System76 Oryx Pro. But for a workhorse laptop, the Kudu is great.

System76 has some great help guides, especially for non-techies coming from Mac OS X or Windows. There is growing support for open-source creative software as well. For developers and power users, it provides a hardware experience that does Linux justice.


Wednesday, November 30, 2016

Using the Kotlin Language with Apache Spark


About a month ago I posted an article proposing Kotlin as another programming language for data science. It is a pragmatic, readable language created by JetBrains, the creator of IntelliJ IDEA and PyCharm. It has received growing popularity on Android and focuses on industrial use rather than experimental functionality. Just like Java and Scala, Kotlin compiles to bytecode and runs on the Java Virtual Machine. It also works with Java libraries out-of-the-box with no hiccups, and in this article I’m going to show how to use it with Apache Spark.

Officially, you can use Apache Spark with Scala, Java, Python, and R. If you are happy using any of these languages with Spark, you likely will not need Kotlin. But if you tried to learn Scala or Java and found it was not for you, you might want to give Kotlin a look. It is a legitimate fifth option that works out-of-the-box with Spark.

I recommend using IntelliJ IDEA as it natively includes Kotlin support. It is an excellent IDE that you can also use with Java and Scala. I also recommend using Gradle for your build automation.
Kotlin is replacing Groovy as the official scripting language for Gradle builds. You can read more about it in the article Kotlin Meets Gradle.

Setting Up

To get started, make sure to install the following:
  • Java JDK - the Java Development Kit
  • IntelliJ IDEA - IDE for Java, Kotlin, Scala, and other JVM projects
  • Gradle - Build automation system; download the Binary Only distribution and unzip it to a location of your choice
You will need to configure IntelliJ IDEA to use your Gradle location. Launch IntelliJ IDEA and set this up in Settings -> Build, Execution, and Deployment -> Gradle. If you have trouble, there should be plenty of walkthroughs online.

Let’s create our Kotlin project. Using your operating system, create a folder with the following structure:
kotlin_spark_project
      |
      └────src
            |
            └────main
                  |
                  └────kotlin

Your project folder needs to have a folder structure inside of it containing /src/main/kotlin/. This is important so Gradle will recognize this as a Kotlin project.
Next, create a text file named build.gradle and use a text editor to put in the following contents. This is the script that will configure your project as a Kotlin project. You can read more about Kotlin Gradle configurations here.
buildscript {
    ext.kotlin_version = '1.0.5'
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
    }
}

apply plugin: "kotlin"

repositories {
    mavenCentral()
}

dependencies {
    compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"

    //Apache Spark
    compile 'org.apache.spark:spark-core_2.10:1.6.1'
}

Finally, launch IntelliJ IDEA, click Import Project, and navigate to the location of the Kotlin project folder you just created. In the wizard, check Import project from external model with the Gradle option. Click Next, then select Use Local Gradle Distribution with the Gradle copy you downloaded. Then click Finish.

Your workspace should now be set up with a Kotlin project as shown below. If you do not see the project explorer on the left press ALT + 1. Then double-click on the project folder and navigate down to the kotlin folder.




Right click the kotlin folder and select New -> Kotlin File/Class.



Name the file “SparkApp” and press OK. You will now see a SparkApp.kt file added to your kotlin folder. An editor will open on the right.

Using Spark with Kotlin

Let’s put our Spark usage in the SparkApp.kt file. Spark was written with Scala. While Kotlin does not work directly with Scala, it does have 100% interoperability with Java. Thankfully, Spark has a Java API by providing a JavaSparkContext. We can leverage this to use Spark out-of-the-box with Kotlin.
Create a main() function below which will be the entry point for our Kotlin application. Be sure to import the needed Spark dependencies as well. In your main() function, configure your SparkConf and create a new JavaSparkContext off of it.
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

fun main(args: Array<String>) {

    val conf = SparkConf()
            .setMaster("local")
            .setAppName("Kotlin Spark Test")

    val sc = JavaSparkContext(conf)
}

The JavaSparkContext provides a Java API for creating Spark RDDs (resilient distributed datasets). Thankfully, we can use the excellent Kotlin lambda syntax, which the Kotlin compiler will translate into the needed Java functional types.
Let’s take a List of Strings containing alphanumeric text values separated by / characters and turn it into an RDD. We will break these values up, filter only for numbers, and then find their sum.
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

fun main(args: Array<String>) {

    val conf = SparkConf()
            .setMaster("local")
            .setAppName("Kotlin Spark Test")

    val sc = JavaSparkContext(conf)

    val items = listOf("123/643/7563/2134/ALPHA", "2343/6356/BETA/2342/12", "23423/656/343")

    val input = sc.parallelize(items)

    val sumOfNumbers = input.flatMap { it.split("/") }
            .filter { it.matches(Regex("[0-9]+")) }
            .map { it.toInt() }
            .reduce {total,next -> total + next }

    println(sumOfNumbers)
}

If you click the Kotlin logo right next to your main() function in the gutter, you can run this Spark application.



A console should pop up below and start logging Spark’s events. I did not turn off logging so it will be a bit noisy. But ultimately you should see the value of sumOfNumbers printed.
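
To know which number to look for in the noisy console output, the same pipeline can be sketched with plain Kotlin collections and no Spark at all. This is only a local sanity check, assuming the list operators behave like the Spark ones used above; the helper name sumOfNumbers() is my own and is not part of any Spark API.

```kotlin
// Plain-Kotlin sanity check of the Spark pipeline above (no Spark required).
// sumOfNumbers() is a hypothetical helper name used only for this sketch.
fun sumOfNumbers(items: List<String>): Int =
    items.flatMap { it.split("/") }              // break each entry into tokens
        .filter { it.matches(Regex("[0-9]+")) }  // keep only all-digit tokens
        .map { it.toInt() }
        .reduce { total, next -> total + next }  // add the numbers together

fun main(args: Array<String>) {
    val items = listOf("123/643/7563/2134/ALPHA", "2343/6356/BETA/2342/12", "23423/656/343")
    println(sumOfNumbers(items)) // prints 45938
}
```

If the Spark run prints the same number somewhere in its log output, the job is wired up correctly.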


Conclusion

I will show a few more examples in the coming weeks on how to use Kotlin with Spark (you can also check out my GitHub project). Kotlin is a pragmatic, readable language that I believe has potential for adoption in Spark. It just needs more documentation for this purpose. If you want to learn more about Kotlin, you can read the Kotlin Reference as well as check out a few books that are out there. I have heard great things about the O’Reilly video series on Kotlin, which I understand is helpful for folks who do not have knowledge of Java, Scala, or other JVM languages.

If you learn Kotlin you can likely translate existing books and documentation on Spark into Kotlin usage. I’ll do my best to share my discoveries and any nuances I may encounter. For now, I do recommend giving it a look if you are not satisfied with your current languages.

Friday, October 28, 2016

Kotlin for Data Science

Can Kotlin be an Effective Alternative for Python and Scala?

As I started diving formally into data science, I could not help but notice that there is a large gap between data science and software engineering. It is good to come up with prototypes, ideas, and models out of data analysis. However, executing those ideas is another animal. You can outsource the execution to a software engineering team, and that can go well or badly depending on a number of factors. In my experience, it is often helpful to do the execution yourself or at least offer assistance by modeling towards production.

Although Python can be used to build production software, I find its lack of static typing causes difficulty in scaling. It does not easily plug in with large corporate infrastructures built on the Java platform either. Scala, although an undeniably powerful JVM language, is somewhat esoteric and does not click with everyone, especially those who do not have a software engineering background or love of expressing code with mathematical flair. But Kotlin, a new JVM language by JetBrains (the creator of IntelliJ IDEA, PyCharm, and dozens of other developer tools), has an active community, rapid growth and adoption, and might serve as a pragmatic alternative to Scala. Although Kotlin is unlikely to replace Python's numerical efficiency and data science libraries, it might make a practical addition to your toolbelt, especially since it works with Spark out-of-the-box, as shown below.

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import kotlin.reflect.KClass
 
fun main(args: Array<String>) {
 
    class MyItem(val id: Int, val value: String)
 
    val conf = SparkConf()
            .setMaster("local")
            .setAppName("Line Counter")
 
    conf.registerKryoClasses(MyItem::class)
 
    val sc = JavaSparkContext(conf)
 
    val input = sc.parallelize(listOf(MyItem(1,"Alpha"),MyItem(2,"Beta")))
 
    val letters = input.flatMap { it.value.split(Regex("(?<=.)")) }
            .map { it.toUpperCase() }
            .filter { it.matches(Regex("[A-Z]")) }
    println(letters.collect())
}
/*extension function to register Kotlin classes*/
fun SparkConf.registerKryoClasses(vararg args: KClass<*>) = 
    registerKryoClasses(args.map { it.java }.toTypedArray())

I'll cover Spark with Kotlin another time, but you can look at this simple GitHub project if you like. Apache Spark is definitely a step in the right direction to close the gap between data science and software engineering, or more specifically, turning an idea into immediate execution. You can use SparkR and PySpark to interface R and Python with Spark. But if you want to use a production-grade JVM language, the only mainstream options seem to be Scala and Java. But as stated Kotlin works with Spark too, as it is 100% interoperable with all Java libraries.

A Comparison Between Python and Kotlin

Let's take a look at a somewhat simple data analysis case study. For now, we will leave Scala and Spark out and compare only Kotlin with Python. What I want to highlight is how Kotlin has the tactical conciseness of Python, and maybe in some ways brings a little more to the table as a language for data analysis. Granted, there are not a lot of mainstream JVM data science libraries other than Apache Spark. But they do exist, perhaps there is room for growth, and the language may be worth keeping your eye on (and even exploring).

This comparison was inspired by the social media example from the first chapter of Data Science from Scratch (Grus, O'Reilly). Let's start with declaring two sets of data, users and friendships. Using only dicts, lists, and tuples without any classes, this is how it could be done in Python.

Python

users = [ 
    { "id" : 0, "name" : "Hero" }, 
    { "id" : 1, "name" : "Dunn" }, 
    { "id" : 2, "name" : "Sue" }, 
    { "id" : 3, "name" : "Chi" }, 
    { "id" : 4, "name" : "Thor" }, 
    { "id" : 5, "name" : "Clive" }, 
    { "id" : 6, "name" : "Hicks" }, 
    { "id" : 7, "name" : "Devin" }, 
    { "id" : 8, "name" : "Kate" }, 
    { "id" : 9, "name" : "Klein" }, 
]
 
friendships = [ 
    (0,1), 
    (0,2), 
    (1,2), 
    (1,3), 
    (2,3), 
    (3,4), 
    (4,5), 
    (5,6), 
    (5,7), 
    (6,8), 
    (7,8), 
    (8,9)
]

users is a list of dict items, and friendships is a list of tuple items. A feature of dynamic typing is that you can be fast and loose in creating data structures that maintain a raw data-like nature. There is no requirement to use classes or explicit types.

The equivalent to doing this in Kotlin would look like this:

Kotlin

val users = listOf(
    mapOf("id" to 0, "name" to "Hero"),
    mapOf("id" to 1, "name" to "Dunn"),
    mapOf("id" to 2, "name" to "Sue"),
    mapOf("id" to 3, "name" to "Chi"),
    mapOf("id" to 4, "name" to "Thor"),
    mapOf("id" to 5, "name" to "Clive"),
    mapOf("id" to 6, "name" to "Hicks"),
    mapOf("id" to 7, "name" to "Devin"),
    mapOf("id" to 8, "name" to "Kate"),
    mapOf("id" to 9, "name" to "Klein")
    )
 
 val friendships = listOf(
        listOf(0,1), listOf(0,2), listOf(1,2), listOf(1,3), listOf(2,3), listOf(3,4),
        listOf(4,5), listOf(5,6), listOf(5,7), listOf(6,8), listOf(7,8), listOf(8,9)
 )

For the friendships, we can actually create Pair<Int,Int> items. Kotlin does not really encourage Tuples (or any collection with differing types), and we will see later what it offers instead with the data class. For this example, though, let's use Pairs, created with the to operator.

val friendships = listOf(
        0 to 1, 0 to 2, 1 to 2, 1 to 3, 2 to 3, 3 to 4,
        4 to 5, 5 to 6, 5 to 7, 6 to 8, 7 to 8, 8 to 9
)
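
As a quick illustration of what the to operator produces, here is a minimal sketch. The helper describe() is a hypothetical name introduced only for this example.

```kotlin
// `to` is an infix function in the Kotlin standard library that builds a Pair.
// describe() is a hypothetical helper used only for this sketch.
fun describe(friendship: Pair<Int, Int>): String {
    val (userId, friendId) = friendship   // a Pair can be destructured
    return "$userId is friends with $friendId"
}

fun main(args: Array<String>) {
    val friendship = 0 to 1
    println(friendship == Pair(0, 1))   // prints true
    println(friendship.first)           // prints 0
    println(friendship.second)          // prints 1
    println(describe(friendship))       // prints 0 is friends with 1
}
```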

This may look effective at first glance, and Kotlin is statically typed. It infers the types of users and friendships as List<Map<String,Any>> and List<List<Int>> respectively (or List<Pair<Int,Int>> for the Pair version). Notice how users is a List containing Map<String,Any> items, meaning each item has a String for a key and an Any for the value. The reason for the Any is that some values are String and others are Int (due to the "id" and "name"), and because the type is not consistent the compiler falls back to the common supertype Any. If we want to use Hero's "id", we need to cast it to an Int for it to be treated like an Int rather than a raw Any.

val herosId = users[0]["id"] as Int

Of course, if an "id" value is accidentally slipped in as a String, this cast would throw an error. You can check whether it is an Int first, but at that point we are just fighting the statically-typed nature of Kotlin (just as we would in Java and Scala). In Kotlin, we are much better off creating a class and doing things the statically-typed way. While this may make dynamic-typing advocates moan, check this out. Kotlin has a concise, readable way of declaring a class quickly and easily, even exceeding Python's standards.
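
Before moving on to classes, the type check mentioned above might look like the following sketch. It uses Kotlin's safe-cast operator as?, which yields null instead of throwing when the value has the wrong type. The helper idOf() is a hypothetical name of mine.

```kotlin
// idOf() is a hypothetical helper: it returns the "id" as an Int, or null when
// the entry is missing or holds a value of some other type.
fun idOf(user: Map<String, Any>): Int? = user["id"] as? Int

fun main(args: Array<String>) {
    val good: Map<String, Any> = mapOf("id" to 0, "name" to "Hero")
    val bad: Map<String, Any> = mapOf("id" to "0", "name" to "Hero") // "id" slipped in as a String

    println(idOf(good))  // prints 0
    println(idOf(bad))   // prints null

    // The unchecked cast fails at runtime instead.
    try {
        val id = bad["id"] as Int
        println(id)
    } catch (e: ClassCastException) {
        println("cast failed")  // prints cast failed
    }
}
```

This works, but every caller now has to deal with a nullable Int, which is the kind of friction a proper class avoids.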

Python

class User(object):
    def __init__(self, id, name):
        self.id = id
        self.name = name
 
    def __str__(self):
        return "{0}-{1}".format(self.id, self.name)
 
users = [ 
    User(0,"Hero"), 
    User(1,"Dunn"), 
    User(2,"Sue"), 
    User(3,"Chi"), 
    User(4,"Thor"), 
    User(5,"Clive"), 
    User(6,"Hicks"), 
    User(7,"Devin"), 
    User(8,"Kate"), 
    User(9,"Klein"), 
]

Kotlin

data class User(val id: Int, val name: String)
 
val users = listOf(
      User(0, "Hero"),
      User(1, "Dunn"),
      User(2, "Sue"),
      User(3, "Chi"),
      User(4, "Thor"),
      User(5, "Clive"),
      User(6, "Hicks"),
      User(7, "Devin"),
      User(8, "Kate"),
      User(9, "Klein")
    )

Not too shabby, right? Technically, we did less typing (as in keyboard typing) than in Python (76 fewer characters to be exact, excluding spaces). And we achieved static typing in the process.

Kotlin is certainly a progressive language compared to Java, and it even has practical features like data classes. We made our User a data class, which automatically implements functionality typically used for classes holding plain data. It implements toString() and hashCode()/equals() using the properties, as well as a nifty "copy-and-modify" builder in the form of a copy() function. (This aids flexibility while maintaining immutability, which is valued in software engineering.)

Kotlin

data class User(val id: Int, val name: String)
 
val user = User(10,"Tom")
val changedUser = user.copy(name = "Thomas")
 
println("Old user: $user")
println("New user: $changedUser")

OUTPUT:

Old user: User(id=10, name=Tom)
New user: User(id=10, name=Thomas)

NOTE: In Kotlin, val precedes the declaration of an immutable variable. var precedes a mutable one.

Data classes are a valuable tool especially for working with data. And yes, Kotlin supports named arguments for constructors and functions as shown in the copy() function above.
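
To make the generated equals(), hashCode(), and toString() concrete, here is a small sketch using the same User data class. Nothing beyond the Kotlin standard library is assumed.

```kotlin
data class User(val id: Int, val name: String)

fun main(args: Array<String>) {
    val a = User(1, "Dunn")
    val b = User(1, "Dunn")

    println(a == b)                        // prints true: equals() compares id and name
    println(a.hashCode() == b.hashCode())  // prints true
    println(setOf(a, b).size)              // prints 1: equal items collapse in a Set
    println(a)                             // prints User(id=1, name=Dunn)
    println(a.copy(id = 2))                // prints User(id=2, name=Dunn)
}
```

Because equality is based on the properties, data classes behave sensibly as Map keys and Set members without any extra code.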

Let's return to our example. Say we wanted to find the mutual friends between two Users. Traditionally in Python, you would create a series of helper functions to assist in this task.

Python

class User(object):
    def __init__(self, id, name):
        self.id = id
        self.name = name
 
    def __str__(self):
        return "{0}-{1}".format(self.id, self.name)
 
users = [ 
    User(0,"Hero"), 
    User(1,"Dunn"), 
    User(2,"Sue"), 
    User(3,"Chi"), 
    User(4,"Thor"), 
    User(5,"Clive"), 
    User(6,"Hicks"), 
    User(7,"Devin"), 
    User(8,"Kate"), 
    User(9,"Klein"), 
]
 
friendships = [ 
    (0,1), 
    (0,2), 
    (1,2), 
    (1,3), 
    (2,3), 
    (3,4), 
    (4,5), 
    (5,6), 
    (5,7), 
    (6,8), 
    (7,8), 
    (8,9)
]
 
def user_for_id(user_id):
    for user in users:
        if user.id == user_id:
            return user
 
 
 
def friends_of(user):
    for friendship in friendships:
        if friendship[0] == user.id or friendship[1] == user.id:
            for other_user_id in friendship:
                if other_user_id != user.id:
                    yield user_for_id(other_user_id)
 
def mutual_friends_of(user, otherUser):
    for friend in friends_of(user):
        for other_friend in friends_of(otherUser):
            if (friend.id == other_friend.id):
                yield friend
 
 
# print mutual friends between Hero and Chi 
 
for friend in mutual_friends_of(users[0],users[3]):
    print(friend)

OUTPUT:

1-Dunn
2-Sue

But we can do something similar in Kotlin. This is our first pass, so I'll show a better way in a moment.

Kotlin

fun main(args: Array<String>)  {
 
    data class User(val id: Int, val name: String)
 
    val users = listOf(
        User(0,"Hero"),
        User(1, "Dunn"),
        User(2, "Sue"),
        User(3, "Chi"),
        User(4, "Thor"),
        User(5, "Clive"),
        User(6, "Hicks"),
        User(7, "Devin"),
        User(8, "Kate"),
        User(9, "Klein")
        )
 
    
    val friendships = listOf(
            0 to 1, 0 to 2, 1 to 2, 1 to 3, 2 to 3, 3 to 4,
            4 to 5, 5 to 6, 5 to 7, 6 to 8, 7 to 8, 8 to 9
    )
 
    fun userForId(id: Int): User {
        for (user in users)
            if (user.id == id)
                return user
        throw Exception("User not found!")
    }
 
    fun friendsOf(user: User): List<User> {
        val list = mutableListOf<User>()
        for (friendship in friendships) {
            if (friendship.first == user.id)
                list += userForId(friendship.second)
            if (friendship.second == user.id)
                list += userForId(friendship.first)
        }
        return list
    }
 
    fun mutualFriendsOf(user: User, otherUser: User): List<User> {
        val list = mutableListOf<User>()
        for (friend in friendsOf(user))
            for (otherFriend in friendsOf(otherUser))
                if (friend.id == otherFriend.id)
                    list += friend
 
        return list
    }
 
    for (friend in mutualFriendsOf(users[0],users[3]))
        println(friend)
}

OUTPUT:

User(id=1, name=Dunn)
User(id=2, name=Sue)

Although Kotlin seems to have lost in this example by being wordier and resorting to Lists, hold on. Kotlin has no direct concept of generators and yield keywords. However, we can accomplish something that fulfills the same purpose (and is arguably stylistically better) through Sequence.

fun main(args: Array<String>)  {
 
    data class User(val id: Int, val name: String)
 
    val users = listOf(
        User(0,"Hero"),
        User(1, "Dunn"),
        User(2, "Sue"),
        User(3, "Chi"),
        User(4, "Thor"),
        User(5, "Clive"),
        User(6, "Hicks"),
        User(7, "Devin"),
        User(8, "Kate"),
        User(9, "Klein")
        )
 
    
    val friendships = listOf(
            0 to 1, 0 to 2, 1 to 2, 1 to 3, 2 to 3, 3 to 4,
            4 to 5, 5 to 6, 5 to 7, 6 to 8, 7 to 8, 8 to 9
    )
 
    fun userForId(id: Int) = users.asSequence().filter { it.id == id }.first()
 
    fun friendsOf(user: User) = friendships.asSequence()
            .filter { it.first == user.id || it.second == user.id }
            .flatMap { sequenceOf(it.first,it.second) }
            .filter { it != user.id }
            .map { userForId(it) }
 
    fun mutualFriendsOf(user: User, otherUser: User) = friendsOf(user).flatMap { friend ->
        friendsOf(otherUser).filter { otherFriend -> otherFriend.id == friend.id }
    }
 
    mutualFriendsOf(users[0],users[3]).forEach { println(it) }
}

OUTPUT:

User(id=1, name=Dunn)
User(id=2, name=Sue)

We can use the Sequence to compose a series of operators as a chain, like filter(), map(), flatMap(), and many others. This style of functional programming has gained a lot of traction over the years thanks to LINQ, primarily because it breaks logic up into simple pieces and increases maintainability. I almost never use for loops anymore, and instead reach for a Kotlin Sequence, a Java 8 Stream, or an RxKotlin/RxJava Observable. This chain-operator syntax is becoming less alien in Python as well (look at PySpark and RxPy). What is great about this style of programming is that you can read what is happening left-to-right, top-to-bottom rather than jumping through several loops and helper functions.
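
Like a Python generator, a Sequence is evaluated lazily: items are pulled through the chain one at a time, and only as many as the final step needs. The sketch below shows this with an infinite sequence built by the standard library's generateSequence(). The helper name firstEvenSquares() is hypothetical and exists only for this illustration.

```kotlin
// firstEvenSquares() is a hypothetical helper used only for this sketch.
fun firstEvenSquares(count: Int): List<Int> =
    generateSequence(1) { it + 1 }   // infinite sequence 1, 2, 3, ...
        .filter { it % 2 == 0 }      // keep even numbers
        .map { it * it }             // square them
        .take(count)                 // stop after `count` results
        .toList()                    // terminal step: only now is anything computed

fun main(args: Array<String>) {
    println(firstEvenSquares(3)) // prints [4, 16, 36]
}
```

The same chain on a List would never finish building its first intermediate result, which is why the earlier examples call asSequence() before chaining.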

Conclusions

In the coming months, I am going to blog about my experiences using Kotlin for data science, and I will continue to share what I learn. I may throw in an article occasionally covering ReactiveX for data science as well (for both Kotlin and Python). I acknowledge that the Java JVM platform, which Kotlin runs on, does not handle numbers as efficiently as Python or R (maybe Project Valhalla will change that?). But successful models inevitably need to turn into execution, and the Java platform increasingly seems to be the place that happens.

Kotlin merely provides a pragmatic abstraction layer with a tactical and concise syntax that seems excellent not just for data analysis, but also for executing software. Outside of data science, Kotlin spurred many successful open-source libraries within a year of its release. One library, TornadoFX, allows rapid turnaround of complex business user interfaces using Kotlin (as a disclaimer, I have helped with that project). The Kotlin community is active, growing, and engaged on the Slack channel. It continues to be adopted on Android as well as backends, and JetBrains is using Kotlin to build all their tools (including PyCharm and IntelliJ IDEA). It is also replacing Groovy as the official language for Gradle. Because of these facts, I do not see Kotlin's momentum slowing down anytime soon.

I believe Kotlin could make a great first JVM language, more so than Java or Scala (I struggle to make Jython count). If you are already happy with Scala or Java you will likely not need Kotlin. But for folks wanting to break into JVM languages, there is a new O'Reilly video series that covers Kotlin from scratch. Its instructor Hadi Hariri (one of the JetBrains folks behind Kotlin) believes Pythonistas should be able to follow along. He said anybody familiar with classes, functions, properties, etc. should be able to learn Kotlin in a day with this video series. Unfortunately, the existing Kotlin documentation and books assume prior Java knowledge, so hopefully more resources other than the video pop up in the future.

There are a lot of exciting Kotlin features I have not covered in this article. Features like nullable types, extension properties and functions, and boilerplate-free delegates make the language pleasant to use and productive. So check out Kotlin if you are using Python for data science and want to learn a JVM language. Again, this is not a proposal to drop your current tools, but rather to consider exploring an additional one that may help you tackle new problems. I will continue blogging about my experiences with Kotlin, and showcase it being used in deeper data science topics as well as Spark.