When to use Java as a Data Scientist

Ben Weber
Towards Data Science
4 min read · Aug 15, 2020


While Python and R provide rich ecosystems for data scientists to handle a wide range of problems, there are situations in which other programming languages, including Java and Go, should be explored. I have found that hands-on experience with Java has been increasingly useful as I shift my focus from batch ML pipelines to data products that stream data in real time or have low-latency requirements. Although it may require substantial effort to ramp up on the Java programming language, there are a few situations I’ve encountered where it is beneficial to know Java.

- You are responsible for model production
- You are building a low-latency system

I see learning Java as a useful path for applied scientist positions, which have more of an engineering focus than most data science roles. While Python is still my go-to language for day-to-day tasks, whether that’s building a large-scale batch pipeline in PySpark or building an interactive web application with Flask or Dash, leveraging Java means that I can explore building a broader range of data products. In the remainder of this post, I’ll explore each of these topics in more detail.

You are responsible for model production

In a large organization, it’s typical to separate data scientists from the responsibility of spinning up infrastructure and managing a live data product. Instead, the data scientists may use a platform such as Databricks to run scheduled notebooks, or hand off model specifications to an engineering team using a format such as PMML, where the engineering team is responsible for infrastructure and system maintenance. Tools like AWS SageMaker are making it much easier for small teams to deploy models to production, but SageMaker is a proprietary AWS tool and not the best fit for every data product. One way that data scientists can grow their careers is by getting more involved with putting models into production.

The best tool for productizing an ML model, where predictions are used in live services or products, will depend on how the model is being served. For example, a model that is applied as part of a streaming pipeline will use different components than a model that is hosted as an API. The feature generation steps for different types of model workflows can also vary significantly. When you are responsible for building an end-to-end data product, you are essentially building a data pipeline where data is fetched from a source, features are calculated based on the retrieved data, a model is applied to the resulting feature vector or tensor, and the model results are stored or streamed to another system. While Python is great for model training and there are tools for model serving, it only covers a subset of the steps in this pipeline.
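As a rough way to picture those stages, the sketch below captures them as a single Java interface; the type parameters and method names are hypothetical placeholders rather than any specific framework’s API.

```java
// A sketch of the four pipeline stages described above; the generic types stand in
// for whatever raw record, feature, and prediction representations a product uses.
public interface DataProductPipeline<R, F, P> {
    R fetchRecords();              // pull raw data from a source (warehouse, stream, or API)
    F buildFeatures(R records);    // turn raw records into a feature vector or tensor
    P applyModel(F features);      // score the features with a trained model
    void publish(P predictions);   // store the results or stream them to another system
}
```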

This is where Java really shines, because it is the language used to implement many of the most commonly used tools for building data pipelines, including Apache Hadoop, Apache Kafka, Apache Beam, and Apache Flink. If you are responsible for building the data retrieval and data aggregation portions of a data product, then Java provides a wide range of tools. Getting hands-on with Java also means that you will build experience with the programming language used by many big data projects.
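As one illustration of the data retrieval side, here is a minimal sketch of a Kafka consumer in Java that polls an event topic; the broker address, consumer group, and topic name are placeholder assumptions rather than values from a real system.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "feature-builder");         // hypothetical consumer group
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("game-events")); // hypothetical topic
            while (true) {
                // Poll for new events; in a real pipeline this is where feature
                // aggregates would be updated or records handed to the next stage
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("user=%s event=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```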

My preferred tool for implementing these steps in a data workflow is Cloud Dataflow, which is based on Apache Beam. While many tools for data pipelines support multiple runtime languages, there may be significant performance differences between the Java and Python options. For example, Python pipelines running on Cloud Dataflow can suffer from long startup times when fetching and building the required libraries, versus using pre-built Java JAR files.
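To give a sense of what this looks like in practice, the sketch below is a minimal Beam pipeline in Java that reads records, applies a placeholder scoring step, and writes the results; the Cloud Storage paths and the score function are hypothetical stand-ins, not part of any real project.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ScoringPipeline {
    public static void main(String[] args) {
        // The runner (DirectRunner locally, DataflowRunner on GCP) is selected via the options
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            // Read raw records from a hypothetical input location
            .apply("ReadEvents", TextIO.read().from("gs://my-bucket/events/*.csv"))
            // Append a score to each record; a real pipeline would parse features
            // from the record and apply a trained model here
            .apply("ScoreRecords", MapElements
                .into(TypeDescriptors.strings())
                .via((String line) -> line + "," + score(line)))
            // Write the scored records to a hypothetical output location
            .apply("WriteScores", TextIO.write().to("gs://my-bucket/scores/output"));

        pipeline.run().waitUntilFinish();
    }

    // Stand-in for a trained model applied to a feature vector parsed from the record
    private static double score(String line) {
        return line.length() % 2 == 0 ? 1.0 : 0.0;
    }
}
```

Because Beam separates the pipeline definition from the runner, the same code can be tested locally with the direct runner and then submitted to Cloud Dataflow.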

You are building a low-latency system

Exposing an ML model as an HTTP endpoint is a common way of productizing a model. Python libraries such as Flask support this functionality, but there are situations where the performance of these libraries is not viable, such as when you have high-throughput or low-latency requirements. For example, you may need to build feature vectors for users in real time, where there is a firehose of events streaming to the endpoint. This typically involves working with a NoSQL database, because the latency of relational databases would be too large for the system to operate effectively.

If you need to build feature vectors for models in real time and serve predictions as an endpoint, then Java provides a rich ecosystem for achieving this goal. Java can be used with popular NoSQL offerings, including Redis, MongoDB, and Couchbase. There’s also a variety of web frameworks for standing up web endpoints, including Spring MVC, Netty, and Rapidoid. While Python can provide the same functionality, it’s not typically used in high-throughput applications such as ad tech.
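As a concrete, if simplified, example of this pattern, the sketch below exposes a prediction endpoint that looks up a precomputed feature in Redis and returns a score. It uses the JDK’s built-in HttpServer and the Jedis client as lightweight stand-ins for the frameworks above, and the port, key names, and scoring logic are all hypothetical.

```java
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import com.sun.net.httpserver.HttpServer;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class PredictionEndpoint {
    public static void main(String[] args) throws Exception {
        // Assumed: a local Redis instance holding a hash of precomputed features per user
        JedisPool pool = new JedisPool("localhost", 6379);

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/predict", exchange -> {
            // e.g. GET /predict?user=123 -- query parsing kept deliberately simple
            String query = exchange.getRequestURI().getQuery();
            String userId = (query == null) ? "unknown" : query.replace("user=", "");

            // Look up a precomputed feature value (hypothetical key and field names)
            double sessions;
            try (Jedis jedis = pool.getResource()) {
                String value = jedis.hget("features:" + userId, "sessions");
                sessions = (value == null) ? 0.0 : Double.parseDouble(value);
            }

            // Stand-in for applying a trained model to the feature vector
            double score = 1.0 / (1.0 + Math.exp(-0.1 * sessions));

            byte[] body = String.format("{\"user\":\"%s\",\"score\":%.3f}", userId, score)
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.setExecutor(null); // default executor is fine for a sketch
        server.start();
    }
}
```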

Conclusion

While Python is becoming viable for more and more tools within the big data ecosystem, there are still use cases where data scientists can benefit from leveraging Java. However, whether or not learning Java is useful depends on whether the engineering teams at your organization are already using Java and plan to continue authoring new systems in it. This is the case for a subset of teams at Zynga, and being able to contribute directly to production code bases is great for delivering data products more quickly. This is becoming increasingly common with the rise of the applied scientist role, and with data scientists who work at startups.

Ben Weber is a distinguished data scientist at Zynga. We are hiring!
