Status: In Development
Affects Version/s: None
Fix Version/s: None
Component/s: Core Infrastructure
The main goal of this new infrastructure is to build the right fundamentals so we can take the Liferay platform to the next level by increasing its resiliency and performance.
In pursuit of that goal, we aim to build a slim microservices platform where you can easily deploy your JVM-based services, scale them independently, make them fault tolerant, and run them in an easy and safe manner.
It is not the goal of this project to build a platform like Kubernetes, Mesos, or anything close to them; as we stated before, our goal is to increase the resiliency and performance of the Liferay platform.
So, this solution will not allow you to run thousands or millions of instances of the service of your choice, nor will it allow you to run multi-datacenter deployments of your services. Neither of these is a goal of the project.
Nor is it a goal to allow you to deploy and manage other software such as databases, message queues, or caching systems.
The very basic idea is that we're just trying to make two different services communicate across the network. Although that sounds like an easy task, it takes us into a brave new world: that of distributed systems.
We'd like to highlight here some of the typical (false) assumptions people make when building distributed systems, known as the fallacies of distributed computing:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
Sadly, distributed systems are a complex beast, but we're trying to put some solid fundamentals in place, aiming to build a powerful and robust solution.
Of course, we'll have to make some tradeoffs; we will try to document them and make sure all the assumptions we make are publicly available.
The following subsections highlight the main subsystems of the solution we are planning to build. We'll try to keep this document in sync with the real implementation (every technical task created under this story will include the technical details behind it).
HTTP/2 is the communication protocol of our choice. The main innovation HTTP/2 introduces with respect to its predecessor (HTTP/1.1) is the separation of connection management concerns (Layer 4) from the concerns of transmitting messages (Layer 5).
Messages can now be multiplexed, reordered, and cancelled, eliminating bottlenecks and improving the performance and reliability of the application.
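As a minimal illustration of HTTP/2 adoption on the JVM (our gRPC transport manages its own HTTP/2 connections; this JDK 11+ snippet is just for context), a client can be asked to prefer HTTP/2 and will fall back to HTTP/1.1 via ALPN when the server does not support it:

```java
import java.net.http.HttpClient;

// Illustrative sketch: a JDK HttpClient that prefers HTTP/2.
// Negotiation happens per connection; unsupported servers get HTTP/1.1.
public class Http2ClientExample {

    public static HttpClient newHttp2Client() {
        return HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2) // preferred protocol version
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = newHttp2Client();
        System.out.println(client.version()); // HTTP_2
    }
}
```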
We're using Kryo (https://github.com/EsotericSoftware/kryo).
Kryo is a fast and efficient object graph serialization framework for Java. Its main goals are speed, efficiency, and an easy to use API.
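A typical Kryo round trip looks like the following sketch (it assumes Kryo is on the classpath; the `Point` class is a placeholder for any service payload, not part of the actual implementation):

```java
import java.io.ByteArrayOutputStream;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

// Illustrative sketch: serialize an object graph with Kryo and read it back.
public class KryoRoundTrip {

    public static class Point {
        public int x;
        public int y;
    }

    public static void main(String[] args) {
        Kryo kryo = new Kryo();
        kryo.register(Point.class); // registering classes up front is faster and safer

        Point in = new Point();
        in.x = 3;
        in.y = 4;

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Output output = new Output(bytes)) {
            kryo.writeObject(output, in); // serialize
        }
        try (Input input = new Input(bytes.toByteArray())) {
            Point out = kryo.readObject(input, Point.class); // deserialize
            System.out.println(out.x + "," + out.y); // 3,4
        }
    }
}
```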
We are using the load balancing support offered by gRPC in order to achieve our load balancing goals. gRPC's load balancing support works per request, so even if all the requests come from the same client, they will still be load-balanced across all the servers.
gRPC's design is based on an external load balancer entity (implementing the gRPC load balancing protocol).
We'll plug in our custom load balancer so the client will be aware of the load balancing. The goal of this is to simplify operations. The idea is that you could still use an external load balancing entity if you prefer that approach.
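A hedged sketch of what client-side load balancing looks like with grpc-java (the target name `my-service` is a placeholder; the exact policy wiring in our runtime may differ):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Illustrative sketch: a channel configured for client-side round-robin
// load balancing across all addresses the resolver returns for the target.
public class LoadBalancedChannel {

    public static ManagedChannel create() {
        return ManagedChannelBuilder
                .forTarget("dns:///my-service:50051")      // resolves to N backends
                .defaultLoadBalancingPolicy("round_robin") // balance per call, not per connection
                .usePlaintext()
                .build();
    }
}
```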
We'll be living in a "highly" dynamic world where services come and go: autoscaling, service failures, network errors, and so on. We need a mechanism to detect when services join and leave. As you can imagine, this topic is closely related to the previous one (load balancing). This service will help us decide which services are available at the time we make a request.
The default implementation for the first prototype will be backed by JGroups, using a multicast-based implementation. We do know that this is far from optimal (especially the multicast part).
However, this can be useful for small installations and, especially, for getting the first version of the new runtime up and running very quickly.
That said, we would like to build a gossip-based service discovery mechanism (more concretely, one based on SWIM). A prototype is already in progress.
Technically, you should be able to implement your own service discovery according to your needs (for example, your service discovery could be backed by ZooKeeper, Consul, etcd, ...).
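The pluggable contract could look like the following sketch. All the names here (`ServiceDiscovery`, `Endpoint`, `InMemoryDiscovery`) are hypothetical illustrations, not the actual API; a real implementation could be backed by JGroups, a SWIM-style gossip protocol, ZooKeeper, Consul, etcd, and so on:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch of a pluggable service-discovery contract.
public class DiscoveryExample {

    record Endpoint(String host, int port) {}

    interface ServiceDiscovery {
        void register(String service, Endpoint endpoint);
        void unregister(String service, Endpoint endpoint);
        List<Endpoint> lookup(String service);
    }

    // Trivial in-memory implementation, useful for tests and single-node setups.
    static class InMemoryDiscovery implements ServiceDiscovery {

        private final Map<String, List<Endpoint>> registry = new ConcurrentHashMap<>();

        public void register(String service, Endpoint endpoint) {
            registry.computeIfAbsent(service, s -> new CopyOnWriteArrayList<>())
                    .add(endpoint);
        }

        public void unregister(String service, Endpoint endpoint) {
            registry.getOrDefault(service, new CopyOnWriteArrayList<>())
                    .remove(endpoint);
        }

        public List<Endpoint> lookup(String service) {
            return List.copyOf(registry.getOrDefault(service, List.of()));
        }
    }

    public static void main(String[] args) {
        ServiceDiscovery discovery = new InMemoryDiscovery();
        discovery.register("journal-service", new Endpoint("10.0.0.1", 8080));
        discovery.register("journal-service", new Endpoint("10.0.0.2", 8080));
        System.out.println(discovery.lookup("journal-service").size()); // 2
    }
}
```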
The goal of this part of the stack is to stop cascading failures by providing a way to register fallbacks and degrade gracefully.
This is something we need to revisit because gRPC provides some built-in mechanisms that could serve as the foundation of this functionality. We're not sure whether gRPC will provide something like this in the near future, so we're still exploring some options in order to make the proper decision.
One option would be to build something simple on top of the gRPC fundamentals and replace it once it becomes available in the main line of development. The other option would be to integrate, or at least try to integrate, Hystrix.
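To make the idea concrete, here is a deliberately minimal circuit-breaker sketch with a fallback. The names are hypothetical and a real breaker would also need a half-open state with a reset timeout; this is only an illustration of the pattern, not the project's implementation:

```java
import java.util.function.Supplier;

// Illustrative sketch: a circuit breaker that fails fast after repeated errors.
public class CircuitBreakerExample {

    static class CircuitBreaker {

        private final int failureThreshold;
        private int consecutiveFailures;

        CircuitBreaker(int failureThreshold) {
            this.failureThreshold = failureThreshold;
        }

        boolean isOpen() {
            return consecutiveFailures >= failureThreshold;
        }

        // Run the remote call; on failure (or when open) use the fallback.
        <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
            if (isOpen()) {
                return fallback.get(); // fail fast: stop hammering a dying service
            }
            try {
                T result = remoteCall.get();
                consecutiveFailures = 0; // success closes the circuit again
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                return fallback.get();
            }
        }
    }

    public static void main(String[] args) {
        CircuitBreaker breaker = new CircuitBreaker(3);
        Supplier<String> failing = () -> { throw new RuntimeException("boom"); };
        for (int i = 0; i < 3; i++) {
            breaker.call(failing, () -> "fallback");
        }
        System.out.println(breaker.isOpen()); // true
    }
}
```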
We need a module in our stack that allows us to retry invocations in the presence of failures. We have to make sure these retries happen only when it is safe to do so.
Again, this is something we'll probably build on top of gRPC in case it is not already available. (Note: gRPC already provides a built-in Deadline mechanism.)
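A hedged sketch of what "safe" retries could look like: only idempotent calls are retried, attempts are capped, backoff grows between attempts, and an overall deadline (mirroring gRPC's Deadline concept) bounds the whole operation. All names here are illustrative:

```java
import java.util.concurrent.Callable;

// Illustrative sketch: bounded retries with exponential backoff and a deadline.
public class RetryExample {

    static <T> T retryWithDeadline(Callable<T> idempotentCall,
                                   int maxAttempts,
                                   long deadlineMillis) throws Exception {
        long deadline = System.currentTimeMillis() + deadlineMillis;
        long backoff = 10; // initial backoff in ms
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (System.currentTimeMillis() > deadline) {
                break; // deadline exceeded: retrying is no longer useful
            }
            try {
                return idempotentCall.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(backoff);
                backoff *= 2; // exponential backoff between attempts
            }
        }
        throw last != null ? last : new IllegalStateException("deadline exceeded");
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        String result = retryWithDeadline(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, 5, 1_000);
        System.out.println(result + " after " + calls[0] + " attempts"); // ok after 3 attempts
    }
}
```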
Once we move all our services to production, they will face reality: where we used to have one single service, we will now, potentially, have a bunch of independent services communicating with each other in a now-distributed system. Tasks like optimizing user-facing latency, analyzing backend errors, and following communication across the different pieces of our system become more difficult than they used to be.
We will include a distributed tracing module in our stack to help us diagnose many of the problems our systems will face when we put them into the real world.
The default implementation will be provided by Zipkin (http://zipkin.io/), which is compatible with the OpenTracing initiative (http://opentracing.io).
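The core idea behind distributed tracing can be sketched in a few lines: every request carries a trace id shared by all the services it touches, plus a span id per hop, so a tool like Zipkin can reassemble the full call tree. The names below are hypothetical; real instrumentation libraries handle this propagation for you:

```java
import java.util.Random;

// Illustrative sketch: trace-context creation and propagation across hops.
public class TracingExample {

    record SpanContext(long traceId, long spanId, long parentSpanId) {}

    static final Random IDS = new Random();

    // Start a brand-new trace at the edge of the system.
    static SpanContext newTrace() {
        return new SpanContext(IDS.nextLong(), IDS.nextLong(), 0);
    }

    // Create a child span to propagate to the next service in the call chain.
    static SpanContext childOf(SpanContext parent) {
        return new SpanContext(parent.traceId(), IDS.nextLong(), parent.spanId());
    }

    public static void main(String[] args) {
        SpanContext gateway = newTrace();
        SpanContext journal = childOf(gateway); // same trace, new span
        System.out.println(gateway.traceId() == journal.traceId()); // true
    }
}
```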
Monitoring is a crucial part of every production system, and we want to provide you with all the information you need in order to make the right diagnosis of your system.
All our servers and clients will incorporate a metrics registry that will try to provide all the information you need. Right now there is no list of the predefined set of metrics that will be available.
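As an illustration of the kind of registry every server and client could embed, here is a minimal counter-based sketch. The names and metrics are hypothetical, since the final set of metrics has not been defined yet:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: a thread-safe metrics registry of named counters.
public class MetricsExample {

    static class MetricsRegistry {

        private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

        void increment(String name) {
            counters.computeIfAbsent(name, n -> new LongAdder()).increment();
        }

        long count(String name) {
            LongAdder adder = counters.get(name);
            return adder == null ? 0 : adder.sum();
        }
    }

    public static void main(String[] args) {
        MetricsRegistry metrics = new MetricsRegistry();
        metrics.increment("rpc.requests");
        metrics.increment("rpc.requests");
        metrics.increment("rpc.errors");
        System.out.println(metrics.count("rpc.requests")); // 2
    }
}
```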
For now, our basic deployment units will be JAR files in which our services are packaged. We may explore other possibilities in the future.
Our system will provide you with the ability to deploy these deployment units across all the servers that make up the cluster.
When building distributed systems, the ideal approach is to use asynchronous message passing (in any of its variants; look at Erlang or Akka, or at slightly different, Future-based projects like Finagle).
However, all the Liferay services and APIs are (mainly) synchronous, which makes things more complicated from a distributed systems perspective. All the infrastructure explained in the previous sections tries to mitigate these problems and make life much simpler for the people writing services.
Most of Liferay's services are connected to a relational database and surrounded by a transactional model (we're not going to cover Liferay's transaction model here).
However, when one of the services you're consuming is not running on the same machine, your transaction boundaries are "broken".
It is not clear yet how we want to deal with this. The next subsection tries to summarize the different options we have on the table.