Introducing NGenate
I'd like to introduce a new platform that I'm building with Rust that is focused on visual programming and data flow. I believe it could have use cases in several fields including Data Engineering, Algorithmic Trading and Procedural Graphics. I'm in the middle of designing and prototyping but have enough definition on the problem space that I'd like to share my ideas and where I hope to head over the next few years. I hope that this post can show how a bit of creativity might make its way into fields where it's often not present.
Please be aware that this is a side project and that it's still very early days so you may want to tune in again later for a functional prototype.
So why visual programming?
Visual programming has always fascinated me, and I think a seed was planted in my former industry when I studied a visual FX tool called Houdini. Houdini has grown so much over the decades and is a powerful, well respected application within the VFX industry - all centered around dataflow and visual programming. So my thinking is: if Houdini was able to pull off the magic trick of bringing visual programming into the mainstream within its industry, could I somehow pull this off with Rust in other areas such as general data flow programming and Data Engineering?
I have a feeling that the answers are not simple, though if I were to guess I'd say that visual programming languages (VPLs) have challenges to solve in at least these areas:
- Productivity: Professional coders feel more productive in text based environments.
- Capability: There is a feeling of being overly limited or constrained when contrasted with a general purpose programming language.
- Complexity: It's challenging to keep visual complexity in graphs under control.
- Performance: There may be a perception that VPL based programs are substantially slower than those written in a general purpose language.
- Stigma: The perception that visual coding is for people who don't know how to code.
I should also note that Unreal Engine's Blueprint VPL has had major success applying visual programming to general games programming and that ultimately Houdini and Unreal have shown that visual programming can gain traction in reasonably large markets.
The NGenate Platform
NGenate is my response to these questions, and I've come up with an initial set of high level features that I'm going to aim toward and see what I can actually pull off. Admittedly, at this stage it's amusing, because I'm offering a product to people who may not even know they need or want it. And perhaps I'm being overly naive, because I'm hoping they will like it! But in the end, if I enjoy the journey and the software, then that's enough.
Dual workflow programming
To answer the Productivity and Capability challenges I'd like to create first class tooling for both coders and technical users in a kind of dual mode approach. Coders should feel productive, as they will be able to switch into a full featured programming mode that uses a capable, well established language to build up components / nodes for technical users. Conversely, technical users should feel very productive at the level of the VPL, setting up and configuring node graphs and expressions. Either type of user is free to switch between modes at any time, and I'd like to make this as seamless as possible.
While Rust does not yet have a large footprint in the Data Engineering community I've decided to utilize it as the first language that users can use to create new nodes. I may support other languages in future but for now Rust is a no-brainer to kick things off as it's the same language that I'm building the platform in and I'd love to use this as a way to crack open the door to Rust a bit more in the Data Engineering community.
Note: I'm not attempting round tripping here. What I've described is simply a workflow and compatibility between a separate high level node and expression based VPL and Rust as the lower level language. I would, however, like to support round tripping between the VPL editor and its serializable format.
Human readable / editable VPL serialization format
To further improve productivity and shareability of VPL programs I plan to design a simple serialization representation of the VPL graphs. The VPL environment / tooling should then facilitate seamless round trip switching between the serialized form and the VPL editor itself.
Inline node expression language
I'm also experimenting with a high level, functional style expression language to use directly inline within the nodes. This should bring more capability and productivity to technical users, so they won't need to dip down into the lower level, more capable language unless they really need to, or unless that's where their headspace works better. I think I first got the inspiration for this inline expression approach from the VPL in Microsoft Robotics Developer Studio and its expression node, though what I'm trying to do will hopefully be much more capable and productive to use / type.
Keyboard as first class citizen and Auto layout
A user should be able to perform most (ideally all) tasks within the VPL graph without needing to move a hand from the keyboard to the mouse. This means that tasks such as creating new nodes, selecting the node type, typing node expressions, choosing inputs (wiring) and updating node layouts should all be achievable without switching to the mouse.
The mouse will still have the expected support, and in fact I'd like to go one step further and make sure that pinch zoom and pan are supported for touch screens too. But as you've probably guessed, I'm a fan of tools like Vim (Helix is on my radar), and having first class keyboard support in a VPL editor could be a big productivity boost for a certain group of coders.
Compilation of VPL to Rust
This is more of a research question than a goal. I'd like to know if there is any merit to compiling the graph description into Rust code. The source code behind the lower level nodes will be compiled Rust code so any processing they do will already be efficient. However, there might be some gains in compiling the node graph to do things like setting concrete storage types for node sockets where otherwise the types of the sockets would need to be dynamically dispatched. I really don't know the answer here yet.
Cloud based distributed computation as a service
Users should have the ability to run their computations on bare metal configurations but ultimately I want computations to run and scale easily in the cloud without users worrying about infrastructure. I imagine that there is something that I can learn from tools like Rust Playground and Compiler Explorer here.
Support both realtime and batch processed compute needs
A realtime system, such as a live algorithmic trading system, has different needs and architectural implications than batch processing large volumes of data. I'd love to support both use cases, which means I might need to push hard on using async Rust in an unconventional way to handle both CPU bound and IO bound workloads. Andrew Lamb managed to adapt Tokio to CPU bound tasks by introducing a second, dedicated runtime instance.
Leverage existing research and Rust projects where possible
There are many smaller, well known crates that I've used while prototyping, but several are particularly relevant to my domain and I've yet to fully investigate them, so I'll list them here. I'm sure this list will grow over time as I discover more great work already out there in the Rust community that's useful for Data Engineering.
Crates and research that I'd like to learn from
- Timely-dataflow: An interesting and strongly research-backed approach to dataflow and distributed computing. I'd like to exercise a high level of control over all the pieces in play in my own compute graphs, so I may not be able to use this at first, but in future I'd like to entertain the idea of supporting it as a compute backend alternative to my own.
- Differential-dataflow: A differential spin on the timely-dataflow work by the same author.
Crates that I'd like to integrate with
- Apache Arrow DataFusion: DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
- Diesel: Diesel is the most productive way to interact with databases in Rust because of its safe and composable abstractions over queries.
The Roadmap
I'm still early in my journey but I've been busy conducting experiments and building out some early components and libraries as well as mocking up what the visual language might eventually look like.
I've built prototypes of the following to date:
- FlexStorage: Flexible data storage capabilities to meet the data input needs of a node based compute graph
- StorageTable: Allows for table / type mapped representations of storages as well as macros to create specific projections and return zipped up iterators over those projections
- NodeGraph: A data structure to represent nodes, edges and data input output sockets
- VisualEngine: A custom 2D/3D graphics engine currently written targeting WebGL but I plan to support other backends
The above prototypes are at varying levels of completeness. My next major goal is to progress those crates to a point that I can have an initial prototype of at least the visual and interactive aspects of the VPL functioning.
For the remainder of this post I'd like to share the visual design iterations for the VPL that I've come up with, as well as the work and source code that I've put into FlexStorage - an NGenate library to power the storage needs of compute graphs / VPLs.
VPL Design Iterations
Here is my first pass on what the actual VPL expression language and graph interface could look like. These are wireframes only at this stage. A functional prototype of this (without real compute) is one of my next goals.
You will notice that as we move from version 1 to 3, the visual complexity decreases. Ideally I would like the VPL to end up looking more like Version 3 and less like Version 1. Though as I continue to prototype I know that I'll have a constant battle to find ways of coaxing visual coders toward graphs that convey an ideal level of information at a glance. The challenge is that everyone's definition of an ideal level of information varies but I'm sure this is a very similar challenge to what we face every day as Rust developers when deciding on how small to break up our functions. If you have a rule of thumb that helps you with Rust code I'd love to hear about it.
I've chosen to mock up a live algorithmic trading system in these examples, but I envisage this as a general purpose tool that can apply to all kinds of dataflow and Data Engineering domains.
Version 1
Too much visual complexity and treading too closely into being "just source code again".
Version 2
Better, no statement level code visible. This might be a candidate for the right level of information or potentially a state to which the graph can be expanded if needed.
Version 3
I love the simplicity of this and think it could make a good default level of visual detail. If a user wants to get to the Version 2's level of detail, they could expand out the "Execute Trade" node to see sub nodes within. Time and prototypes will tell if this version is too idealistic / simplistic or not.
Flex Storage
While I still haven't worked out exactly which parts of NGenate to open source, I do want to share where I can, so to start with I'd like to share my experimental Storage library called Flex Storage.
The standard library and most collection crates don't have a strong focus on implementing a rich set of traits that are shared across their types, and typically include little infrastructure to support a dynamic dispatch oriented workflow.
Flex Storage empowers APIs to be more abstract over data by focusing on shared traits and flexible casting. It provides storage handles that point to either concrete storage types or trait objects, depending on whether static or dynamic dispatch is needed, along with casting infrastructure to perform inter trait-object casts, trait-object to concrete type casts, and unsizing from types to trait objects.
The library was made to support a primary use case of dataflow processing within a node based visual programming environment where the intention is to be able to use storage types as inputs for processing which can be switched at runtime with interchangeable storage handles. Such a high degree of runtime flexibility comes with some added API complexity as well as performance considerations, so consider a simpler static dispatch focused workflow if most of your storage design can be determined at compile time.
This is just a GitHub repo for now, as it will be subject to a fair amount of design churn, but feel free to jump in and have a look. If it's something you find interesting, let me know and I'll consider publishing it as a crate for the broader community once it stabilizes.
Features
- Support for both dynamic and static dispatch though dynamic dispatch workflows have had more work.
- Flexible casting between any type or trait object within the Storage trait family.
- Can be used to hold multiple handles to the same storage where each pointer can represent the storage as a different trait object or concrete type to fit the use case. This is ideal for graph based data processing.
- Primary use case is multithreaded, so all storage types and handles are Send + Sync, and StorageHandles use Arc<RwLock<T>> internally.
- NIGHTLY + UNSAFE: The library uses a single unsafe statement to perform a cast involving an Arc<RwLock<T>>, and uses the nightly only ptr_metadata feature to help promote safety in this unsafe cast.
* Warning: This section may contain depictions of Rust in UML - Viewer discretion advised *
Example
The following demonstrates how it's possible to easily cast between different storage trait objects with the help of StorageHandles, and how a storage type's items can be accessed through those different trait objects using dynamic dispatch - or static dispatch if needed.
use ngenate_flex_storage::{
    storage_handle::{handle, StorageHandle},
    storage_traits::{ItemSliceStorage, KeyItemStorage, Storage},
    storage_types::VecStorage,
};

fn main()
{
    // Create a concrete storage type
    let storage: VecStorage<usize, i32> = VecStorage::new_from_iter(vec![1, 2, 3]);

    // Put it into a handle to a dyn Storage trait object - the root trait of all storage traits.
    let storage_handle: StorageHandle<dyn Storage> = handle::builder(storage).build();

    // The handle can be used to freely cast between any of the supported supertraits
    // of the Storage trait via the handle's cast methods.

    // Inter trait object cast to a handle to a dyn ItemSliceStorage
    let slice_handle: StorageHandle<dyn ItemSliceStorage<Item = i32>> = storage_handle
        .cast_to_slice_storage::<usize, i32>()
        .unwrap();

    println!("Access the ItemSliceStorage trait object's items");
    {
        let guard = slice_handle.try_read().unwrap();
        let slice: &[i32] = guard.as_item_slice();
        dbg!(slice[0]);
        dbg!(slice[1]);
    }

    // Inter trait object cast to a handle to a dyn KeyItemStorage
    let key_item_handle: StorageHandle<dyn KeyItemStorage<Key = usize, Item = i32>> =
        slice_handle.cast_to_getitem_storage().unwrap();

    println!("Access the KeyItemStorage trait object's items");
    {
        // Get a guard to the dyn KeyItemStorage StorageHandle
        let guard = key_item_handle.try_read().unwrap();
        dbg!(guard.get(0).unwrap());
        dbg!(guard.get(1).unwrap());
    }

    // It's possible to switch back to a static dispatch handle at any time.
    println!("Iterate using static dispatch");
    {
        // Cast to a concrete handle
        let storage_ptr: StorageHandle<VecStorage<usize, i32>> =
            key_item_handle.cast_to_sized_storage().unwrap();

        let guard = storage_ptr.try_read().unwrap();

        let mut sum: i32 = 0;
        for i in guard.into_iter()
        {
            sum += i;
        }
        dbg!(sum);
    }
}
The workflow has a focus on dynamic dispatch, but I do have some ideas to facilitate user computations that automatically downcast to the correct concrete type before applying some function to each item in a storage. Ultimately I want to get more of the best of both worlds between static and dynamic dispatch, but I've got a lot of experimenting to do here.
After making this library I found another crate called anyinput that has come at the challenge of producing a more uniform interface for storage types from a strong static dispatch angle, which looks really interesting too.
Conclusion
I've still got a long road ahead, but this is roughly what I've been up to with NGenate. If you're interested in, or working on, any of these topics and want to expand your network, say hi: [email protected] - I'd love to connect and make more friends within this space.