Wow, this looks really nice. One thing that isn't clear is how I get the information out. So, for example, if I'm building a recommendation engine, is there some kind of API that my webapp can use to get the information? (Sorry, new to all this.)
Right now Sense is best for ad-hoc interactive analysis and batch/scheduled jobs. You can run long-running services that expose something like a REST endpoint, but we have plans to make exposing services much easier, so I'd probably hold off until we have an "official" solution.
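A long-running service like the one described above could be as simple as a small HTTP endpoint the webapp polls. Here's a minimal stdlib sketch; the recommendation table and URL scheme are invented for illustration, not anything Sense provides:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up recommendation table; a real engine would compute this.
RECS = {"alice": ["book-1", "book-7"], "bob": ["book-3"]}

class RecHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A path like /recommend/alice returns recommendations for "alice".
        user = self.path.rsplit("/", 1)[-1]
        body = json.dumps(RECS.get(user, [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet

server = HTTPServer(("127.0.0.1", 0), RecHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
```

The webapp would then just GET `http://host:port/recommend/<user>` and parse the JSON response.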
Is it possible to move data around between languages / engines?
E.g. would it be possible to run a query against Redshift (step 1), then cleanse the result in Python (step 2), run R scripts over the Python output (step 3), then dump the results back to Redshift (step 4)?
If I then decide I need to change the Redshift query (step 1), can I re-run the whole pipeline?
Munging data between different tools is what I seem to spend most of my time doing, so anything that helps that would be a big productivity boost.
At the moment, not really. In your scenario, you could have a Python dashboard launch a Redshift dashboard with startup code that runs the initial query, clean the result in Python, launch an R dashboard and pass the clean data to it over the shared filesystem or a messaging system such as 0MQ or Redis, and finally save the results to S3 for consumption by Redshift.
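The shared-filesystem hand-off above can be sketched roughly like this (a toy illustration: the rows, column names, and file path are all made up, and the hard-coded list stands in for a real query result):

```python
import json
import os
import tempfile

# Step 2 (Python dashboard): clean the raw query result.
raw_rows = [
    {"user_id": "1", "score": " 0.90 "},
    {"user_id": "2", "score": ""},        # missing value, will be dropped
    {"user_id": "3", "score": "0.75"},
]

def clean(rows):
    """Drop rows with missing scores and normalize types."""
    out = []
    for row in rows:
        score = row["score"].strip()
        if score:
            out.append({"user_id": int(row["user_id"]), "score": float(score)})
    return out

clean_rows = clean(raw_rows)

# Hand off via the shared filesystem: write JSON that the R dashboard
# can pick up (e.g. with jsonlite::fromJSON on the R side).
handoff_path = os.path.join(tempfile.gettempdir(), "clean_rows.json")
with open(handoff_path, "w") as f:
    json.dump(clean_rows, f)
```

The messaging-system variant would be the same shape, with the JSON pushed to a Redis key or 0MQ socket instead of a file.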
You're probably looking for something smoother, though. We definitely intend to have a good solution for workflows like you describe in the future.
I think you may want to think about treating the analyses like an ETL pipeline (think dependent jobs in Chronos or some such) using some intermediary (S3, whatevs). That would probably be useful to a lot of analysts.
Looks awesome, and the multi-engine support is great. I'd love it if you could open-source your two-pane approach for IPython... I really like IPython (and have been working with IHaskell lately), but I find the RStudio approach much better: having my code on the left, moving up and down and executing lines with Cmd+Enter, running entire cells (knitr Rmd style), seeing graphs and documentation while you're working, etc.
Glad you like it. The IPython engine isn't open source at the moment, but that may change in the future. Out of curiosity, if it were open source, what might you use it for?
For my own work, maybe help integrate it with IPython as an alternative front-end. I don't have a cloud project, I just do my own data analysis/learning with IPython and IHaskell, and think a multi-pane approach would be much more powerful. (I applied for an account with Sense, looking forward to playing with it and providing feedback).
We're planning to charge per core for usage. Each core will be a true physical core, with 5.5 ECU. The container will get 3.75GB of RAM and a slice of the host's bandwidth per core. We'll also have a very inexpensive micro tier, and eventually some kind of long-lived services tier, as Tristan mentioned in another comment.
I forgot to mention storage. We're planning on about 1GB of disk per project. Because we live on AWS, we get quick access to all of Amazon's storage services as well.
Not at the moment. Right now, the biggest single dashboard is 60GB. You can launch as many dashboards as your plan allows if your application can be distributed, but I'm guessing that isn't the case for you.
This has some really interesting adjacencies to a project that we currently have in limited beta and are getting ready to roll out widely very soon. I'd love to chat about some ideas I have for working together that could work out really nicely for both of us. If you're interested, drop me a line: jason@applieddatalabs.com
Tristan, quick question here: who did all the coding? Do you still code? Does Anand code?
Since everyone on your team is very high profile (Stanford, Harvard), I'm wondering how much you've all stayed close to the ground-level work after rising to that level. Thanks for your answer.
Scaling up from 10 to 100 to 1,000 cores is fast (about 3 seconds per engine, launched in parallel). However, something like 1,000 cores would currently require spinning up new instances (1-2 minutes) if deployed in the cloud.
The distributed filesystem is meant for easily sharing code and medium-sized data across containers. In the cloud, it's best to use S3 directly for large data and local disks for high-IO tasks. For on-premises deployments, there are more options.
I'd be interested to hear about your use case. Feel free to drop me a line at tristan@senseplatform.com
Yes, we're fans of how sharing and collaboration works on GitHub. The goal is to make Sense the center of gravity for data scientists the way GitHub is for developers.
We're not trying to replicate GitHub's features. The core of Sense is a better way to work with data: the compute infrastructure, engines, and analytics workflow. Advanced users who use git will likely use GitHub in addition to Sense.
The Python engine is IPython under the hood. Any code or visualizations that work in IPython notebooks should work in Sense.
There is a difference, though. In our experience, notebook-style development, with code inline, is awkward for serious analytics: it's harder to use version control, editors, etc. We have opted for the dual-pane experience common in R and MATLAB. The output, however, can be rich and interactive just like an IPython notebook, and it is always saved.
Sense supports R, Python, JavaScript and SQL out of the box, but is fully extensible to new languages and tools:
https://github.com/SensePlatform/sense-engine
We have Julia, Hive, and Spark engines in development.