Show HN: Sense - A New Cloud Platform for Data Science and Big Data Analytics (senseplatform.com)
64 points by tristanz on Jan 9, 2014 | hide | past | favorite | 36 comments


Sense cofounder here. We're just getting started, so feedback is welcome.

Sense supports R, Python, JavaScript and SQL out of the box, but is fully extensible to new languages and tools:

https://github.com/SensePlatform/sense-engine

We have Julia, Hive, and Spark engines in development.


Wow, this looks really nice. One thing that isn't clear is how I get the information out. So, for example, if I'm building a recommendation engine, is there some kind of API that my web app can use to get the recommendations? (Sorry, new to all this.)


Right now Sense is best for ad-hoc interactive analysis and batch/scheduled jobs. You can run long-running services that expose something like a REST endpoint, but we have plans to make exposing services much easier, so I'd probably hold off until we have an "official" solution.
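Absent an official solution, a long-running service of the kind mentioned above could be as small as a WSGI app. This is a hypothetical sketch, not Sense's API: the route, the `RECOMMENDATIONS` table, and the data are all made up for illustration.

```python
import json

# Hypothetical precomputed recommendations: user id -> item ids.
RECOMMENDATIONS = {"1": [42, 7, 19], "2": [3, 14]}

def app(environ, start_response):
    """Tiny WSGI app exposing GET /recommend/<user_id> as JSON."""
    path = environ.get("PATH_INFO", "")
    if path.startswith("/recommend/"):
        user_id = path.rsplit("/", 1)[-1]
        body = json.dumps(RECOMMENDATIONS.get(user_id, [])).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

# To actually serve it, any WSGI server works, e.g. the stdlib one:
# from wsgiref.simple_server import make_server
# make_server("", 8000, app).serve_forever()
```

Because it is a plain WSGI callable, the web app side just issues an HTTP GET and parses the JSON list it gets back.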


Looks great!

Do you have full support for numpy, scipy, matplotlib, pymc, and pandas?


Yes. I'll also point out that Anand (apatil) was a core developer for PyMC. We have big plans around Bayesian computation on Sense.


Thanks! Yes, we do, and you can install packages yourself and/or let us know if anything is missing.
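If you want to check what an engine already has before asking, this stdlib-only snippet (the package list shown is just the one from the question above) reports which packages can be imported:

```python
import importlib.util

def available(packages):
    """Map each top-level package name to whether it can be imported here."""
    return {name: importlib.util.find_spec(name) is not None
            for name in packages}

# e.g. available(["numpy", "scipy", "matplotlib", "pymc", "pandas"])
```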


Is it possible to move data around between languages / engines?

E.g. would it be possible to run a query against Redshift (step 1), then cleanse the result in Python (step 2), run R scripts over the Python output (step 3), then dump the results back to Redshift (step 4)?

If I then decide I need to change the Redshift query (step 1), can I re-run the whole pipeline?

Munging data between different tools is what I seem to spend most of my time doing, so anything that helps that would be a big productivity boost.


At the moment, not really. In your scenario, you could have a Python dashboard launch a Redshift dashboard whose startup code runs the initial query, clean the result in Python, then launch an R dashboard and pass it the cleaned data, either over the shared filesystem or through a messaging system such as 0MQ or Redis, and finally save the results to S3 for consumption by Redshift.
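The Python half of that handoff, cleaning rows and dropping the result on the shared filesystem for the R dashboard to pick up, can be sketched like this. The cleaning rule, file name, and use of the temp directory as a stand-in for the shared filesystem are all illustrative assumptions:

```python
import csv
import os
import tempfile

def clean_rows(rows):
    """Step 2: drop records with any missing field and strip whitespace."""
    return [
        [field.strip() for field in row]
        for row in rows
        if all(field and field.strip() for field in row)
    ]

def hand_off(rows, path):
    """Write cleaned rows as CSV for the downstream R dashboard to read."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# Demo with made-up query output; a real shared path would replace tempdir.
raw = [["alice", " 3.5 "], ["bob", ""], ["carol", "2.0"]]
shared = os.path.join(tempfile.gettempdir(), "clean_scores.csv")
hand_off(clean_rows(raw), shared)
```

On the R side, a plain `read.csv` on the same path would complete the transfer.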

You're probably looking for something smoother, though. We definitely intend to have a good solution for workflows like you describe in the future.


I think you may want to think about treating the analyses like an ETL pipeline (think dependent jobs in chronos or some such) using some intermediary (S3, whatevs). That would probably be useful to a lot of analysts.
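A minimal version of that dependent-jobs idea, with the four steps from the question above stubbed out, might look like the runner below. This is an illustrative sketch, not Chronos; the step names and the runner itself are made up:

```python
def run_pipeline(jobs, deps):
    """Run job callables in dependency order; return the execution order."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)  # make sure inputs exist before this step
        jobs[name]()
        done.add(name)
        order.append(name)

    for name in jobs:
        run(name)
    return order

# The four steps, stubbed; real jobs would read/write an intermediary (S3).
steps = {
    "query_redshift": lambda: None,
    "clean_python": lambda: None,
    "run_r_scripts": lambda: None,
    "load_redshift": lambda: None,
}
deps = {
    "clean_python": ["query_redshift"],
    "run_r_scripts": ["clean_python"],
    "load_redshift": ["run_r_scripts"],
}
```

Re-running the whole pipeline after changing step 1 is then just calling `run_pipeline(steps, deps)` again.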


Looks great! Is it really just you and Anand that put this together?


Yup. It's just the two of us. So far. If anybody wants to join: tristan@senseplatform.com.


Looks awesome, great with multi-engine support. I'd love it if you could open-source your two-pane approach to IPython... I really like IPython (and have been working with IHaskell lately), but I find the RStudio approach much better: having my code on the left, moving up and down and executing lines with Cmd+Enter, running entire cells (knitr Rmd style), seeing graphs and documentation while you're working, etc.


Glad you like it. The IPython engine isn't open source at the moment, but that may change in the future. Out of curiosity, if it were open source, what might you use it for?


For my own work, maybe help integrate it with IPython as an alternative front-end. I don't have a cloud project, I just do my own data analysis/learning with IPython and IHaskell, and think a multi-pane approach would be much more powerful. (I applied for an account with Sense, looking forward to playing with it and providing feedback).


Also see Domino Data Lab (http://www.dominodatalab.com) which is in public beta. Similar, with more emphasis on reproducibility of past results.


Very interesting! Nice work guys. Just sent you an email about academic research use case.

You don't mention anything about RAM in your pricing - what are the restrictions? And what about I/O and storage?


We're planning to charge per core for usage. Each core will be a true physical core, with 5.5 ECU. The container will get 3.75GB of RAM and a slice of the host's bandwidth per core. We'll also have a very inexpensive micro tier, and eventually some kind of long-lived services tier, as Tristan mentioned in another comment.


I forgot to mention storage. We're planning on about 1GB of disk per project. Because we live on AWS, we get quick access to all of Amazon's storage services as well.


Any plans for more RAM options? Something like the 244GB EC2 instances?


Not at the moment. Right now, the biggest single dashboard is 60GB. You can launch as many dashboards as your plan allows if your application can be distributed, but I'm guessing that isn't the case for you.


Hey Tristan, looks awesome. Great work!

This has some really interesting adjacencies to a project that we currently have in limited beta and are getting ready to roll out widely very soon. I'd love to chat about some ideas for working together that could work out really nicely for both of us. If you're interested, drop me a line: jason@applieddatalabs.com


Tristan, quick question here: who did all the coding? Do you still code? Does Anand code?

Since everyone on your team is very high-profile (Stanford, Harvard), I am wondering how much you have all kept to ground-level work after rising to such a level? Thanks for your answer.

PS: I am hopeful for Stanford MBA admission.


Anand and I built everything, including choosing the colors of the buttons. That's early-stage startup life.


If you think the platform is finished enough, maybe post it on Kaggle (http://www.kaggle.com/); there are many potential users for this app, IMHO.


Looks nice.

How well does the distributed filesystem perform, and what size data sets can it handle?

How quickly can you ramp up 10, 100 or 1000 cores?

Improved performance in these areas is the big thing that would get our group to adopt a new platform.


Scaling up to 10, 100, or 1000 cores is fast (3s per engine, launched in parallel). However, something like 1000 cores would currently require spinning up new instances (1-2 minutes) if deployed in the cloud.

The distributed filesystem is meant for easily sharing code and medium-sized data across containers. In the cloud, it is best to use S3 directly for large data and local disks for high-IO tasks. For on-premises deployments, there are more options.

I'd be interested to hear about your use case. Feel free to drop me a line at tristan@senseplatform.com


A little off topic, but nice to see you're using Angular.


Looks a lot like GitHub. Were you inspired by it?


Yes, we're fans of how sharing and collaboration works on GitHub. The goal is to make Sense the center of gravity for data scientists the way GitHub is for developers.

We're not trying to replicate GitHub's features. The core of Sense is a better way to work with data: the compute infrastructure, engines, and analytics workflow. Advanced git users will likely use GitHub in addition to Sense.


Tried to sign up, but I need an invitation code.


You currently need an invitation code to register. We're giving these out slowly to make sure everything works smoothly.


Typo : "Distributed POSIX complaint project file system"


Thanks.


SageMath Cloud has open signup:

http://cloud.sagemath.com


Is IPython notebook on the roadmap?


The Python engine is IPython under the hood. Any code or visualizations that work in IPython notebooks should work in Sense.

There is a difference, though. In our experience, notebook-style development, with code inline, is awkward for serious analytics: it's harder to use version control, editors, etc. We have opted for the dual-pane experience common in R and MATLAB. The output, however, can be rich and interactive just like an IPython notebook, and it is always saved.



