Hacker News | Rutledge's comments

Scorecard AI | Founding Engineer | San Francisco (Onsite) | Full-Time | https://scorecard.io

Scorecard builds simulation environments and reward models that frontier AI labs and enterprises use to train and evaluate their agents. Same discipline that made self-driving cars work, applied to LLM-based agents.

Our founding team built simulation infrastructure at Waymo, SpaceX, and Uber ATG. I'm Dare, the CEO; I scaled Waymo's simulation org to 200+ engineers. We're ~7 people, 7-figure revenue, $3.75M seed led by Kindred Ventures with angels from OpenAI, Apple, Waymo, Uber, Perplexity, and Meta.

Founding Engineer ($175K-$280K + strong equity). Primarily backend. You'll own technical domains end-to-end. We use agentic tooling (Claude Code) heavily. Moving fast and working with customers matter more than deep expertise in any one language.

Stack: TypeScript, Node.js, Next.js, PostgreSQL, ClickHouse, Temporal, GCP.

jobs@scorecard.io


Scorecard | Founding Engineers & GTM | San Francisco | ONSITE | Full-time | scorecard.io

Scorecard is the simulation platform for self-improving AI agents. We help teams encode expert judgment into reward models and run 10,000s of scenarios in minutes instead of reviewing 10s of production cases over weeks.

Our team built simulation systems at Waymo, Uber ATG, and SpaceX. Same discipline, applied to AI agents. Backed by Kindred Ventures and Neo, with multi-billion dollar customers.

Founding Software Engineer ($175k-$250k + equity)

Build infrastructure for large-scale agent simulation, reward model pipelines, and scenario generation. Ship fast with a low-ego team.

Stack: TypeScript, React, Node.js, Postgres. Bonus: LLM/RL experience, founder background.

https://www.scorecard.io/careers/software-engineer

Founding GTM Lead

Own demos, sales playbook, and positioning for AI teams building frontier agents. 3+ years early-stage sales/product marketing. AI/developer tools experience a plus.

https://www.scorecard.io/careers/founding-gtm

In-person in SoMa. Full benefits, daily lunch, unlimited PTO.

Apply: jobs [at] scorecard.io (mention HN)


Scorecard | Founding Engineer, Founding UX Designer, Founding GTM | SF, CA ONSITE | Full-time

Scorecard is building the leading platform for testing, evaluating, and monitoring AI applications. We help teams ship reliable AI products faster—from prototype to production. Our customers include developers and enterprises building with LLMs who need confidence their AI agents perform as expected.

We recently raised $3.75M in seed funding from Kindred Ventures, Neo, and angels from OpenAI, Google, and Meta: https://www.businessinsider.com/scorecard-raises-millions-ki...

See how we're helping enterprises like Thomson Reuters ensure their AI agents are production-ready: https://www.thomsonreuters.com/en-us/posts/innovation/from-t...

Tech Stack: Full TypeScript w/Next.js, Express, React, PostgreSQL, and agents like Claude Code/Gemini review.

We're an early-stage, fast-growing team tackling the most pressing problems in the AI reliability space. If you're excited about being a founding team member at a company defining how the industry evaluates and optimizes AI systems, we'd love to hear from you.

Open Roles:

- Founding Software Engineer: Build the core platform that helps developers test and evaluate AI agents at scale
- Founding UX Designer: Design intuitive experiences that make complex AI evaluation accessible to all developers
- Founding GTM: Help define and execute our go-to-market strategy as we scale with customers

Learn more and apply: jobs@scorecard.io w/ subject 'HN'


I call them 'CLI agents'!


Here's the image from Wayback: https://web.archive.org/web/20250625051706/https://blog.goog...

The biggest diffs from Claude Code (the current champion):
1. Generous free tier (60 RPM!)
2. Open source, Apache-licensed (standard after OAI Codex did the same)


Hi HN, we're excited to launch the first remote MCP server for LLM evaluation, for use with claude.ai and Cursor. Would love your thoughts and feedback :)




This initiative is designed to be community-driven, so we're looking forward to your feedback on what agent benchmarking needs exist in your domains. While we're starting with legal AI, we plan to expand to other industries that need benchmarks for AI agent evaluation.


The concurrent request handling seems great for our AI eval workloads, where we're mostly waiting on LLM API calls and DB operations, but I'm curious how Vercel handles potential noisy-neighbor issues when one request consumes excessive CPU or memory.

Disclosure: CEO of Scorecard (AI eval platform), current Vercel customer. Intrigued, since most of our serverless time is spent waiting for model responses, but cautious about 'magic' solutions.


We built Fluid with noisy neighbors (= requests to the same instance) in mind. Because we are a data-driven team, we:

1. track metrics and have our own dashboards, so we proactively understand and act whenever something like that happens
2. also use these metrics in our routing to know when to scale up. We have tested a lot of variations of the metrics we gather, and things are looking good

anyway, the more workload types we host on this system, the more we learn and the better and more performant it gets. we've been running this for a while now, and it shows great results.

there's no magic, just data coming from a complex system, fed into a fairly complex system!

hope that answers the question, and thanks for trusting us


So if I understood 1. correctly, I could use this solution to potentially save money, but it could turn into a nightmare very quickly if you guys aren't watching?


Yes, quite helpful. Thanks for explaining, and I will try it out!


I think the majority of Vercel customers are doing website hosting, and most web requests are IO-bound, so it makes sense to handle multiple requests per microVM.

Can't say the same if the customer is running a CPU-bound workload.
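
A minimal sketch of the IO-bound vs CPU-bound point above, assuming a single-threaded Node.js process that receives several requests at once. Every name here (`ioBoundRequest`, `cpuBoundRequest`, `measureOverlap`) is hypothetical, and nothing in it is a Vercel or Fluid API; it only illustrates why waiting on IO lets requests share an instance while a CPU-heavy handler does not.

```typescript
// Hypothetical illustration only: not Vercel's implementation or API.
// Counts how many simulated requests are "in flight" at the same time
// inside one single-threaded Node.js process.

let inFlight = 0;
let maxInFlight = 0;

function enter(): void {
  inFlight += 1;
  maxInFlight = Math.max(maxInFlight, inFlight);
}

function leave(): void {
  inFlight -= 1;
}

// IO-bound request: stands in for waiting on an LLM API call or a DB query.
// While it awaits, the event loop is free to start other requests.
async function ioBoundRequest(waitMs: number): Promise<void> {
  enter();
  await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
  leave();
}

// CPU-bound request: a synchronous loop that never yields to the event
// loop, so no other request can start until it has finished.
async function cpuBoundRequest(iterations: number): Promise<number> {
  enter();
  let acc = 0;
  for (let i = 0; i < iterations; i++) {
    acc += i % 7;
  }
  leave();
  return acc;
}

// Starts `count` requests together and reports the peak number that
// overlapped on this one process.
async function measureOverlap(
  makeRequest: () => Promise<unknown>,
  count: number
): Promise<number> {
  inFlight = 0;
  maxInFlight = 0;
  await Promise.all(Array.from({ length: count }, () => makeRequest()));
  return maxInFlight;
}
```

Calling `measureOverlap(() => ioBoundRequest(10), 5)` reports all five requests overlapping, while `measureOverlap(() => cpuBoundRequest(1000), 5)` reports only one at a time, which is the case where packing more requests onto the same instance stops helping.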


This is great :) and pretty impressive that it was possible in Coda!

