Hacker News Clone

Hacker News Clone new | comments | show | ask | jobs | submit | github repo

		Show HN: Gemini LLM corrects ASR YouTube transcripts (ldenoue.github.io)
		37 points by ldenoue 2 hours ago \| hide \| past \| web \| 25 comments \| favorite

wood_spirit 1 hour ago [-]

As an aside, has anyone else had some big hallucinations with the Gemini meet summaries? Have been using it a week or so and loving the quality of the grammar of the summary etc, but noticed two recurring problems: omitting what was actually the most important point raised, and hallucinating things like “person x suggested y do z” when, really, that is absolutely the last thing x would really suggest!

hunter2_ 1 hour ago [-]

It can simultaneously be [the last thing x would suggest] and [a conclusion that an uninvolved person tasked with summarizing might mistakenly draw, with slightly higher probability of making this mistake than not making it] and theoretically an LLM attempts to output the latter. The same exact principle applies to missing the most important point.

leetharris 1 hour ago [-]

The Google ASR is one of the worst on the internet. We run benchmarks of the entire industry regularly and the only hyperscaler with a good ASR is Azure. They acquired Nuance for $20b a while ago and they have a solid lead in the cloud space.

And to run it on a "free" product they probably use a very tiny, heavily quantized version of their already weak ASR.

There's lots and lots of better meeting bots if you don't mind paying or have low usage that works for a free tier. At Rev we give away something like 300 minutes a month.

baxtr 31 minutes ago [-]

Very interesting. Thanks for sharing.

Since you have experience in this, I’d like to hear your thoughts on a common assumption.

It goes like this: don’t build anything that would be feature for a Hyperscalar because ultimately they win.

I guess a lot of it is a question of timing?

tombh 43 minutes ago [-]

ASR: Automatic Speech Recognition

sorenjan 18 minutes ago [-]

Using an LLM to correct text is a good idea, but the text transcript doesn't have information about how confident the speech to text conversion is. Whisper can output confidence for each word, this would probably make for a better pipeline. It would surprise me if Google doesn't do something like this soon, although maybe a good speech to text model is too computationally expensive for Youtube at the moment.

jazzyjackson 1 hour ago [-]

Thinking about that time Berkeley delisted thousands of recordings of course content as a result of a lawsuit complaining that they could not be utilized by deaf individuals. Can this be resolved with current technology? Google's auto captioning has been abysmal up to this point, I've often wondered what the cost would be for google to run modern tech over the entire backlog of youtube. At least then they might have a new source of training data.

https://news.berkeley.edu/2017/02/24/faq-on-legacy-public-co...

Discussed at the time (2017) https://news.ycombinator.com/item?id=13768856

andai 1 hour ago [-]

Didn't YouTube have auto-captions at the time this was discussed? Yeah they're a bit dodgy but I often watch videos in public with sound muted and 90% of the time you can guess what word it was meant to be from context. (And indeed more recent models do way, way, way better on accuracy.)

jonas21 33 minutes ago [-]

Yes, but the DOJ determined that the auto-generated captions were "inaccurate and incomplete, making the content inaccessible to individuals with hearing disabilities." [1]

If the automatically-generated captions are now of a similar quality as human-generated ones, then that changes things.

[1] https://news.berkeley.edu/wp-content/uploads/2016/09/2016-08...

zehaeva 58 minutes ago [-]

I have a few Deaf/Hard of Hearing friends who find the auto-captions to be basically useless.

Anything that's even remotely domain specific becomes a garbled mess. Even watching documentaries about light engineering/archeology/history subjects are hilariously bad. Names of historical places and people are randomly correct and almost always never consistent.

The second anyone has a bit of an accent then it's completely useless.

I keep them on partially because I'm of the "everything needs to have subtitles else I can't hear the words they're saying" cohort. So I can figure out what they really mean, but if you couldn't hear anything I can see it being hugely distracting/distressing/confusing/frustrating.

jazzyjackson 1 hour ago [-]

Definitely depends on audio quality and how closely a speaker's dialect matches the mid-atlantic accent, if you catch my drift.

IME youtube transcripts are completely devoid of meaningful information, especially when domain-specific vocabulary is used.

delusional 1 hour ago [-]

That's a legal issue. If humans wanted that content to be up, we just could have agreed to keep it up. Legal issues don't get solved by technology.

jazzyjackson 1 hour ago [-]

Well. The legal complaint was that transcripts don't exist. The issue was that it was prohibitively expensive to resolve the complaint. Now that transcription is 0.1% of the cost it was 8 years ago, maybe the complaint could have been resolved.

Is building a ramp to meet ADA requirements not using technology to solve a legal issue?

delusional 1 hour ago [-]

Nowhere on the linked page at least does it say that it was due to cost. It would seem more likely to me that it was a question of nobody wanting to bother standing up for the videos. If nobody wants to take the fight, the default judgement becomes to take it down.

Building a ramp solves a problem. Pointing at a ramp 5 blocks away 7 years later and asking "doesn't this solve this issue" doesn't.

pests 28 minutes ago [-]

Yet this feels very harrison bergeron to me. To handicap those with ability so we all can be at the same level.

alsetmusic 2 hours ago [-]

Seems like one of the places where LLMs make a lot of sense. I see some boneheaded transcriptions in videos pretty regularly. Comparing them against "more-likely" words or phrases seems like an ideal use case.

leetharris 1 hour ago [-]

A few problems with this approach:

1. It brings everything back to the "average." Any outliers get discarded. For example, someone who is a circus performer plays fetch with their frog. An LLM would think this is an obvious error and correct it to "dog."

2. LLMs want to format everything as internet text which does not align well to natural human speech.

3. Hallucinations still happen at scale, regardless of model quality.

We've done a lot of experiments on this at Rev and it's still useful for the right scenario, but not as reliable as you may think.

devmor 1 hour ago [-]

Those transcriptions are already done by LLMs in the first place - in fact, audio transcription was one of the very first large scale commercial uses of the technology in its current iteration.

This is just like playing a game of markov telephone where the step in OP's solution is likely higher compute cost than the step YT uses, because YT is interested in minimizing costs.

petesergeant 1 hour ago [-]

Also useful I think for checking human-entered transcriptions, which even on expensively produced shows, can often be garbage or just wrong. One human + two separate LLMs, and something to tie-break, and we could possibly finally get decent subtitles for stuff.

icelancer 48 minutes ago [-]

Nice use of an LLM - we use Groq 70b models for this in our pipelines at work. (After using WhisperX ASR on meeting files and such)

One of the better reasons to use Cerebras/Groq that I've found so you can return huge amounts of clean text back fast for processing in other ways.

leetharris 1 hour ago [-]

The main challenge with using LLMs pretrained on internet text for transcript correction is that you reduce verbatimicity due to the nature of an LLM wanting to format every transcript as internet text.

Talking has a lot of nuances to it. Just try to read a Donald Trump transcript. A professional author would never write a book's dialogue like that.

Using a generic LLM on transcripts almost always reduces accuracy as a whole. We have endless benchmark data to demonstrate this at RevAI. It does, however, help with custom vocabulary, rare words, proper nouns, and some people prefer the "readability" of an LLM-formatted transcript. It will read more like a wikipedia page or a book as opposed to the true nature of a transcript, which can be ugly, messy, and hard to parse at times.

dr_dshiv 2 hours ago [-]

The first time I used Gemini, I gave it a youtube link and asked for a transcript. It told me how I could transcribe it myself. Honestly, I haven't used it since. Was that unfair of me?

robrenaud 1 hour ago [-]

Gemini is much worse as a product than 4o or Claude. I recommend using it from Google AI studio rather than the official consumer facing interface. But for tasks with large audio/visual input, it's better than 4o or Claude.

Whether you want to deal with it being annoying is your call.

andai 1 hour ago [-]

GPT told me the same thing when I asked it to make an API call, or do an image search, or download a transcript of a YouTube video, or...

Spooky23 1 hour ago [-]

The consumer Gemini is very prudish and optimized against risk to Google.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact