Update: Thanks to @pitermach for a great demo showing that it's actually Mist World upsampling to 48 kHz in this demo, not NVDA downsampling to 16 kHz!
I stitched together an audio file showing how badly it ignores the setting of -1 as the output. Instead, #NVDASR tries to be too smart: it enumerates the device list, works out which one you have set as your sound mapper output, and explicitly calls that sound device when passing audio to the TTS outputs.
I updated this to add a little more at the end and show how Mist World treats audio output switching, which I thought was the proper way but now know is not.
Good night, Mastodon. This really ruined my weekend at first, until that amazing demo in my mentions by @pitermach clarified things. :)
Update: People are asking, "how can I tell?" Listen for the sharpness of S's and other consonants. If you have the ear you'll notice.

[Audio attachment, 01:20]

@Tamasg What you're hearing isn't actually downsampling to 16 kHz, it's aliasing artifacts introduced by whatever resampling algorithm Mist World's audio library is using. Vocalizer actually runs at a native 22 kHz as far as I know. I recorded a quick demo of what it sounds like when you bring a 22 kHz file to 48 kHz with a low-quality resampling algorithm versus a file that's actually at 16 kHz.

[Audio attachment, 04:25]
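A rough way to see the effect described above, as a minimal sketch and not Mist World's or any real audio library's algorithm: take a test tone sampled at 22050 Hz, bring it to 48000 Hz with two unfiltered resamplers, and measure how much energy shows up at a frequency the source could never contain. The function names, the 8000 Hz test tone, and the two resampling methods are all assumptions chosen for illustration. Strictly speaking the artifact produced when upsampling is a spectral image, which is often loosely called aliasing.

```python
import math

def sine(freq_hz, rate_hz, seconds):
    """Generate a unit-amplitude sine tone as a list of floats."""
    n = int(rate_hz * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def resample_nearest(samples, src_rate, dst_rate):
    """Crude resampling: repeat the most recent source sample, no filtering."""
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    return [samples[min(int(i * src_rate / dst_rate), last)] for i in range(n_out)]

def resample_linear(samples, src_rate, dst_rate):
    """Slightly better: linear interpolation between neighbouring samples."""
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        i0 = min(int(pos), last)
        i1 = min(i0 + 1, last)
        frac = pos - i0
        out.append(samples[i0] * (1 - frac) + samples[i1] * frac)
    return out

def tone_power(samples, freq_hz, rate_hz):
    """Power at a single frequency, measured with a one-bin DFT."""
    w = 2 * math.pi * freq_hz / rate_hz
    re = sum(s * math.cos(w * i) for i, s in enumerate(samples))
    im = sum(s * math.sin(w * i) for i, s in enumerate(samples))
    return (re * re + im * im) / len(samples) ** 2

SRC_RATE, DST_RATE = 22050, 48000
TONE_HZ = 8000                  # test tone, below the 11025 Hz source limit
IMAGE_HZ = SRC_RATE - TONE_HZ   # 14050 Hz: where the unfiltered image lands

source = sine(TONE_HZ, SRC_RATE, 0.05)
image_nearest = tone_power(resample_nearest(source, SRC_RATE, DST_RATE), IMAGE_HZ, DST_RATE)
image_linear = tone_power(resample_linear(source, SRC_RATE, DST_RATE), IMAGE_HZ, DST_RATE)

print(f"image power at {IMAGE_HZ} Hz, nearest: {image_nearest:.4f}")
print(f"image power at {IMAGE_HZ} Hz, linear:  {image_linear:.4f}")
```

The nearest-sample version leaves a strong component at 14050 Hz, above anything a 22050 Hz source can hold, and linear interpolation only reduces it. A high-quality resampler would filter that region out almost entirely, which is roughly the difference the two recordings illustrate.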

@Tamasg @pitermach and... if you want this aliasing, use ZDSR. I keep trying to argue with people that this aliasing is bad. No, they know better, because they have more experience, and stuff like that... If I find that I cannot listen to something for a long time, I stop using it.

@asael @pitermach yeah, I do think that the new R8Brain algo mentioned there does a much better job of keeping that higher, crisp quality in the voice without so much sharpness on the actual consonants, which to me feels like the best of both worlds. You get a lot higher quality to the ear but also don't get those weird artifacts the older ways of upsampling to 44 or 48 kHz introduce.
I don't think people who dislike it are wrong. For some minds especially, sharper noises like that in the audio can really stand out and become annoying or cause a headache.

@Tamasg @pitermach Because of this harsh aliasing, I start to lose concentration while listening to speech. I don't know, but RHVoice uses its original sample rates in NVDA or via SAPI. Another screen reader that doesn't handle sample rates well is JFW.

@asael @pitermach I wonder if we'll ever get a true TTS that's not 22050 Hz but a true 44.1 kHz sampling rate. Now I think I'm on the hunt for that. My guess is the newer AI voices might be the first of their kind this way. It's interesting because audio sampled at 22050 Hz only carries frequencies up to 11025 Hz. The Nyquist frequency (or Nyquist limit) is the highest frequency that a given sample rate can accurately represent, and it is half of the sample rate. The reason is how digital sampling works: you need more than two samples per cycle of a waveform to capture it unambiguously. This is the Nyquist-Shannon sampling theorem. So any TTS claiming 22050 Hz really only reaches 11025 Hz, and any TTS claiming 11025 Hz tops out at about 5.5 kHz, youch.
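The arithmetic above is easy to check. This is only a small sketch: `nyquist_hz` is a hypothetical helper name, and the list of rates is just the ones mentioned in this thread.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent: half the rate."""
    return sample_rate_hz / 2

# Sample rates that come up in this thread.
for rate in (11025, 16000, 22050, 44100, 48000):
    print(f"{rate} Hz sampling -> content up to {nyquist_hz(rate):.1f} Hz")
```

Note that the sample rate itself is still 22050 Hz. What halves is the usable audio bandwidth, which is why 22050 Hz voices lack the top end that a 44.1 kHz voice could carry.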

@Tamasg @asael We had one, for a very brief moment: Innoetics. But they got bought out by Samsung. I think there's also a Bulgarian TTS that runs at 44 kHz. I saw someone use it once but I'm not sure what it's called.

@pitermach @Tamasg @asael I have (had?) the Innoetics John voice. It's probably still installed on my old laptop. It was kind of odd and weirdly concatenated, but definitely had highs that nothing else did.

@BorrisInABox @pitermach @asael OMG. This was a thing? What year did that all get created? Feels like I've missed a major milestone in speech history. LOL. It may be that actual 44 kHz data is too large in size, so that's why even the human voices by companies such as Nuance stuck with 22050 Hz, as it's tolerable enough not to sound like AM radio but not high enough quality to sound like an FM signal could.

@BorrisInABox @Tamasg @pitermach but, when it comes to Innoetics, their Greek voices are a thing

@asael @BorrisInABox @Tamasg Believe it or not I still have it here, yay for a Windows install on its 9th year now. Fun fact, that John voice was created from the voice of Jon St. John, aka the guy who voiced Duke Nukem, sadly not doing the Duke voice in this case. But that single reason is why we used it for a very long time either for reading chats or games while streaming with @talon. Here's a quick demo. Not a super responsive voice, but yeah there's a lot of highs.

[Audio attachment, 01:14]

@pitermach @asael @BorrisInABox @talon wow yeah, that for sure has some lag, although I have heard worse. Sounds more like what you'd get on a Piper voice. But yeah, the quality in this is for sure what I like about the up-sampled Vocalizer, and there's none of that sharpness crap at the ends of certain letters people can notice, wow.

Zvonimir Stanecic

@Tamasg @pitermach @BorrisInABox @talon btw, there is something new about the upcoming NVDA. The Chinese dev who is developing the natural SAPI adapter has started making improvements to the speech protocols. He is going to add silence trimming for all TTS engines, making them more responsive.
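For anyone wondering what silence trimming means in practice, here is a minimal sketch of the general idea, not NVDA's or ZDSR's actual implementation: drop the quiet samples at the start of a synthesized chunk so that audible speech begins sooner. The function name, the float sample format, and the 0.01 threshold are assumptions for illustration.

```python
def trim_leading_silence(samples, threshold=0.01):
    """Return samples starting at the first one louder than the threshold.

    `samples` is assumed to be a list of floats in the range -1.0 to 1.0.
    """
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            return samples[i:]
    return []  # the whole chunk was silence

# Hypothetical chunk: two near-silent samples, then speech.
chunk = [0.0, 0.001, 0.5, 0.0, -0.3]
print(trim_leading_silence(chunk))
```

At 22050 Hz, every 2205 leading samples removed is 100 ms less delay before the voice starts, which is where the responsiveness gain would come from.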

@asael @Tamasg @BorrisInABox @talon Oh neat. That's definitely something that's giving ZDSR a decent performance edge over NVDA. With some SAPI voices it makes a very noticeable improvement.

@pitermach @Tamasg @BorrisInABox @talon and... silence removal will be there in NVDA, too. Planned for 2025.1.