Tech Info 77

📚 **Welcome to Dharia Knowledge** – Where *deep insights* meet *curiosity*! We offer a broad range of knowledge to keep your curiosity alive!

We provide informative videos about science and technology. "Science and technology revolutionize our lives, but memory, tradition, and myth frame our response." We're here to serve you fascinating scoops of knowledge across a variety of exciting and popular niches. Whether you're passionate about business, space, human science, or history, Dharia Knowledge has something for every curious mind.

💼 **Business & Finance**: Explore the world of **entrepreneurship**, **investing**, and **cryptocurrency**. Learn practical strategies in **personal finance** and discover the secrets to financial success.

🌌 **Space & Astronomy**: Take a journey through the **cosmos** with mind-blowing discoveries about **black holes**, **NASA missions**, and the ever-expanding universe. Learn how space exploration is shaping the future.

🧠 **Human Knowledge & Science**: Dive deep into topics like **neuroscience**, **psychology**, and **artificial intelligence**. From **quantum computing** to the **human brain**, we cover the cutting-edge advancements in science and technology.

🌍 **History & Culture**: Travel through time with stories of **ancient civilizations**, **world history**, and the evolution of **human culture**. Discover the historical events and figures that have shaped our modern world.

🔧 **Technology & Innovation**: Stay up to date with the latest in **AI**, **robotics**, **blockchain**, and **futuristic technology**. We break down the complex world of innovation into fun, digestible content.

🌱 **Health, Wellness & Environment**: Learn about **mental health**, **fitness**, **nutrition**, and **sustainable living**. Stay informed about the latest breakthroughs in health and environmental science.

🎬 **Pop Culture, Philosophy & Beyond**: Expand your understanding of **modern philosophy**, **psychological insights**, and **popular culture**.

At **Dharia Knowledge**, we believe learning should be enriching, exciting, and accessible to everyone. Join us and dive into the depths of knowledge, one topic at a time. With every video, we aim to make learning a delightful and delicious experience, just like enjoying your favorite ice cream cone on a hot day. 🍧

10/14/2024
Is it feasible to teach an AI just using data produced by another AI? It may seem like a foolish notion. However, it is ...
10/13/2024

Is it feasible to train an AI using only data produced by another AI? It may seem like a foolish notion. However, the idea has been around for a while, and it is gaining traction as fresh, authentic data becomes harder to come by.

Anthropic used some synthetic data to train Claude 3.5 Sonnet, one of its flagship models. Meta refined its Llama 3.1 models using AI-generated data. And OpenAI is reported to be using o1, its "reasoning" model, to produce synthetic training data for the upcoming Orion.

But why does AI need data in the first place, and what kind of data does it need? Can synthetic data really replace it?

Annotations' significance for AI systems

AI systems are statistical machines. Trained on a large number of examples, they learn the patterns in those examples and use them to make predictions, such as that "to whom" in an email usually comes before "it may concern."
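
To make the "to whom" example concrete, here is a minimal sketch of pattern learning by counting which word follows which. It is a toy bigram counter, not how any production model works; the corpus and the function names `train_bigrams` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus; a real system sees vastly more examples.
corpus = [
    "to whom it may concern",
    "to whom it may concern",
    "to whom should I send this",
]
model = train_bigrams(corpus)
print(predict_next(model, "whom"))  # it
```

The prediction is nothing more than the most common continuation in the examples, which is why the quantity and quality of those examples matter so much.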

Annotations, typically text labels that describe the contents or meaning of the data these systems consume, are a key part of these examples. They act as guideposts, "teaching" a model to distinguish among objects, places, and concepts.

Consider a photo-classifying model that is shown many images of kitchens labeled with the word "kitchen." As it trains, the model starts to associate "kitchen" with general features of kitchens (e.g. that they have fridges and counters). After training, when shown a picture of a kitchen that wasn't in the original examples, the model should be able to identify it as such. (Of course, if the kitchen photos had been labeled "cow," the model would identify them as cows, which highlights the importance of accurate annotation.)
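
The kitchen and "cow" point can be sketched with a tiny nearest-centroid classifier. This is only an illustration under invented assumptions: the three binary features and all of the data are hypothetical, and real photo classifiers learn their features from pixels.

```python
from collections import defaultdict

def train_centroids(examples):
    """Average the feature vectors seen under each label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def classify(centroids, features):
    """Pick the label whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=distance)

# Hypothetical features: (has_fridge, has_counter, has_grass)
kitchens = [(1, 1, 0), (1, 1, 0), (1, 0, 0)]
fields = [(0, 0, 1), (0, 0, 1)]
unseen_kitchen = (0, 1, 0)  # not part of the training examples

good = train_centroids([(k, "kitchen") for k in kitchens] + [(f, "field") for f in fields])
bad = train_centroids([(k, "cow") for k in kitchens] + [(f, "field") for f in fields])

print(classify(good, unseen_kitchen))  # kitchen
print(classify(bad, unseen_kitchen))   # cow
```

The two models see identical images; only the labels differ, and the second one confidently calls a kitchen a cow.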

Demand for AI, and the need for labeled data to support its development, has dramatically expanded the market for annotation services. Dimension Market Research estimates that the market is worth $838.2 million today and will be worth $10.34 billion within the next ten years. Although there are no exact figures for how many people do labeling work, a 2022 paper puts the number in the "millions."
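
As a rough check on what that forecast implies, the short calculation below derives the compound annual growth rate from the two quoted figures. It assumes "the next ten years" means exactly ten compounding periods, which the source does not state; the function name is illustrative.

```python
def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures quoted above, in millions of dollars; "ten years" is taken literally.
rate = implied_annual_growth(838.2, 10_340.0, 10)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the forecast works out to a little under 29% growth per year.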

Businesses of all sizes depend on workers at data annotation companies to create labels for AI training sets. Some of these jobs pay reasonably well, especially when the labeling requires specialized knowledge (math skills, for example). Others are grueling. Annotators in developing countries are typically paid only a few dollars per hour, with no benefits or guarantees of future work.

A shrinking data supply

So there are humanitarian reasons to look for alternatives to human-generated labels. But there are also practical ones.

Humans can label only so fast. Annotators also have biases that can show up in their annotations and, consequently, in the models trained on them. Annotators make mistakes or get tripped up by labeling instructions. And paying people to do this work costs money.

For that matter, data is expensive in general. Shutterstock is charging AI vendors tens of millions of dollars to access its archives, while Reddit has made hundreds of millions of dollars from licensing data to Google, OpenAI, and other companies.

Finally, data is also becoming harder to obtain.

Most models are trained on vast collections of publicly available data, which owners are increasingly choosing to gate out of concern that their work will be copied or that they won't be given credit for it. More than 35% of the world's top 1,000 websites now block OpenAI's web scraper. A recent study also found that about 25% of the data from "high-quality" sources has been restricted from the main datasets used to train models.

If the current trend of access blocking continues, the research company Epoch AI predicts that developers will run out of data to train generative AI models between 2026 and 2032. That, along with worries about copyright disputes and objectionable content finding its way into public data sets, has forced AI vendors to confront the problem.

Artificial substitutes

At first glance, synthetic data would seem to be the answer to all of these issues. Need annotations? Generate them. More example data? Not an issue. There are no limits.

And this is true to some degree.

"If 'data is the new oil,' synthetic data pitches itself as biofuel, creatable without the negative externalities of the real thing," said Os Keyes, a PhD candidate at the University of Washington who studies the ethical implications of emerging technologies, as reported by Dharia Knowledge. "You can simulate and extrapolate new entries from a small starting set of data."

The AI sector has embraced the idea wholeheartedly.

This month, the enterprise-focused generative AI startup Writer unveiled Palmyra X 004, a model trained almost entirely on synthetic data. According to Writer, it cost only $700,000 to develop, compared with estimates of $4.6 million for an OpenAI model of a similar size.

Microsoft's Phi open models were trained in part on synthetic data. So were Google's Gemma models. This summer, Nvidia announced a model family designed to generate synthetic training data, and Hugging Face, an AI firm, recently released what it claims is the largest AI training dataset of synthetic text.

Synthetic data generation has grown into an industry of its own, one that may be worth $2.34 billion by 2030. Gartner predicts that synthetically generated data will account for 60% of the data used in AI and analytics projects this year.

Luca Soldaini, a senior research scientist at the Allen Institute for AI, pointed out that synthetic data techniques can produce training data in a form that is difficult to obtain through scraping (or even content licensing). For instance, in training its video generator Movie Gen, Meta used Llama 3 to generate captions for footage in the training data. Humans then refined the captions to add more detail, such as descriptions of the lighting.

In a similar vein, OpenAI says it fine-tuned GPT-4o using synthetic data to build ChatGPT's sketchpad-like Canvas feature. Amazon has also said that it generates synthetic data to supplement the real-world data it uses to train Alexa's speech recognition models.

"Human intuition of what data is needed to achieve a specific model behavior can be quickly expanded upon with synthetic data models," Soldaini added.

Artificial dangers

But synthetic data is not a cure-all. Like all AI, it suffers from the "garbage in, garbage out" problem. Models produce synthetic data, and if the data used to train those models contains biases and limitations, their outputs will be tainted in the same way. For example, groups that were underrepresented in the original data will remain underrepresented in the synthetic data.

"The issue is, there's only so much you can do," Keyes said. "Let's say your dataset contains just 30 Black individuals. Extrapolating from them could be beneficial, but if those thirty individuals are all middle-class or all fair-skinned, that is what the 'representative' data will all look like."
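
Keyes's point can be illustrated with the crudest possible form of extrapolation: resampling a small seed set. This is a deliberately simplified sketch. The `income` field and the seed records are hypothetical, and real generators interpolate rather than copy, but they face the same limit of having nothing but the seed to learn from.

```python
import random
from collections import Counter

def extrapolate(seed_records, n, rng):
    """Naive synthetic generation: resample seed records with replacement."""
    return [rng.choice(seed_records) for _ in range(n)]

# Hypothetical seed set of 30 people, all recorded as middle-class.
seed = [{"income": "middle"} for _ in range(30)]

rng = random.Random(0)
synthetic = extrapolate(seed, 10_000, rng)
print(Counter(person["income"] for person in synthetic))  # Counter({'middle': 10000})
```

Ten thousand synthetic people later, the data set is much larger but no more diverse than the thirty it started from.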

A 2023 study by researchers at Stanford and Rice Universities found that excessive dependence on synthetic data during training can result in models whose "quality or diversity progressively decrease." According to the researchers, sampling bias, or inadequate representation of the real world, causes a model's diversity to worsen after a few generations of training (though they also found that mixing in some real-world data helps to alleviate this).
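
The shrinking-diversity effect can be reproduced in miniature. The sketch below is a toy stand-in, not the study's method: the "model" is just a normal distribution refit each generation on a small sample drawn from the previous generation's fit, and `simulate`, the sample size, and the 50% real-data share are all assumptions chosen for illustration.

```python
import random
import statistics

def simulate(generations, sample_size, real_fraction, seed=0):
    """Refit a normal distribution once per generation and track its spread.

    real_fraction is the share of each generation's training set drawn
    fresh from the true distribution N(0, 1); the rest is sampled from
    the previous generation's fitted model.
    Returns the fitted standard deviation after every generation.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 matches the real world
    n_real = int(sample_size * real_fraction)
    history = []
    for _ in range(generations):
        synthetic = [rng.gauss(mean, std) for _ in range(sample_size - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)  # maximum-likelihood estimate of the spread
        history.append(std)
    return history

pure = simulate(generations=200, sample_size=10, real_fraction=0.0)
mixed = simulate(generations=200, sample_size=10, real_fraction=0.5)
print(f"synthetic only: std after 200 generations = {pure[-1]:.2e}")
print(f"half real data: std after 200 generations = {mixed[-1]:.2e}")
```

With synthetic data alone, the fitted spread drifts toward zero because each small sample slightly underestimates the variance and the errors compound. Mixing in fresh real samples every generation keeps the spread anchored.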

Complex models like OpenAI's o1 pose additional risks, according to Keyes, because they may produce harder-to-spot hallucinations in their synthetic data. These could in turn reduce the accuracy of models trained on the data, particularly if the source of the hallucinations is difficult to pinpoint.

Keyes went on, "Complex models have hallucinations; the data generated by complex models contains hallucinations." Furthermore, "the developers themselves can't always explain why artefacts appear with a model like o1."

Compounded hallucinations can produce models that spew gibberish. A research article in the journal Nature describes how models trained on error-ridden data generate even more errors, and how this feedback loop degrades future generations of models. Over generations, the researchers observed, models lose their grasp of more sophisticated information, becoming more generic and frequently generating answers that have nothing to do with the questions posed.

Additional research shows that other kinds of models, such as image generators, are also susceptible to this kind of collapse.

Soldaini agrees that "raw" synthetic data shouldn't be trusted, at least if the goal is to avoid creating forgetful chatbots and homogeneous image generators. To use it "safely," he says, you need to carefully review, curate, and filter it, and ideally pair it with new, authentic data, just as you would with any other dataset.

Failure to do so may eventually result in model collapse, a situation in which a model's outputs become increasingly biased and less "creative," ultimately jeopardizing the model's usefulness. There is a chance that this process will be discovered and stopped before it becomes dangerous.

"Researchers need to review the generated data, improve the generation process, and find safeguards to eliminate low-quality data points," Soldaini said. "The output of synthetic data pipelines needs to be thoroughly examined and refined before being utilized for training; they are not a self-improving machine."
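
The kind of safeguard Soldaini describes might look like the sketch below: filter generated samples, then combine what survives with real data. The heuristics (deduplication, a minimum length, a repetition check), the thresholds, and the function names are assumptions made for illustration, not a description of any lab's actual pipeline.

```python
import random

def filter_synthetic(samples, min_words=5, max_repeat_ratio=0.5):
    """Drop duplicates, very short samples, and highly repetitive samples."""
    seen = set()
    kept = []
    for text in samples:
        normalized = " ".join(text.lower().split())
        words = normalized.split()
        if not words or normalized in seen or len(words) < min_words:
            continue
        # Share of words that merely repeat an earlier word in the sample.
        repeat_ratio = 1 - len(set(words)) / len(words)
        if repeat_ratio > max_repeat_ratio:
            continue
        seen.add(normalized)
        kept.append(text)
    return kept

def build_training_set(synthetic, real, rng):
    """Combine filtered synthetic samples with real ones and shuffle them."""
    combined = filter_synthetic(synthetic) + list(real)
    rng.shuffle(combined)
    return combined

# Hypothetical data for illustration only.
synthetic = [
    "The kitchen has a fridge and a long counter.",
    "the kitchen has a fridge and a long counter.",  # duplicate after normalizing
    "Too short.",
    "cow cow cow cow cow cow cow cow",               # degenerate repetition
    "A field with three cows grazing near a fence.",
]
real = ["A photo of a small apartment kitchen at night."]

training_set = build_training_set(synthetic, real, random.Random(0))
print(len(training_set))  # 3
```

Simple rules like these catch only the most obvious failures; they do nothing about fluent but wrong samples, which is why human review stays in the loop.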

OpenAI CEO Sam Altman has argued that AI will one day generate synthetic data good enough for it to teach itself. But the technology isn't there yet, assuming that's even possible. No major AI lab has released a model trained solely on synthetic data.

At least in the near future, it appears that human intervention will be necessary in some capacity to ensure that a model's training proceeds smoothly.

10/11/2024

AI Text-to-Video Is Here / Faceless YouTube Videos at 81% Automation


10/10/2024

Could Artificial Intelligence Reveal the Secrets of Animal Conversation? | AI Is the Future

Technology and artificial intelligence are revealing how animals communicate and may soon allow us to talk to them ourselves. What can we learn from wildlife? And should we be talking to the animals at all?

Do you want to know what your future holds? A life beyond 150 years old? A world where computers can read our emotions? A planet transformed by unlimited clean energy?

10/09/2024

The Evolution of Artificial Intelligence: 10 Key Stages to Understanding AI

12/24/2023

Do You Know Where Technology Started? 🙆 Let's Find Out
AL-JAZARI, A MUSLIM SCIENTIST 🥰
Father of Robotics
Ismail al-Jazari

Born: 1136 CE, Jazira, Artuqid Dynasty
Died: 1206 CE
Religion: Islam
Era: Islamic Golden Age

For more details, wait for our first long video.

Address

1071 Santa Rosa Plaza, Santa Rosa, CA 95401
California City, CA
96162
