OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B

Community Article Published October 18, 2024

PandorAI1995

As seen in my previous study, Microsoft Florence-2-base and Alibaba Cloud Qwen2-VL-2B are two models capable of analyzing artworks. Through object detections and image descriptions, we have been able to assess their abilities to recognize the name and style of an artist, define the period they were made and interpret the artwork.

While some problems subsided, especially when analyzing abstract art, both models delivered good results. Qwen2-VL-2B was even able to identify paintings by Claude Monet and Vincent Van Gogh and deliver extremely accurate descriptions of their artworks, which might have been based on its own knowledge as well. The paintings in this case were some of the artists' most famous: in Monet's case, it was Impression, Sunrise (1872); in Van Gogh's case it was The Starry Night (1889). But the identification of the artworks might have been caused by the signatures the model recognized and processed.

This leads us to wonder how these models often used for visual content are also able to process, transcribe and understand textual content.

We will thus see in this new article how Florence-2-base and Qwen2-VL-2B (used on GoogleColab) can analyze text within an image.

The analysis of text in an image from the medieval era

The first image used for this study is a medieval image containing elements of text in latin and french: in this case, the Book of hours, an illuminated parchment manuscript, from the West of France (possibly Nantes?), written around 1450.

The analysis of the Book of Hours with Florence-2-base

We analyzed the text within the image of this medieval Book of Hours with Florence-2-base. This led to the following results:

With the CAPTION command, the model recognized the layout of a book (“A picture of a group of people on a page of a book”).

When tasked with the “DETAILED CAPTION” command, it changed its category to a painting with a mention of the text but without processing it : “In this image we can see a painting. In the painting there are few people. Also there is something written on the painting”.

With the “MORE DETAILED CAPTION” command, the model went on to fully describe the image but without taking notice of the textual elements and without situating the time period and context: “A picture of a group of people. There is a woman in a white dress standing in front of a man. The man is wearing a blue robe. There are trees behind the people”.

We proceeded to analyze the textual content in latin in this Book of Hours with the OCR task prompt. The model delivered the following answer: “placado dio ps”. It was able to only process a part of the text and misunderstood it.

The analysis of the Book of Hours with Qwen2-VL-2B

We analyzed the same image from the Book of Hours with Qwen2-VL-2B, in order to see the abilities of this model to process, transcribe, and analyze the text within the image.

First, the model was asked to explain the text and the image to which it gave the following description:

“The image depicts a medieval manuscript page featuring a detailed illustration and ornate calligraphy. The illustration is set within a rectangular frame, which is bordered by a floral and vine pattern. The central scene shows a group of people, including a woman and a child, standing in a garden or field. The woman appears to be speaking to the child, who is kneeling or bowing. The background includes a landscape with trees and a hill, suggesting a rural or pastoral setting.\n\nThe text on the page is written in a decorative script, with the words "Dolce et Icarus" prominently displayed in the center. The script is filled with intricate patterns and flourishes, indicating a high level of craftsmanship. Surrounding the text are additional decorative elements, including floral motifs and swirling patterns, which add to the overall aesthetic of the page.\n\nThe calligraphy is elegant and flowing, with each letter carefully crafted to create a harmonious and visually pleasing composition. The use of color is also notable, with the text and decorative elements being painted in various shades of blue, green, and gold, which contrasts beautifully with the white background of the illustration.\n\nOverall, the image is a beautiful example of medieval manuscript art, showcasing intricate illustrations and ornate calligraphy, set within a richly decorated frame”.

The model focused in this description on the aesthetic characteristics of the image, such as the colors, the characters in the scene and the qualities of the medieval Catholic gothic handwriting: “The calligraphy is elegant and flowing, with each letter carefully crafted to create a harmonious and visually pleasing composition”. It managed to correctly date the manuscript, from a medieval era. The model indicates there are elements of text but did not transcribe it whole, only mentioning two words "Dolce et Icarus", which it misunderstood and failed to identify correctly.

In a second attempt, we pushed the model to transcribe the text, to which it gave the following result: “The text in the image is:\n\n"Placido dixit ps. Alexius quidam exaudiet dixit quae omnis”. This transcription still contains inaccuracies and over-delivered more text than actually present in the image. This error may be based on previous knowledge the model might have of similar medieval texts. In any case, the transcription given by Qwen2-VL-2B was more elaborate than the one obtained with Florence-2-base.

In a third attempt, the model was asked to translate the text written in this image which led to the following result: “The text in the image is:\n\n"Placido dixit ps. Alexius quidam exaudiet dixit quae omnis."\n\nThis translates to:\n\n"Placido said ps. Alexius, who once heard, said that all are”. This translation from latin to english contains mistakes and follows the miscomprehension to identify the medieval handwriting and the latin text. The literal translation of the latin phrase analyzed by Qwen2-VL-2B would be: "Ps. Alexius said calmly that someone would listen to what everyone said". As we can see, as opposed to other languages better understood by Qwen2-VL-2B, the model has difficulties with latin text and translation. It might also have trouble analyzing handwritten Gothic letters, where its initial misunderstanding and following translation errors might have originated.

The analysis of a famous official historial text: the Constitution of the United States (1787)

The next image used as an example to analyze with Florence-2-base and Qwen2-VL-2B contains the text of the Constitution of the United States (1787). We only used a portion of this official document as the full picture was either of low quality (so the text could not be processed correctly) or of a very high quality but then too heavy to process for the two models on Google Colab with the GPU used. Here is the exact image used:

The analysis of the U.S. Constitution with Florence-2-base

This image with the text of the U.S. Constitution was described by Florence-2-base with the CAPTION command as: “The declaration of the United States of America”. Interestingly enough, the model seemed to have partly recognized that it had to deal with U.S. history but in its answer seemed to have mistaken the Constitution for the Declaration of Independance (1776).

We obtained the following description with the DETAILED_CAPTION command: “The image shows the text of the United States Constitution, with the words "We the People" written in bold, black lettering. The paper is yellowed with age, and the text is surrounded by a decorative border”. This time, Florence-2-base correctly processed and analyzed the text and delivered an accurate description.

With the MORE_DETAILED_CAPTION command, Florence-2-base seemed to have difficulties to process the whole text: “The image is a photograph of a document titled "We the People of the United States" written in black ink on a beige-colored paper. The document appears to be old and worn, with some creases and wrinkles on the edges. The text is written in a cursive font and is centered on the page. Below the text, there is a paragraph of text that reads "Insure domestic tranquility, provide for the common defence, promote the and our Constitution, and establish this Constitution for the United." The text also mentions that the document is signed by President John F. Kennedy”. While the model correctly identified the aesthetical aspects of the image and linked it to U.S. History (mentioning the Constitution inside its description), it is unclear where exactly it could see a signature of President John F. Kennedy — who wouldn’t be President more than 170 years after this document was written.

The OCR analysis of the U.S. Constitution by Florence-2-base led to the following transcription: “We the Peopleof the Unitedinsure domestic Franquility, provide for the common defence, promote theand our Posterity, all ordain and establish this Constitution for the United!Kircle.1.Salim. 1. All lgustative Prove been granted to existin a borgf of the estatitian, which shall consist of of jonat and MowEtition. 2. The King of Apprentitias had been repurposed of Mumbera from every year of the Rght of the mewin of the nithi and mowinand that shall be a lignation requisits for them of them to be made in the day of feuiry from and from the aign of the trististatie,and not shall shall be distributed to all that is that of them in that of the world shall be shown.Representative and lomme ouestitian are attributing in which they shall be made, which may be intended to within the limn courting of their regatitie.Jambera, who shall be determined by adding to the hankle, of the Chamber of Commerce, who douned to either of them and calding thaiisand when they are given to them, the actual Commemoration shall be of lour name, when the gur of the Congrst of the United Statesand where they ought to be of them, in which Commemoration shall not be of the Comemorations shall be the fom of the Pomeranians of the Bemnistaties,and when the euigestion shall have that of that at the end of the Commemory shall be in the form of the Cemmation of the Gomnstitian challe,intididle Brouand, that shall have at that of at the Apprentiture, and that of lommentation shall be for the cemmentation shal of the other of the Chomptiple Challe.goht Prouand. thar, Maphrodisia, and thar shall be at that end of Apprmentation, and for the Commeration for the Commenation foe the Comnstition and Gomrstition of the Fomnistation.When resanciation in the Representation from any fommentary fom the Commation fom and Gorms, there.The Hary of the Imnstation shall shal and thher and ifher shall have thar and ifhe shall have the fon and fomr of fomnendation.Sutition. 3. The Bremts of the Inited States shall be provided in the Compemence of the first Eclation, they shall as wnited as a w”.

Florence-2-base seems to have had spelling and comprehension issues to transcribe the text with the OCR task prompt, which might be due to the cursive handwriting. It got more and more confused as it processed the text. The following words do not even make sense: “The King of Apprentitias…in the form of the Cemmation of the Gomnstitian…shall be for the cemmentation shal of the other of the Chomptiple Challe.gogh Prouand. thar, Maphrodisia…The Bremts of the Inited States”. While the model did try to deliver a long transcription, it is clear the cursive handwritten style has led to confusion and the historical words and expressions employed might have led to miscomprehensions.

The analysis of the U.S. Constitution with Qwen2-VL-2B

We have processed in a second phase the text of the US Constitution with Qwen2-VL-2B. The results were much more accurate overall. It is interesting to notice from these results that the model seems to have a large knowledge of U.S. History, being able to describe things in its descriptions that were mentioned neither in the image, nor in the task instructions.

The first instruction given to Qwen2-VL-2B was to “Explain the image”.

Here is the description delivered: “The image is a photograph of the United States Constitution, specifically the first page. The text is written in a formal, cursive script, and the document is printed on a parchment-like paper. The text is divided into sections, each with a heading and a paragraph of text. The document outlines the structure and powers of the United States government, including the legislative, executive, and judicial branches. The text is written in a mixture of English and Latin, reflecting the historical and cultural context of the time when the Constitution was written.”

While the description is much more accurate than with Florence-2-base, the results still raise some questions: the text is not exactly written in a mixture of English and Latin as there are only three Latin phrases in the whole text, which are not even in the portion of the text used in this analysis. This means the model knows about these sentences being in the original text, and even without processing the exact part of the image made an interpretation based on its knowledge of the text in its entirety.

The next attempt tried to push Qwen2-VL-2B at transcribing the extract of the Constitution, with the command “Transcribe the text in the image”:

"We the People\n\ninsure domestic Tranquility, provide for the common defence, promote the\n\nand our Potentia, ordain and establish this Constitution for the United\n\nSection 1. All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives.\n\nSection 2. The House of Representatives shall be composed of Members chosen every second Year by the People of the several States, and the Electors in each State shall have Qualifications requisite for Electors of the most numerous Branch of the State Legislature.\n\nNo Senator shall be a Representative who shall not have attained to the Age of twenty-five Years, and have seven Years a Citizen of the United States, and who shall not, when elected, be an Inhabitant of that State in which he shall be chosen.\n\nRepresentatives and direct Taxes shall be apportioned among the several States which may be apportioned among them, according to their respective Numbers, which shall be determined by adding to the whole Number of the People, including those bound to Service for a Term of Years, and excluding Indentured Servants not taxed, three-fifths of all other Persons. The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of Ten Years, in each Case as shall be directed by Law. The Number of Representatives shall not exceed one for every thirty Thousand, but each State shall have at least one Representative; and until such enumeration shall be made, the State of New Hampshire shall be entitled to three Senators; the State of Vermont shall have three Senators; the State of New York shall have two Senators; the State of Pennsylvania shall have two Senators; the State of Delaware one; Maryland one; Virginia ten; North Carolina five; South Carolina five; and Georgia five.\n\nWhen vacancies happen in the Representation from any State, the Executive Authority thereof shall fill such Vacancies.\n\nThe Senate of the United States shall be composed of two Senators from each State, chosen by the Legislature thereof for two Years, and each Senator shall have one Vote.\n\nImmediately after they shall be apportioned in consequence of the first Election, they shall be divided as nearly as possible into three equal Parts, as may be found necessary to represent the People of each State with equal Equity. The Senate of the United States shall be composed of two Senators from each State, chosen by the Legislature thereof for two Years, and each Senator shall have one Vote.\n\nThe House of Representatives shall be composed of two Senators from each State, chosen by the Legislature thereof for two Years, and each Senator shall have one Vote.\n\nThe Senate of the United States shall be composed of two Senators from each State, chosen by the Legislature thereof for two Years, and each Senator shall have one Vote.\n\nThe Senate of the United States shall be composed of two Senators from each State, chosen by the Legislature thereof for two Years, and each Senator shall have one Vote.\n\nThe Senate of the United States shall be composed of two Senators from each State, chosen by the Legislature thereof for two Years, and each Senator shall have one Vote.\n\nThe Senate of the United States shall be composed of two Senators from each State, chosen by the Legislature thereof for two Years, and each Senator shall have one Vote.\n\n".

The beginning of the transcription is not fully accurate to the original Constitution — yet this is due to the fact a portion of the text in the picture is missing in the image used, so the literal transcription is correct. The transcription ends before the last part of our image (at the beginning of the Section 3). While the transcription is very faithful to the original text, with only a few errors, the model did go in a loop at the end.

When asked who the author of this text was, the model gave this answer: “The text in the image is the United States Constitution, Article I, Section 1”.

I have tried to confuse Qwen2-VL-2B by asking if it could explain to me “The Constitution of the U.K. in this picture?”. In the UK, there has never been a constitution as such but various treaties, principles and conventions from which its system was based. I wanted to see its abilities to process the image, even when given incorrect instructions. To deal with this “trolling” attempt, the model ignored entirely the image containing the text of the U.S. Constitution and reinterpreted the answer based on its own knowledge of the United Kingdom laws and governmental structure:

"The UK Constitution is the framework of laws and institutions that govern the United Kingdom. It is a complex system that includes both written and unwritten laws, as well as the roles and responsibilities of the various branches of government. Here is a brief overview of the key components of the UK Constitution:\n\n1. The Monarchy: The UK is a constitutional monarchy, with the monarch (the Queen or King) as the head of state. The monarch has a ceremonial role and does not have any real power.\n\n2. The Parliament: The UK has a bicameral Parliament, consisting of the House of Commons and the House of Lords. The House of Commons is the lower house and is elected by the people. The House of Lords is the upper house and is appointed by the monarch.\n\n3. The Government: The government is made up of the Prime Minister, who is the head of government, and the Cabinet, which is the group of ministers who work together to run the government. The Prime Minister is appointed by the monarch and is responsible for the overall governance of the country.\n\n4. The Judiciary: The UK has a separate judiciary, which is the branch of government responsible for interpreting the law and ensuring that it is applied fairly. The judiciary includes the Supreme Court, the High Court, and the Court of Appeal.\n\n5. The Law: The UK has a comprehensive legal system that includes both common law and statute law. Common law is based on the decisions of the courts and is used to resolve disputes. Statute law is made by Parliament and is binding on all levels of government.\n\n6. The Constitution Act 2006".

Qwen2-VL-2B not only delivered a long description of the U.K. system, it also managed to establish a layout of its presentation (with bold font) and categorized the different aspects of U.K. laws and institutions before analyzing them.

This analysis of the U.S. Constitution shows that while Florence-2-base has good abilities for text in image description and OCR processing, there are still many flaws in its results (especially with this handwritten document). On the contrary, Qwen2-VL-2B was much more successful in its analysis but it is interesting to see that its transcription and description was majorly based on its own knowledge around the context in the text than on the image itself.

The analysis of an English handwritten letter by Jane Austen

The next example of our study to evaluate the OCR abilities of VLM models is a letter handwritten by famous author Jane Austen (1775-1817). In this text from 1814, the author writes to her niece Anna.

The analysis of the Jane Austen letter with Qwen2-VL-2B

We have processed this letter with Qwen2-VL-2B.

The first command we sent to the model was to explain the text. We did not give any further instructions or indications. Since the part of the letter in this image is not signed, it was not clear who was the author and what it was mentioning.

Here are the results obtained: “The image contains two handwritten pages of text. Here is a transcription of the text from both pages:\n\n---\n\n**Left Page:\n"I am very much obliged to you, my dear Anna, I should be very happy to come. I see you again if I could, but I have not a day disengaged. We are expecting your Uncle Charles Romerons, and I am going to the next day to Little Wixmoor to let some Wife Moore who are to stay how till Saturday. Then comes Sunday & Day Gibbons, and on Sunday Uncle Henry takes us both to Charleston. It is therefore really really impolite, but I am very much obliged to you to Mr. B. Leffing for washing. We should send plenty to say, doubt. I like to send her just once, but I am not sure of the other. I am glad she has written to you. I like just Converse to her, just Converse intended about each other. They are but one person from Mr. & Mrs."\n\n---\n\nRight Page:**\n"We all came away very much pleased with our visit. I suppose you will be glad to hear from you. I have been just sending a very account of you to Mr. B. Leffing, with a description of your Uncle J. & Mrs. J. We all went to see Mr. B. Leffing last night, to see Mr. B. Dourant in Charleston. I do not think she was much expectation for either. She is seen Mr. & Mrs. J. very little. I am sorry I want something more from either. She is seen Mr. & Mrs. J. very little."\n\n---\n\nThese are two handwritten pages of text, likely a letter or a diary entry. The content appears to be a personal letter, discussing various topics such as travel, family, and personal relationships. The writer expresses gratitude and satisfaction with their visit, and mentions specific individuals and events.’]”.

There are several mistakes in the transcription. The model mentions there are two pages and indicates layout indications (which is incorrect since there is only one page in the image). For instance, the “Uncle Charles Romerons” mentioned is an error from the original text “We are expecting your Uncle Charles tomorrow”. Another mistake is the repeated error of “Charleston”, instead of Chawton, Jane Austen's hometown. The description of the letter as a personal one, discussing topics related to family is accurate.

The next instruction given to Qwen2-VL-2B was to “Transcribe the text from this letter written by Jane Austen”. The name of the author was added in order to avoid further confusion. It was also indicated to see if the model could deliver a better transcription of the letter if it had more context.

"Dear Anna,\n\nI am very much obliged to you, my dear. I should be very happy to come, but I have not a day disengaged if I could, but I have not a day disengaged if I could. We all came away very much pleased with our visit. I suppose you will be sorry to hear that my aunt has been just sending a very account of you to Mr. Bingley, with a description of your person. Mr. Bingley left us this morning. The hopes of the ladies in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause, so much in his cause".

The results with these instructions was even less accurate and barely made any sense. It seems Qwen2-VL-2B has a very good literary knowledge of Jane Austen’s books. Indeed, it makes several mentions of a “Mr. Bingley”, a character from Austen’s Pride and Prejudice who is not even mentioned in the original letter. The text is barely a transcription of the text in the image, with sentences that are not present, a lot of repetitions and a lack of elements from the original letter. The model also started to go in a loop at the end. From the test runs I’ve made, it seems clear that having the context sometimes cloud Qwen2-VL-2B thoughts and lead to subjective biased results which have more to do with its own core knowledge than with the actual text present in the image.

In a third attempt at analyzing and OCR processing the Jane Austen letter, we asked Qwen2-VL-2B the following prompt command: “Explain this letter by Jane Austen to her friend Anna mentioning her hometown of Chawton and Anna's Uncle Charles they will meet tomorrow”.

Here is the result to these instructions: "This letter is a personal letter written by Jane Austen to her friend Anna. The letter is dated January 29, 1814, and it is addressed to Anna, who is mentioned as "Dear Anna." The letter discusses the author's plans to visit her hometown of Chawton and her uncle, Charles, who is also mentioned in the letter. The author expresses her excitement about the upcoming visit and mentions that they will meet tomorrow. The letter also touches on the author's feelings about her uncle, who is described as "very much oblige'd to you," and the author expresses her love for her uncle. The author also mentions that she has been writing to her uncle about her plans and that he has been sending her letters about her uncle's health. The author also mentions that she has been writing to her uncle about her plans and that he has been sending her letters about her uncle's health. The author also mentions that she has been writing to her uncle about her plans and that he has been sending her letters about her uncle's health. The author also mentions that she has been writing to her uncle about her plans and that he has been sending her letters about her uncle's health".

And the text goes on like this for pages… Without making much more sense. While the start of the description is perfectly accurate following the given instructions, the model then begins to go in a loop with the mention of the uncle. In the end, it seems that too much context have a tendency to hurt Qwen2-VL-2B abilities to process and transcribe text.

In a last attempt to process the Jane Austen letter, the model was given these instructions: “Explain in a few words what this letter by Jane Austen about her hometown of Chawton is talking about”.

The results, still containing mistakes, were the following: 'The letter by Jane Austen, written in 1814, discusses her hometown of Chawton. She expresses her love and affection for the place, mentioning her uncle Charles Tomson and her aunt Mrs. Moore. She also mentions her uncle Henry, who is visiting her, and her cousin Edith, who is staying with her. The letter also touches on her feelings about her cousin's marriage and her own feelings about the situation'.

Qwen2-VL-2B does not seem able to transcribe correctly the name of “Uncle Charles”, who does not have a last name indicated in the letter but who is just supposed to arrive "tomorrow" (and not "Tomson"). The description is still more accurate than the last results. The cursive nature of the letter and the lack of a clear signature obviously have been difficult for Qwen2-VL-2B to process.

The analysis of a text in French: the Vincent Van Gogh letter

For the next example in this study, the text analyzed by Florence-2-base and Qwen2-VL-2B is an extract from a letter written by artist Vincent Van Gogh to his brother Theo in French, written during his stay in Arles and dated from November 21, 1888.

The analysis of the Vincent Van Gogh letter with Florence-2-base

The first analysis of the letter written by Vincent Van Gogh was made with Florence-2-base. With the "CAPTION" prompt, it led to a poetic yet incorrect statement: “A handwritten letter from a man to a tree”. While there is a tree present in the drawing inserted inside the letter, it was not clearly the recipient.

With the "DETAILED_CAPTION COMMAND" the description was: “The image shows a page from a book with a drawing of a man in a hat and a woman in a dress. The page is filled with text and the drawing is detailed and lifelike. The man is wearing a hat with a wide brim and a long coat, while the woman is wearing long, flowing dress with a high neckline. Both of them have a determined expression on their faces, as if they are ready to take on whatever challenge lies ahead”.

The description of the letter and drawing is entirely different. While confusely looking for a woman in this image, I came to the conclusion Florence-2-base might have mistaken the tree for a woman wearing “a long, flowing dress with a high neckline”. The interpretation given by Florence-2-base is confusing: “Both of them have a determined expression on their faces, as if they are ready to take on whatever challenge lies ahead”. I was not able to find such expressions on the “faces”, especially not on the tree’s.

The MORE_DETAILED_CAPTION led to a slightly more successful result: “The image is a black and white sketch of a man sitting on a bench with a tree in the background. The man is wearing a striped shirt and a hat, and he is holding a book in his hands. He appears to be reading from the book. The tree has a large trunk and branches that are twisted and gnarled, and there is a small bird perched on one of the branches. The background is a light blue sky with a few clouds. The sketch is done in a sketchy, hand-drawn style, with loose lines and shading. The text on the image is written in French, and it appears to have some kind of handwriting. The overall mood of the image seems to be peaceful and contemplative”.

Still, in the image OCR processed by Florence-2-base there are no colors, the character is not sitting on a bench, nor is he wearing a striped shirt or reading. It is still correctly assumed that the “sketch is done in a sketchy, hand-drawn style, with loose lines and shading”. Also, it mentions that the “text on the image is written in French, and it appears to have some kind of handwriting”, which is accurate. The general meaning of the letter has been also correctly processed by the model: “The overall mood of the image seems to be peaceful and contemplative”.

The following prompt was to OCR process the letter and it led to the following results:

“Est ce qu'il o ont lu le livre de Silvestreun Euy Delacroir amis que l'articleur la couleur dans la grammaus desarts, clactes/in ole ch. Blane.Remander leundone cela de mau pustelJmon v'édo n'onl pas la celu 2nilsle lisent -f'épense mus à Rembruneplus gu'il de ne peut paradre dans meseludes.Dolci croquis de ma clermie tulle entrainencorea m dameur. J'immense d'igne cilroncomme soleil, ciel yert jum c nuyes2023. l'éternacn violat le semmer elr'abre bleu de pres/tolle de 30”.

As we can see, Florence-2-based clearly had issues transcribing a text in French (especially an handwritten one). The text it produced does not make much sense apart from a couple of correct words (“couleur…croquis… soleil… ciel… bleu”). When reading the letter, you can understand how Florence-2-base processed the words in this letter. While there are spelling mistakes, the name of artist Delacroix is almost accurately transcribed.

The analysis of the Vincent Van Gogh letter with Qwen2-VL-2B

In a second phase of the study, the Vincent Van Gogh letter was analyzed by Qwen2-VL-2B. The handwritten text in French led to some difficulties for this model as well, especially to transcribe the French language accurately.

The first task instruction given to Qwen2-VL-2B was to “Transcribe the text in the image”. Here is the result of the transcription:

"**Text:\n\nEst ce qu'il ont lu le livre de Delacroix ainsi que l'article sur la couleur dans la grammair des arts, fin de Ch. Blanc. Demandez leur donc cela de ma part et venons s'ils ne sont pas la celui qu'ils le lisent. J'espère mus à Rembrandt plus qu'il ne peut paraître dans mes études.\n\nDolci croquis de ma dernière tôle entrait encore un demeur. Immense pique cèton comme Soleil. Ciel vert jaun à nuages roses. La bermam violat le demeur et l'ambre bleu de prasé, toile de 30\n\nTranscription:**\n\n"Is it true that they have read the book of Delacroix, as well as the article on color in the grammar of Ch. Blanc? Ask them about this part of my work, and come if they are not the one who reads it. I hope to study Rembrandt more than he can be seen in my studies."\n\nDolci sketches of my last painting still contain a demure. Immense pique cèton like the sun. Sky is green yellow with clouds of roses. The bermam is violet, the demure and the amber blue of prasé, a canvas of 30’".

This transcription shows already progress in the OCR process of the letter, as compared to Florence-2-base. Whereas Florence-2-base mainly understood and transcribed some words in particular, the results with Qwen2-VL-2B were much more accurate: the name of Delacroix was correctly transcribed this time, as well as the one of Rembrandt. Even if there are problems to acknowledge certain words (such as “mus…cèton… La bermam violat le demeure…) and spelling mistakes such as “jaun” for “jaune”, the sentences make already more sense and are more contextualized. Without even giving the instruction, the model also delivered a translation in English of this transcription which interestingly enough managed to understand some of the words which were mistyped in French (such as “yellow” as opposed to the “jaun” in the French version). Overall, the letter is very readable in its transcription and makes more sense with Qwen2-VL-2B, even if there are still some mistakes.

In order to obtain an even better transcription of the French letter, we tasked the model to “Transcribe the French text in this image”, hoping the indication of the text being written in French would help. Unfortunately, the results were exactly the same, with no improvement.

In another task, we then proceeded to ask Qwen2-VL-2B to explain the text, to which the model did not give much more context, but only delivered this result: “["The text in the image appears to be a handwritten note in French. Here is a transcription of the text:\n\n—“. After this brief description, the model delivered the very same transcription that we obtained previously in the last two attempts.

To try and refocus the attention of the model on the context of this letter, we gave it the instruction to “Explain this letter by Vincent Van Gogh”.

Here is what Qwen2-VL-2B delivered: "This letter by Vincent van Gogh is written in French and appears to be a personal note or letter. The content of the letter is quite detailed and introspective, discussing various aspects of life and art.\n\nHere is a breakdown of the letter:\n\n---\n\nLetter by Vincent van Gogh\n\n---\n\n**Subject:\n- The book of Delacroix\n- The article on color in the grammar of art\n- The article on color in the grammar of art\n\n---\n\nBody of the Letter:\n\n---\n\nDear [Recipient],\n\nI hope this letter finds you well. I am writing to you today to discuss a few things that have been on my mind recently.\n\nFirstly, I have been reading a book by Delacroix, which has been quite interesting. The article on color in the grammar of art has also been quite helpful. I have been thinking about how color can be used to create a mood or atmosphere in a painting. It seems like a very important aspect of art.\n\nSecondly, I have been thinking about how color can be used to convey emotions. I have been trying to apply this knowledge to my own work, but I find it quite challenging. I often feel like I am not doing it justice.\n\nLastly, I have been thinking about how color can be used to create a sense of space in a painting. I have been trying to incorporate this idea into my work, but I am not sure if it is working.\n\nI hope you can understand my thoughts. I am always looking for ways to improve my art and I am grateful for your advice and guidance.\n\nBest regards,\n\nVincent van Gogh\n\n---\n\nSignature:\nVincent van Gogh\n\n---\n\nDate:\n[Date of the letter]\n\n---\n\nNote:**\nThis letter is written in a personal and informal style, with a mix of French and English. The letter discusses various aspects of art and color, and the writer seems to be reflecting on his own experiences and thoughts".

When pushed with indications, the OCR process and subsequent analysis of the Van Gogh letter by Qwen2-VL-2B led to mixed results. On one hand, the model delivered a rather accurate description, explanation and context of the letter. On the other, it seems to add a lot of added indications, not even present in the original letter, but which seems to categorize it even more as a letter and to better explain the content: “Body of the Letter…Dear [Recipient]… I hope this letter finds you well” are indications that this document is a letter as well as expressions usually used in this context but they are nowhere to be found on the original image. This has led the model to call the person this letter was addressed to as “Recipient” (instead of its real recipient, Vincent Van Gogh’s brother Theo). Also, the model seems to have been confused by the subject, which mentioned both a book and the artist Delacroix, describing in this transcription that Van Gogh is writing about "a book by Delacroix" The model also misinterpreted there were elements in English in this text, while it is fully in French. Still, its final interpretation is not incorrect: “The letter discusses various aspects of art and color, and the writer seems to be reflecting on his own experiences and thoughts”.

Conclusion of the OCR analysis of the Vincent Van Gogh letter

As we can see with these two analysis delivered by Florence-2-base and Qwen2-VL-2B, the models have more difficulties to process and analyze content in French. If the results were more accurate with Qwen2-VL-2B, there were still errors and misinterpretations. This example also shows once again that Qwen2-VL-2B does not gain much more accuracy in its results if it is overwhelmed by indications and context. It seems to have more difficulty to explain the text and the image with the context given in the prompt and will then refer mostly to its previously acquired knowledge. The description with Florence-2-base started off as extremely vague and pointless at first but did gain in accuracy with the instructions for more detailed descriptions.

The analysis of a printed newspaper: the front page of The New York Times describing the sinking of the Titanic

In this comparative study of Florence-2-base and Qwen2-VL-2B abiltiies to OCR process and analyze several kinds of documents, we have obtained descriptions of a medieval-era Book of Hours, the official text of the U.S. Constitution, an handwritten letter in English by Jane Austen and an handwritten letter in French by Vincent Van Gogh. We are now going to process a different type of textual image, in this case the printed front page of an issue of The New York Times from 1912, describing the sinking of the Titanic.

The analysis of The New York Times front page with Florence-2-base

The front page of The New York Times was analyzed by Florence-2-base with the same captions and instructions used for the previous examples.

First, with the “CAPTION” prompt the model gave this very brief yet accurate description: “A newspaper front page with a picture of a ship in the background”.

The “DETAILED_CAPTION” was more descriptive and even more accurate in the context the model processed from the image: "The image shows the front page of the New York Times newspaper from April 12, 1912, with a picture of the Titanic sinking in the center. The headline reads "Titanic Sinks Four Hours After Hitting Iceberg". Florence-2-base managed not only to recognize the newspaper title, the date as well as the event described (also from the picture). It also succeeded in contextualizing the text within the image. It successfully situated the image in the center as well as the headline.

The “MORE_DETAILED_CAPTION” delivered even longer and more accurate results: "The image is a newspaper clipping from The New York Times. The headline reads "Titanic Sinks Four Hours After Hitting Iceberg: 866 Rescued by Carpathia, Probably 1250 Perish, Ismay Safe, Mrs. Astor Maybe, Noted Names Missing". Below the headline, there is a black and white photograph of the Titanic sinking. The photograph shows the Titanic in the water with smoke billowing out of its chimneys. On the left side of the image, there are two smaller photographs of the ship, one of a man in a military uniform and the other of a woman in a dress uniform. The text below the photographs is in black text". The model successfully analyzed the textual content and correctly transcribed the headlines and names. It mostly gave an accurate description of the visual content, situating the black and white photograph of the Titanic "below the headline". However, it has to be noted the ship in the picture is not yet sinking (the model did contradict itself after saying the photograph was of the sinking of the Titanic when it adds that it "shows the Titanic in the water with smoke billowing out of its chimneys"). The other images analyzed by Florence-2-base - which identifies them incorrectly as photographs of the ship - are "a man in a military uniform", which would the picture of the Captain E.J. Smith, and a picture of a "woman in a dress uniform" (that is not in fact actually present in our image).

Finally we asked Florence-2-base to OCR process the newspaper front page which led to this result:

"All the News That'sTHE WEATHER.Fit to Print."NEW YORK, TURSDAY, APRIL 14, 1912-TWENTY-FOUR FAGER,ONE CENTTitanic SINKS FOUR HOURS AFTER HITTING ICEBERG;866 RESCUED BY CARPATHIA, PROBABLY 1250 PERISH;ISMAY SAFE, MRS. ASTOR MAYBE, NOTED NAMES MISSINGCol. Astor and Bride,Biggest Liner PlungesIsidor Strauss and Wife,to the Bottomand Maj. Butt-Aboard,at 220.M.RULE OF SEA FOLLOWEDRESCURSERS THERE THERE LATE LATEWomen and Children Put OverExpected to Pick Up the Few Hun-to be Safe on Carpathia.Lifeboat,PICKED UP AFTER 4 HOURSWOMEN AND CHILDREN FIRSTVintage Calls for All 8 HOURSCarpathia Cruising to beSafe to be Carpathias.Wanted to Call Us All 8 HoursWanted Call for All 10 HOURSVintage Call For All 8 HOURSSEA SEARCH FOR OTHERSVendors and Levens.Vintage Carpathians Ruining to beVintage Max Unshelter, Inc.The Call for Ushelter.The Carpathian Cruising, Inc.,The Call For Ushelmer, Inc..The Call Of Ushelters Inc.Head of THE LINE ABOARDOLYMPIA SENDING FOR THE NEWOlympic Race Board, Ine.The Lost Titanic Being Towed Out of Belfast Harbor.LATTER REPORT BENDING ONS.The last titanic being Towed out of the Belfast Harbor.Include Bruce, Imary, Mrs. Wiser, Mr. H. and Herrs, of Herr's, and an inconspicuous team, theCAPE RACE N. N. A. April 15,CAPE N. T.T. TOWE, TOWED OUT OF THE SAVED.Including Bruce, Irsay, Mrs Wiser. and Mrs. Miser, of H. & Harris, of Surprised more than the sumptingof the titanic, the ship was the first of its kind.The ship was named after the first-of-its-kind, and it was the last-of its kind, and the last of its own.The titanic was named as the "Towel Out of the Belfast Harbor."CAPE LACE N., Tuesday, April 19, H. H., and his son, the "Belfast Harbor."Cape N. R. Tawe, is the only one of the two-year-old captain of the White Whale. The titanic had been named asCAPE H”.

In the midst of a large quantity of text in various categories (headlines, articles content…), the model seems to have had issues to correctly transcribe it. While the transcription delivered by Florence-2-base is partly accurate, there are some issues to differentiate uppercase from lowercase printed content. Some parts were perfectly well transcribed while some others (“the two-year-old captain” for example) clearly have not been correctly processed by the model, which seems to have gone in loop at times when it had difficulties to understand the general layout and context of the page and parts of the article.

Overall, the results of the analysis of The New York Times front page with Florence-2-base were accurate and moreso than previous ones. This might be due to several reasons: the text being printed and not handwritten ; the English language ; the sinking described being a world-famous event ; the context being much more recognizable ; the layout of the newspaper being easier to understand ; the textual context being much clearer.

The analysis of The New York Times front page with Qwen2-VL-2B

We analyzed The New York Times front page by Qwen2-VL-2B as well. Here are the results obtained when asking the model to explain the image:

"The image is a scanned copy of the front page of The New York Times from April 16, 1912, reporting on the sinking of the Titanic. The headline reads, "TITANIC SINKS FOUR HOURS AFTER HITTING ICEBERG; 866 RESCUED BY CARPATHIA, PROBABLY 1250 PERISH; ISMAY SAFE, MRS. ASTOR MAYBE, NOTED NAMES MISSING."\n\nThe article details the events leading up to the sinking of the Titanic, including the ship's collision with an iceberg and the subsequent rescue efforts by the Carpathia. It mentions that 866 people were rescued, with the majority of them being from the Carpathia. The article also notes that 1250 people are presumed to have perished, with Mrs. Astor being one of the few survivors.\n\nThe article also includes a partial list of the saved passengers, including names such as Mrs. Astor, Mrs. Widener, Mrs. H.B. Harris, and an incomplete name suggesting Mrs. Astor's name. It also mentions the partial list of the saved passengers, including names such as Mrs. Astor, Mrs. Widener, Mrs. H.B. Harris, and an incomplete name suggesting Mrs. Astor's name.\n\nThe article also includes a partial list of the saved passengers, including names such as Mrs. Astor, Mrs. Widener, Mrs. H.B. Harris, and an incomplete name suggesting Mrs. Astor's name. It also mentions the partial list of the saved passengers, including names such as Mrs. Astor, Mrs. Widener, Mrs. H.B. Harris, and an incomplete name suggesting Mrs. Astor's name.\n\nThe article also includes a partial list of the saved passengers, including names such as Mrs. Astor, Mrs. Widener, Mrs. H.B. Harris, and an incomplete name suggesting Mrs. Astor's name".

Whereas the model successfully recognized the context of the newspaper, the event described in the headline, the date and the nature of the image (“a scanned copy”), it quickly went in a loop when describing the names of the missing persons. It seems that it became overwhelmed from the large quantity of information in the article. Clearly, Qwen2-VL-2B has abilities to OCR process a printed newspaper and it can produce a good and brief description and explanation of the image. Still, for the analysis of The New York Times front page, Florence-2-base interestingly enough seems to have produced a more accurate, deep and longer result, where Qwen2-VL-2B managed to correctly describe it but had issues to analyze the entire text.

The analysis of an handwritten English letter by Queen Elizabeth II

The subject of the next analysis by Florence-2-base and Qwen2-VL-2B is a letter, handwritten by future Queen Elizabeth II, on April 24th, 1945.

The analysis of the letter from Queen Elizabeth II with Florence-2-base

The transcription by Florence-2-base of the handwritten letter led to results with different levels of accuracy, according to the instructions given and the level of details asked.

The description of the letter with the "CAPTION" prompt (without any other indications) was the following: "A handwritten letter from a woman to a man". If Florence-2-base managed to recognize the context of the letter and the characteristics of an handwritten text, the "CAPTION" result did not give any more information, neither the sender, the recipient and nor the period. Also, the letter was not addressed to a man, but a friend of then Princess Elizabeth.

The "DETAILED_CAPTION" gave more indications but along with mistakes: "The image shows a letter from Queen Elizabeth II to her husband, Prince Philip, dated April 24th, 1945. The letter is written on a piece of paper with a logo at the top". The model recognized the presence of the heading design, in this case for the Buckingham Palace (which it defined as a logo). The date even if handwritten is absolutely accurate. However, it made an error by refering to the recipient as the Queen’s soon-to-be husband, Prince Philip. It seems Florence-2-base made mistakes while misinterpreting the letter with its own knowledge.

The "MORE_DETAILED_CAPTION" delivered a longer and more complete description, as well as a short transcription. The results were already more accurate: “The image is a photograph of a handwritten letter from the Buckingham Palace in 1945. The letter is addressed to Mary, dated 24th April 1945. It is written in black ink on a yellowed paper with a red seal at the top. The paper appears to be old and worn, with some creases and wrinkles on the edges.\n\nThe letter is signed by Mary, as indicated by the text at the bottom of the page. The text reads, "Dear Mary, I was so delighted to receive you a letter of good wishes for my birthday. Thank you so much for thinking of me. I'm sorry to hear that you are on sick leave, but I hope you are feeling better now. This is rather a good time of year. We have got really good weather, and that we might just finished a mechanical course in the A.T.S. which is which is the ATS”. The main mistake made in this description is the identity of the sender: indeed, the model managed to find it was sent from Buckingham Palace but no mention of the Queen Elizabeth II is made and Florence-2-base identified the recipient and the sender as the same person — Mary. This might have been caused by the fact the portion of the letter used for this example is not signed and led the model to find the sender by itself: “The letter is addressed to Mary…The letter is signed by Mary”. Still, the transcription in general is rather accurate, apart from minor mistakes and missing parts (between "This is a good time of year" and "We have got really weather" and the change of meaning from "I've just finished a mechanics course" in the original letter to "and that we might just finished a mechanical course" in Florence-2-base transcription). The model also went on a loop in the end with the "A.T.S. which is which is the ATS".

The OCR process instruction led to the following result: “24 a April1945.BUCKINGHAM PALACEDear Mary,I was so delighted toreceive your letter of goodwishes, for my birthday. Thankyou so much for thinking of me.I'm sorry to heer that youare on sick leave, but 9 do hopeyou are feeling better ros. Thisis rather a good time of yearto get leave really and won thatwe have got the good weather,are might as well make was of it.I've just finished a mechanicscourse in the A.T.S. which g”.

There are many issues with the spelling and lack of space between words (which may be due to the handwriting). The words seem to have been processed but not correctly understood by the model. Still the transcription, even if less valid than with the "MORE_DETAILED_CAPTION", is readable and generally accurate. The results clearly are less accurate with the example of an handwritten letter.

The analysis of the letter from Queen Elizabeth II with Qwen2-VL-2B

Generally speaking, the analysis of the Queen Elizabeth II letter by Qwen2-VL-2B led to positive results. Even so, there were inaccuracies in the interpretation and transcription. This seems to be due to the fact that when it lacks information to explain an image or text, Qwen2-VL-2B tries to provide the answers based on its own previously obtained knowledge and on the layout and context.

As this example is a letter, it was interesting to notice that the model added without any reason expressions commonly used in formal letters (“Yours sincerely” etc.) even if they were not present in the portion of the letter processed. It seems that by having a cultural bias, Qwen2-VL-2B automatically transcribe the letter with an accurate formal layout, usually found in letters, instead of transcribing exactly what is present in the image.

The first instruction given to Qwen2-VL-2B was to "Transcribe the text in this letter”, to which the model managed to deliver a very accurate transcription of the text. However, it also added informations that were not present in the letter while missing to give more indications about the context:

["Dear Mary,\n\nI was so delighted to receive your letter of good wishes for my birthday. Thank you so much for thinking of me.\n\nI'm sorry to hear that you are on sick leave, but I do hope you are feeling better now. This is rather a good time of year to get leave really and now that we have got some good weather, we might as well make use of it. I've just finished a mechanics course in the A.T.S. which I will be taking up in the next few weeks.\n\nYours sincerely,\n\n[Your Name]”]

As we can see in this description, Qwen2-VL-2B added elements missing from the text: it even mentioned a sender with the anonymous signature “Your Name”.

The second prompt asked for the model to "Explain the image”: "The image is a handwritten letter dated April 24, 1945, from Buckingham Palace. The letter is addressed to Mary and contains a personal message expressing gratitude for her birthday wishes and expressing concern about Mary's health. The letter also mentions that the writer is feeling better and suggests that Mary might take advantage of the good weather. The letter is signed by the sender and includes a Buckingham Palace crest at the top."

The description of the letter is entirely accurate, whether in the elements mentioned or to recognize the location indicated on the paper (the "Buckingham Palace crest at the top") or the date. It did, however, made the same mistake than Florence-2-base. While the text and its context was understood, the identity of its sender (Queen Elizabeth II) was not found. Qwen2-VL-2B gave a correct interpretation of the letter even if it got confused between sender and recipient once again: the "concern about Mary's health" was indeed reinterpreted afterwards as "the writer is feeling better".

In order to push the model to deliver an even more accurate and contextualized description of the Queen Elizabeth II letter, two prompts clearly stating the identity of the sender asked Qwen2-VL-2B to explain and transcribe the letter. Here are the results obtained in these two cases:

With the "Explain this letter written by Queen Elizabeth II" prompt we obtained this result: "This letter is a handwritten note written by Queen Elizabeth II on April 24, 1945. The letter is addressed to Mary, a person who is not identified in the letter. The letter begins with a greeting and expresses the Queen's delight in receiving Mary's letter of good wishes for her birthday. She thanks Mary for thinking of her and mentions that she is sorry to hear that Mary is on sick leave, but she hopes that Mary is feeling better now. The Queen also mentions that it is a good time of year to get leave and suggests that they might take advantage of it. She mentions that she has just finished a mechanics course in the A.T.S., which is a training program for women in the Royal Air Force".

With the "Transcribe this letter written by Queen Elizabeth II" prompt, we obtained this result: "Dear Mary,\n\nI was so delighted to receive your letter of good wishes for my birthday. Thank you so much for thinking of me.\n\nI'm sorry to hear that you are on sick leave, but I do hope you are feeling better now. This is rather a good time of year to get leave really and now that we have got some good weather, we might as well make use of it. I've just finished a mechanics course in the A.T.S. which I find very interesting.\n\nYours sincerely,\nQueen Elizabeth II"

The description and the transcription given by Qwen2-VL-2B with these two prompts led to almost fully accurate results. It is interesting to notice once again that the model has felt the need to add the “Yours sincerely” mention as well as the signature "Queen Elizabeth II" to the transcription, even if it’s not present in our extract of the letter analyzed. The identity of the sender was accurately identified in both cases.

In conclusion to this study of future Queen Elizabeth II’s letter, we can say that even if there were some typing and spelling mistakes and difficulties to identify the sender (either with Florence-2-base or with Qwen2-VL-2B), the results were still overall accurate in transcribing the text, setting the context and finding the correct period and location and when and where the letter was sent - especially when giving more context and instructions to the models.

The analysis of two letters by U.S. President Franklin D. Roosevelt: comparative study of a typed and an handwritten documents

The next images containing text analyzed with Florence-2-base and Qwen2-VL-2B are two letters, both written by President Franklin D. Roosevelt of the United States. The first one, sent to Winston Churchill, is handwritten and will be analyzed by Florence-2-base. The second letter sent to Dr. Albert Einstein has been typed and will be analyzed by Qwen2-VL-2B. It should be interesting to compare how both models fare with different kinds of written documents.

The analysis of the letter from Franklin D. Roosevelt to Winston Churchill by Florence-2-base

Florence-2-base analysis of the letter from President Roosevelt to Winston Churchill led to mixed results. This might have been caused by the difficulties encountered when transcribing an handwritten and not very easily readable letter. As we have seen through past examples, it is also interesting to study how the model adapts when faced with a lack of context — leading it to base its interpretations on its own knowledge (often bringing confusion and inaccuracies).

The first mistake was found in the first attempt at describing the text. The "CAPTION" prompt led to the following description: "A handwritten letter from the president of the united states, john f Kennedy". If the context of an handwritten letter is correct, as well as recognizing its sender as the President of the United States, the model mistakenly identified him as “john f Kennedy” (with a mix of uppercase and lowercase letters and punctuation issues).

The results with the "DETAILED_CAPTION" command were partly more accurate and partly more confused: "The image shows a handwritten letter from President Franklin D. Roosevelt to his wife, Mrs. Roosevelt. The letter is written on a piece of paper with a black ink pen, and the text is written in a neat cursive script. The paper is slightly yellowed with age, and there is a signature at the bottom of the page." The model accurately identified President Roosevelt as the sender and made correct observations concerning the description of the image and the layout of the letter. However, it made the mistake to identify the recipient as "Mrs. Roosevelt".

The results with the "MORE_DETAILED_CAPTION" seem to be based on Florence-2-base false assumptions, leading to a large description but without making much sense: "The image is a photograph of a handwritten letter from the President of the United States, John F. Kennedy. The letter is addressed to the White House and is dated June 20, 1864. The paper is yellowed with age and the handwriting is neat and legible.\n\nThe letter is written in black ink on a white background and is signed by Kennedy at the bottom right corner. The text reads, "Dear Kennedy, I will give you your support. He is truly helping us to support our efforts and we will never forget. I thank you for your support and support. I am so grateful for the support we have for you. I will always be there for you."\n\nAt the bottom of the letter, there is a signature from Kennedy, indicating that it is an autograph letter". There is a such high number of errors made by the model when processing and analyzing this text, it is difficulty to know where to start. First, Florence-2-base made the mistake to identify the sender as President John F. Kennedy again. But this time, it also identified the recipient of the letter as Kennedy himself: "Dear Kennedy, I will give you your support... I will always be there for you". The model indicates the letter was sent to the White House (even if it was in fact sent from there). It also incorrectly dates the text as from "June 20, 1864".

The results to the OCR prompt instruction were the following: 'THE WHITE HOUSETHE WHITE HOUSEDen Chnchill.11Middll Wllkis will gour youThi - H. is Truly helping Storpholitis and wer here.I knad thie vere ypplis is to gompofth as it dess B is.Sail on, oll sht of State.Sait an oh Womom wolong and post.Homenuty with all it is fear.I will all the hope of futine yearsIs hanging brattless on thy faltas zer youm". Even if the text is in English, the model had difficulties to decipher, understand and transcribe the handwritten letter. Obviously, this might have been caused by the fact this letter is not only handwritten but fairly difficult to read (objectively more than in the previous examples of this study). The transcription obtained with the OCR prompt makes no sense.

The analysis of the letter from Franklin D. Roosevelt to Albert Einstein by Qwen2-VL-2B

The second letter by President Franklin Roosevelt analyzed by Qwen2-VL-2B for this article is a different kind of letter: instead of a handwritten text, this is a typed document and the recipient in this case is Dr. Albert Einstein. Obviously, the results were much more accurate with a typed letter, which is always much easier to process for any model (whether Florence-2-base or Qwen2-VL-2B as we have seen in previous examples). For this letter, Qwen2-VL-2B still managed to make some mistakes, as we are going to see.

The first instruction given to Qwen2-VL-2B was to “Explain the text in the image”. Here is the description delivered: "The image is a scanned letter from Dr. Albert Einstein to Professor, dated October 19, 1939. The letter is addressed to the recipient, who is not specified, and contains a formal tone. The letter begins with a greeting and a request for a thank you letter from the recipient. The letter mentions the importance of the recipient's recent letter and the importance of the enclosed data. The letter also mentions the Bureau of Standards and a chosen representative of the Army and Navy to thoroughly investigate the possibilities of uranium. The recipient expresses gratitude for the letter and expresses the importance of the subject matter. The letter ends with a request for the recipient to accept the letter and express their sincere thanks".

For this first analysis, Qwen2-VL-2B managed to indicate the accurate date. The description of the letter shows it did understand the meaning of the text. Still, there is one big mistake made by identifying the sender as Albert Einstein himself, instead of the recipient. This led to the confusion as to whom is the recipient “who is not specified”.

The transcription of the letter by Qwen2-VL-2B was almost entirely accurate, showing the benefits of OCR processing with a typed document: "THE WHITE HOUSE\nWASHINGTON\n\nOctober 19, 1939\n\nMy dear Professor:\n\nI want to thank you for your recent letter and the most interesting and important enclosure.\n\nI found this data of such import that I have convened a Board consisting of the head of the Bureau of Standards and a chosen representative of the Army and Navy to thoroughly investigate the possibilities of your suggestion regarding the element of uranium.\n\nI am glad to say that Dr. Sachs will cooperate and work with this Committee and I feel this is the most practical and effective method of dealing with the subject.\n\nPlease accept my sincere thanks.\n\nVery sincerely yours,\n\nDr. Albert Einstein,\n\nOld Grove Road,\n\nNassau Point,\n\nPeconic, Long Island,\n\nNew York."

The mistake made in this transcription is once again the misinterpretation that Albert Einstein is the sender. Indeed, the model even indicated his name as the signature. However, the rest of the content in the letter is accurate.

The next instruction given to Qwen2-VL-2B was to "Explain this letter from the President of the United States Roosevelt to Professor Albert Einstein". For this particular case, the additional indications seemed to have helped the model to contextualize the document. It successfully identified the sender and recipient and even gave an accurate description of its content:

"This letter is a formal request from President Roosevelt to Albert Einstein, the renowned physicist and Nobel laureate, for his expertise and expertise in the field of nuclear physics. The letter is dated October 19, 1939, and it is addressed to Professor Einstein at his home in New York. The President expresses his gratitude for his recent letter and expresses his desire to have Professor Einstein participate in a committee to investigate the possibilities of using uranium for nuclear energy. The President also requests Professor Einstein's cooperation in the work of this committee. The letter ends with the President's signature and the words "Very sincerely yours"".

This analysis of two letters by President Roosevelt, both in English, one handwritten and analyzed by Florence-2-base and one typed and analyzed by Qwen2-VL-2B, lead to the following conclusions: the results of the analysis of the typed document were much more accurate, concerning the content of the text. Yet, the lack of clarity or differences in the layout can trigger mistakes to identify key elements in the letter (recipient, sender, location, date etc.). Also, the models’ previous knowledge can bring to misinterpretation when analyzing the document.

The analysis of text in visual arts: the study of NOTARY by Jean-Michel Basquiat

After analyzing and processing images containing a majority of textual elements (such as letters, newspapers, books etc.), we are going to study an image (in this case a painting) containing only few elements of text. In this case, the artwork is a canvas by Jean-Michel Basquiat, titled NOTARY, dated from 1983 and exhibited in the Princeton University Art Museum.

This painting is interesting to process with models such as Florence-2-base and Qwen2-VL-2B since the textual elements are disseminated throughout the artwork.

The analysis of the text in Jean-Michel Basquiat's NOTARY by Florence-2-base

With Florence-2-base, the result to analyze the Basquiat canvas with the “CAPTION” command was the following: “jean-michel basquiat, untitled, 1982”. The model successfully identified by itself the name of the artist, even if it did not find the name of the artwork (but many paintings by Basquiat being titled “untitled”, this is not a big mistake). The date is almost correct — to a year.

The “DETAILED_CAPTION” led to the following result: "The image shows Jean-Michel Basquiat's 1982 painting, 'Untitled', which is composed of a vibrant mix of colors and text on a black background. The painting is composed composed of abstract shapes and lines, creating a unique and captivating composition". As we can see, the model still base its answer on its own previous knowledge of the artist and on the same assumptions (concerning the title of the painting between others) than in the previous caption. The aesthetical description itself is brief yet rather accurate.

Interestingly enough, the “MORE_DETAILED_CAPTION” delivered an even longer description of the painting, while leaving out the name of the artist, of the artwork of the date: "The image is a collage of various artworks and text. The background is black, and the main focus is a large, abstract painting with a variety of colors and shapes. The painting is composed of different elements such as lines, shapes, and lines that create a chaotic and chaotic composition.\n\nIn the center of the painting, there is a figure with a large head and a long neck, wearing a red and black striped shirt. The figure appears to be standing in front of a large window with a view of a city skyline. On the left side of the image, there are several smaller figures, including a man with a skull and a woman with a red hat, both of whom are looking up at the sky. The figures are surrounded by various symbols and text that appear to be related to the city, such as a red circle, a black circle, and a white circle with the word "PLUTO" written on it. \n\nOn the right side of this image, the figure is standing with its arms crossed and its head tilted slightly to the side, as if it is looking off into the distance. The text on the painting reads "This is not for all debts, public debt, and public debt."\n\nThere are also several smaller illustrations and text scattered throughout the painting in different colors and sizes, including red, yellow, green, blue, and black. The overall effect is one of chaos and destruction, with a sense of unease and unease".

If some of the elements described here are not actually present (for example, the “woman with a red hat”), some other are fully accurate, such as the "figure with a large head and a long neck", the circles or the mentions of the word "PLUTO" and of the phrase "This not all debts". This shows the model managed to partly understand the artwork.

With the OCR task prompt to transcribe all written elements present in the painting, Florence-2-base delivered the following result: “NOTARYCASCODIMARITPLUTOLOFLEASDVMARISSTUDY OF THE150.MALE TORSOFLUTOLUTO.150DEHYDRATHSICKLESDEHYORAMATTOCKS0FEDTHIS NITEFOR ALL DEBTS4LEECHESSALTPUBLIC+PRIVATEG48LEECHESBUCKLE,ROMANITES47.LEECCHES.P504 BRONZER”.

While the words do not make sense put together without spacing between them in the same sentence, the model did successfully manage to find, decipher and transcribe all the written words and letters disseminated in the artwork. This suggests the model might not have understood each word individually and transcribed the whole series of letters together.

The analysis of the text in Jean-Michel Basquiat's NOTARY by Qwen2-VL-2B

Basquiat's NOTARY was then processed by Qwen2-VL-2B as well.

In a first attempt to process the text in this image, we asked the model to "Explain the image and the text in the image” which led to this result: "T"he image is a large, abstract painting featuring a variety of colorful and chaotic elements. The painting is divided into four sections, each containing different types of text and symbols. Here is a detailed description of the elements present in the image:\n\n1. **Top Left Section:**\n - Text: "NOTARY CASCO"\n - Symbols: A red and white circle with a black outline, resembling a notary seal.\n - Background: A light blue or greenish background with some white and black lines.\n\n2. **Top Right Section:**\n - Text: "PLUTO"\n - Symbols: A black and white image of a planet with a red and white circle around it.\n - Background: A dark blue or black background with some white and black lines.\n\n3. **Middle Left Section:**\n - Text: "DV MARIS"\n - Symbols: A black and white image of a skull with a red and white mouth.\n - Background: A light blue or greenish background with some white and black lines.\n\n4. **Middle Right Section:**\n - Text: "FLEAS"\n - Symbols: A black and white image of a flea.\n - Background: A dark blue or black background with some white and black lines.\n\n5. **Bottom Left Section:**\n - Text: "SALT"\n - Symbols: A red and white circle with a black outline.\n - Background: A light blue or greenish background with some white and black lines.\n\n6. **Bottom Right Section:**\n - Text: "DEHYDRATED"\n - Symbols: A black and white image of a person with a red and white mouth.\n - Background: A dark blue or black background with some white and black lines.\n\nThe painting appears to be a complex and abstract representation of various symbols and text, possibly alluding to themes of mythology, nature, and human anatomy. The use of different colors and symbols creates a sense of chaos and complexity, making it difficult to discern any clear message or narrative. The overall effect is a visually stimulating and thought-provoking piece of art".

As opposed to Florence-2-base, Qwen2-VL-2B did not manage to recognize the name of the artist or to try and identify the painting. But it did a good job to transcribe all the words present in the artwork while categorizing and organizing them, also situating them in the space of the painting while differentiating and describing the text, symbols etc… The transcription of the text in this artwork was very successful as the words were fully identified, understood and differentiated.

In a second attempt at transcription, the model was tasked to "Transcribe the text in the image": "The text in the image is:\n\n- Notary Casco\n- PLUTO\n- SALT\n- DEHYDRATED\n- SICKLES MATTOCKS\n- 46. LEECHES\n- 47. LEECHES\n- SITES\n- THIS NOTE FOR ALL DEBTS PUBLIC+PRIVATE""

As we can see, Qwen2-VL-2B delivered a very short yet accurate description of all the written words but without situating them this time in the image or describing other elements (symbols, shapes etc.).

In a third prompt, we asked Qwen2-VL-2B to "Analyze the meaning of the text in this painting by Jean-Michel Basquiat": "The painting by Jean-Michel Basquiat, titled "Study of the Male Torso," is a complex and abstract work that combines various elements of art and text. Here is a detailed analysis of the text in the painting:\n\n1. Notary Casco: This text appears to be a reference to a notary public or a notary's office. It could be a nod to the notary's role in verifying documents or the legal aspects of the painting.\n\n2. SICKLES MATTOCKS: This text is likely a reference to the artist's signature or a signature that appears in the painting. It could be a playful or artistic way of incorporating the artist's name into the artwork.\n\n3. DEHYDRATED: This text is likely a reference to the artist's signature or a signature that appears in the painting. It could be a playful or artistic way of incorporating the artist's name into the artwork.\n\n4. SALT: This text appears to be a reference to the artist's signature or a signature that appears in the painting. It could be a playful or artistic way of incorporating the artist's name into the artwork.\n\n5. PLUTO: This text appears to be a reference to the artist's signature or a signature that appears in the painting. It could be a playful or artistic way of incorporating the artist's name into the artwork.\n\n6. DEHYDRATED: This text appears to be a reference to the artist's signature or a signature that appears in the painting. It could be a playful or artistic way of incorporating the artist's name into the artwork.\n\n7. SICKLES MATTOCKS: This text appears to be a reference to the artist's signature or a signature that appears in the painting. It could be a playful or artistic way of incorporating the artist's name into the artwork.\n\n8. SALT: This text appears to be a reference to the artist's signature or a signature that appears in the painting. It could be a playful or artistic way of incorporating the artist's name into the artwork.\n\n9” (and so on)".

The answer delivered was very long, finally went in a loop at the end and did not manage to actually explain the meaning behind the text, only giving the same brief random explanation for every word. Interestingly enough, with the instruction indicating the painting is by Jean-Michel Basquiat, Qwen2-VL-2B tried to identify the name of the painting as well. In this case, it indicated it was "Study of the Male Torso" (which is the name of another painting by Basquiat). This shows once again the model abilities to find information in its own previous knowledge to interpret new content it does not have a lot of context on.

To push the model to try and analyze the meaning of the text and painting, we sent it the following questions and instructions: "What is the meaning of text in this image by Jean-Michel Basquiat? Is this in relation to its street art style? Explain".

The answer Qwen2-VL-2B gave was accurate on the surface but contained misinterpretations and was not very deep: "The text in the image by Jean-Michel Basquiat is a mix of various phrases and symbols, which may not have a direct meaning in the context of the artwork. However, it is clear that the text is meant to be read and interpreted by the viewer. The text appears to be a mix of words and symbols, which could be related to the themes and messages of the artwork. The street art style of Basquiat often incorporates elements of everyday language and street slang, which can be used to convey messages or express emotions".

The analysis of text in a Coca-Cola advertisement

The last image that will be analyzed in this AI OCR process study is a more recent: in this case, it is an ad produced by U.S. brand Coca-Cola, which dates from 2013. It is interesting to compare this image with our other examples as it is much simpler to understand with its very contrasting colors and short, large typed elements of text.

The analysis of the Coca-Cola ad by Florence-2-base

The first analysis of the ad was made with Florence-2-base.

The "CAPTION" command delivered the following result: "A group of people walking on a field with a coca cola sign". As it can be seen, the model processed and understood the text much more easily since it is much clearer to decipher.

The results to the "DETAILED_CAPTION" were the following: 'The image shows a group of people standing on a grassy field with trees in the background. On the left side of the image, there is text that reads "Coca-Cola - Help your favorite park win a $100,000 recreation grant”. The transcription is fully successful and complete.

With the 'MORE_DETAILED_CAPTION', we obtained the following description: 'The image shows a group of four people, two women and two children, walking on a grassy field. The woman on the left is wearing a white dress with a floral pattern and is holding the child in her arms. The child on the right is wearing blue shorts and a white t-shirt with the number 1 on it. They are all smiling and appear to be happy. The background shows trees and a blue sky. On the top left corner of the image, there is a red banner with the Coca-Cola logo and text that reads "Help your favorite park win a $100,000 recreation grant. Vote now".

Florence-2-base not only processed and understood the textual and visual content correctly, it also described the impression the image wants to give ("They are all smiling and appear to be happy").

The OCR task prompt led to a successful result, if not with the correct spacing between the others: “CocaColaHelp yourfavorite parkwin a $100,000recreation grant.VOTE NOW”.

The analysis of the Coca-Cola ad by Qwen2-VL-2B

The analysis of the Coca Cola ad with Qwen2-VL-2B was highly successful as well, showing the ease for the model to process a simpler content with large, easily readable letters.

The first instruction given to the model was to "Transcribe the text in the image" which led to this result: "Coca-Cola Help your favorite park win a $100,000 recreation grant. Vote now". The second instruction was to "Explain the text in the image”, to which the model gave its understanding of the ad: “The image is promoting a Coca-Cola campaign to help a favorite park win a $100,000 recreation grant. The text encourages viewers to vote for their favorite park”. The third instruction to “Explain the image” led to an interesting and accurate analysis: "The image is an advertisement for Coca-Cola, promoting a contest to help a favorite park win a $100,000 recreation grant. The text in the image reads, "Help your favorite park win a $100,000 recreation grant." Below this, there is a call to action that says, "VOTE NOW."" The background of the image features a red and green color scheme, with a woman and two children walking on a grassy field. The Coca-Cola logo is prominently displayed in the top left corner of the image".

Qwen2-VL-2B not only successfully transcribed and processed the ad, it even acknowledged the prominent display of the Coca-Cola logo in the image, the call to action and clearly stated it was an avertisement and what its goal was. This not only shows Qwen2-VL-2B great understanding for simpler images and textual content but it also highlights its abilities to understand advertisement and marketing strategies.

Conclusion

The goal of this study was to analyze how VLM models such as Florence-2-base and Qwen2-VL-2B managed to OCR process images containing text. For this purpose, we have processed diverse types of documents such as newspaper articles, letters (both handwritten and typed), artworks and advertisements.

We can draw the following conclusion from this study. First and foremost, it is clear that models such as Florence-2-base and Qwen2-VL-2B both have highly qualified abilities to process and analyze written text. This can be clear in the examples we have seen, especially concerning typed documents with a simple layout (and preferably in English). That being said, both models endured difficulties to process text in images when it was either harder to read, for example handwritten, less contextualized or in an another language than English. Two examples in this study were in Latin and in French. In both cases, the models (especially Florence-2-based) had troubles to process and analyze the text. When it comes to contextualization, even when given in instructions, both models seemed to base their analysis on their own "personal" knowledge, which led to misinterpretations with the original text in the image. For example, during the study of future Queen Elizabeth II letter, it was interesting to notice that while processing the text, the model described it as a letter to her husband, the Prince Philip (even if it was not the case, and they were not even married at the time). Also, when processing the letter from President Franklin Roosevelt, Florence-2-base made the same mistake twice of interpreting the letter as being written by President John F. Kennedy — probably making assumptions from the White House heading at the top and the style/period of the letter. Another issue encountered in this study was the difficult for both models to comprehend the layout in diverse types of letters. Both models made mistakes to identify the sender, mistaken at times with the recipient and vice-versa. Both Florence-2-base and Qwen2-VL-2B got confused by the position of the heading, the lack of signature or the variations in layout from one country to another and in different historical periods. Another thing that attracted our attention in the results is that when giving a lot of instructions, elements of context and informations about the text and the image in general the results were not more accurate but actually quite the contrary: the models seem to get overwhelmed with too much information and make false assumptions on the text based on what had been said. For instance, when analyzing an handwritten letter without indications about the sender, Qwen2-VL-2B modified the content of the original text after receiving the information the letter was written by Jane Austen. It then added several references to a “Mr Bingley”, not even mentioned in this particular letter, but making a reference to a character from Jane Austen’s Pride and Prejudice.

Even with these flaws in the results, it can be said without a doubt that OCR recognition and process by models such as Florence-2-base and Qwen2-VL-2B is efficient, even if the results highly depend on the image and on the context given. With the example of the U.S. Constitution, it was impressive how Qwen2-VL-2B managed to deliver an exact transcription of the text based on an extract and to contextualize it perfectly.

More fine-tuning and training to adapt the models to diverse layouts might help to obtain even better results. Also to increase the models abilities to process text in a larger variety of languages. But as it is today, both models successfully managed to deliver transcriptions of a large variety of written documents.

That being said, it would be interesting to compare the results obtained with Florence-2-base and Qwen2-VL-2B alongside newer, highly effective VLM models, such as Rhymes AI Aria and Mistral Pixtral-12B. The current method to use the models in this study, with the free GPU included on Google Colab, has made it difficult to lead the same textual and visual analysis with Aria and Pixtral. However, it might be interesting to lead a similar art and text analysis with the newly released Multimodal model Deepseek Janus-1.3B - which will be the subject for my next article.

Ask

How VLM models such as Florence-2-base and Qwen2-VL-2B manage to process and understand text in image?

The models we used for this study managed to process and understand text in images with different prompts and task instructions. In the case of Florence-2-base, the textual analysis was done with the “CAPTION”, “DETAILED_CAPTION”, “MORE_DETAILED_CAPTION” and “OCR” commands. With Qwen2-VL-2B, the transcription process was done with instructions such as to “Transcribe the text in the image” and “Explain the text in the image”. In some cases, specific informations were added to give more context to the models.

Which transcriptions results were better? For which reasons?

The best transcriptions made, either with Florence-2-base or with Qwen2-VL-2B, were the ones based on images that were easier to read, understand and process. For example, the Coca-Cola ad obtained the best results of the whole study, being a very clear and striking image with large typed contrasted letters. Also, the typed documents such as the New York Times front page article and the typed letter from President Roosevelt to Dr. Albert Einstein obtained the most accurate transcriptions. Highly well-known images, such as the one containing the text of the U.S. Constitution, were also better transcribed by both models.

What are the main difficulties encountered in this study?

The main difficulties encountered in this study were the following: the language used gave highly diverse results from one text to another, as both models struggled to transcribed texts in French and Latin but fared very well with texts in English. Another issue was the type of text in the image: in general, handwritten text was much more difficult to process by the models than when it was typed. Also, there were issues with the differences in layout from one image to another, leading to misinterpretations and confusion in the results. Also, the “personal” own knowledge of each model biased the results of some of the transcription and led to comprehension mistakes. Finally, adding too much context in the task instructions proved to cause inaccuracies in the results.

Which types of text were the easiest to process? Which were the most difficult?

The easiest types of text were the ones that were clearer for the model to OCR process and analyze. Typed text with a clear and simple layout were transcribed much more easily. On the opposite, handwritten, especially cursive, text were more difficult to understand, either with Florence-2-base or with Qwen2-VL-2B. The lack of a standard layout and of information giving context within the text also led to several difficulties.

What could be done to improve the OCR transcription by AI?

In order to improve the OCR process and transcription by AI, it should be useful to fine-tune and train the models even more, in order to get them used to diverse layouts, kinds of text both printed and handwritten and familiarize them to a larger range of languages.

Bibliography.

Book of hours. In latin and French, illuminated parchment manuscript. West of France, Brittany (Nantes?), circa 1450.
The Constitution of the United States, 1787
Jane Austen, Letter to her niece Anna Lefroy Nov. 29 1814
Vincent Van Gogh, Letter to Theo Van Gogh, Arles, Nov. 21, 1888, Van Gogh Letters (https://www.vangoghletters.org/vg/letters/let722/letter.html#original)
“Titanic Sinks Four Hours After Hitting Iceberg”, from The New York Times, April 16, 1912, Wikimedia Commons, (https://upload.wikimedia.org/wikipedia/commons/0/04/Titanic-NYT.jpg)
Franklin D. Roosevelt, Letter to Winston Churchill, January 20, 1941, Library of Congress (https://www.loc.gov/exhibits/churchill/interactive/_html/wc0112.html)
Franklin D. Roosevelt, Letter to Dr. Albert Einstein, October 19, 1939, Wikimedia Commons (https://commons.wikimedia.org/wiki/File:Roosevelt-einstein-letter.png)
Queen Elizabeth II, Letter to Mary, April 24, 1945 (https://www.express.co.uk/news/royal/1668073/queen-handwritten-letter-friend-world-war-two-spt)
Jean-Michel Basquiat (1960-1988), NOTARY, 1983, Princeton University Art Museum
Coca-Cola ad, 2013, Wikimedia Commons (https://commons.wikimedia.org/wiki/File:Coca-Cola_advertisement_%288980379083%29.jpg)

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote