Realtime Text Recognition with VIM4 and IMX415 MIPI Camera

How do I know which video index is my Khadas MIPI IMX camera?

Or is there any way I can enable GStreamer in my OpenCV build and try it out?

On a side note, when I run this, it works fine and my camera displays the video feed in a window:

gst-launch-1.0 -v v4l2src device=/dev/media0 io-mode=mmap ! video/x-raw,format=NV12,width=3840,height=2160,framerate=30/1 ! fpsdisplaysink video-sink=waylandsink sync=false text-overlay=false
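For reference on the GStreamer question above: one way to check whether an OpenCV build was compiled with GStreamer support is to grep its build information:

$ python3 -c "import cv2; print(cv2.getBuildInformation())" | grep -i gstreamer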

Hi Louis,

I’m able to get the video feed from my camera already (device 63). The problem was that I missed typing

export QT_QPA_PLATFORM=xcb

And I should not remove

cap.set(3, 1920)  # 3 = cv2.CAP_PROP_FRAME_WIDTH
cap.set(4, 1080)  # 4 = cv2.CAP_PROP_FRAME_HEIGHT

However, the detection is not working properly.

Based on this image, it’s not detecting any text.


Based on this image, it detected UGREEN, but the detection flickers and is not always able to find the word. Apart from that, the other phrase, PD Fast Charger, is not detected.

Hello @JietChoo ,

This is a simple demo for display only; it may not have high precision in every scenario. We suggest referring to the official Paddle code and modifying the demo code to get better performance in different scenarios.

Hi Louis, alright. If I use other OCR models, will it work? Let’s say I want to use EasyOCR:

https://github.com/JaidedAI/EasyOCR

Will it work with the PPOCR KSNN demo?

Hello @JietChoo ,

Sorry, the postprocessing of the EasyOCR model is different from PPOCR’s. If you want to use it, you need to change the preprocessing and postprocessing code.

You can try quantizing the ppocr model to int16 and decreasing the box threshold to improve the performance.

The box threshold is set in ppocr/postprocess.py, line 6.
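The line looks roughly like this (a sketch, not the exact file contents; the name and default value in your copy may differ):

# ppocr/postprocess.py, line 6 (illustrative)
box_thresh = 0.6  # lower this, e.g. to 0.3, to keep weaker detection boxes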

About the limitations of this demo:

  1. The demo does not include an algorithm for handling sloped characters, so only horizontal text fields can be detected.
  2. Because of the ppocr_rec model’s input limit, very long fields are often recognized incorrectly. You can change the model input shape to deal with this: a model with a wider input width can recognize longer fields (see the sketch after this list).
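As a sketch of why width matters (illustrative numbers, not the demo’s exact preprocessing): rec models resize each cropped field to a fixed input height and cap the width, so long fields get squashed.

import cv2

def resize_for_rec(crop, target_h=48, max_w=320):
    # Keep the aspect ratio at a fixed input height; anything wider than
    # max_w is squashed horizontally, which is what breaks long fields.
    h, w = crop.shape[:2]
    new_w = min(int(w * target_h / h), max_w)
    return cv2.resize(crop, (new_w, target_h))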

Hi Louis, I found another issue:

Ignoring XDG_SESSION_TYPE=wayland on Gnome. Use QT_QPA_PLATFORM=wayland to run on Wayland anyway

And my frame colours are weird.

Hello @JietChoo ,

Have you modified anything? And what steps did you perform?

I don’t think I modified anything.

Today I ran the same command:

python3 ppocr-cap.py --det_model ./models/VIM4/ppocr_det_int8.adla --det_library ./libs/libnn_ppocr_det.so --rec_model ./models/VIM4/ppocr_rec_int16.adla --rec_library ./libs/libnn_ppocr_rec.so --device 63

And I got this error again. I don’t understand why:

 |---+ KSNN Version: v1.4.1 +---| 
Start init neural network ...
adla usr space 1.4.0.2
adla usr space 1.4.0.2
Done.
[ WARN:0@10.218] global cap_v4l.cpp:1136 tryIoctl VIDEOIO(V4L2:/dev/video63): select() timeout.
Traceback (most recent call last):
  File "/home/khadas/ksnn-vim4/examples/ppocr/ppocr-cap.py", line 103, in <module>
    det_img = cv.resize(orig_img, (736, 736)).astype(np.float32)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
cv2.error: OpenCV(4.10.0) /io/opencv/modules/imgproc/src/resize.cpp:4152: error: (-215:Assertion failed) !ssize.empty() in function 'resize'
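From the traceback, cv.resize fails because cap.read() returned an empty frame after the select() timeout. A guard like this (my own sketch around the demo’s capture loop, not the original code) would at least avoid the crash:

ret, orig_img = cap.read()
if not ret or orig_img is None or orig_img.size == 0:
    # V4L2 select() timed out and no frame arrived; skip this iteration
    print("empty frame from camera, retrying ...")
    continue
det_img = cv.resize(orig_img, (736, 736)).astype(np.float32)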

Our goal is to detect letters on small alphabet cards like this.

We have developed a web application that runs gamified multiple-choice questions. The answers come from the VIM4 text-recognition results: the OCR output is sent to our server, and the web application (client end) fetches these results from the server and processes them to determine whether the user has shown the correct alphabet card. This all takes place in a school classroom: the camera is set up at the front of the classroom, and a monitor at the front displays the gamified application to the students, who raise the alphabet cards. In the real scenario, the cards will be larger.
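As a rough illustration of that flow (the endpoint and field names here are hypothetical, not our real API):

import requests

def publish_result(camera_id, letter, color):
    # Hypothetical endpoint: the web app polls the server for these results
    requests.post("https://example.com/api/ocr-results",
                  json={"camera": camera_id, "letter": letter, "color": color})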

Right now, I’m unable to run the camera using the KSNN code. Another thing is that the detection of the alphabet card does not seem to work as expected (I tried it yesterday). We have to get this solution up as soon as possible.

Our alternative is to fall back to a PC with an NVIDIA GPU, using OCR solutions that support CUDA. We already have a working prototype for that, and the OCR detects the letters without issue.

However, we wish to use the Khadas VIM4 for this solution, as we have used Khadas Edges and VIM4s in our other IoT projects (we also provide IoT solutions and automation), and we believe Khadas could help us here as well. Also, the VIM4 has an NPU, which we believe can give better OCR performance.

If the above issues are solved, we are more than happy to deploy more of this to our clients’ classrooms.


Now the frame is like this:

I closed the session and ran it again; the frame now looks like this:


I don’t understand why it keeps changing; I did not change anything. I just closed and reran the command a few times, and the frames keep coming out differently.

The letter is not detected.

Hello @JietChoo ,

Are the letters really that small in the picture? I tried detection on a PC with the original ppocr model and it also cannot detect them. The official model may not have been designed with single-letter detection in mind, so its training data probably does not include many pictures like the ones you provide.

All available pretrained models usually target common scenarios. So for a special scenario where a model performs badly, the solution is to train the model on that special scenario.

Actually, if you only want to detect a single letter, OCR is not the only option; YOLO can also do it.
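For example, a rough sketch with the ultralytics package (the package choice, dataset file, and class layout here are assumptions, not something we provide):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")                    # start from a pretrained checkpoint
model.train(data="letters.yaml", epochs=100)  # custom dataset with 26 classes, A-Z
results = model("card.jpg")                   # each detected box's class is a letter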

And are you using a USB camera or a MIPI camera?

Myself, I would try a different camera before giving up on the VIM.

We have a BRIO 510 USB camera and the images are very crisp; that is important. IMX219-class cameras are not good enough for high-detail work. We are running a Jetson Orin Nano for edge detection, and even with a USB camera it is very good.

Hi @Louis-Cheng-Liu

  1. Below is the Python script on my Windows PC. I’ve imported the PaddleOCR package and run it:
import cv2
import numpy as np
import color_detection  # our own helper module for classifying the card colour
from paddleocr import PaddleOCR

# Also switch the language by modifying the lang parameter
ocr = PaddleOCR(lang="en")  # The model file will be downloaded automatically when executed for the first time

cap = cv2.VideoCapture(0)
# cap.set(cv2.CAP_PROP_FRAME_WIDTH, 2880)
# cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1800)

if not cap.isOpened():
    exit()

frame_counter = 0

while True:
    ret, frame = cap.read()
    if ret and frame is not None and frame.size > 0:
        h, w, c = frame.shape  # shape is (height, width, channels)
        frame_counter += 1
        print(frame_counter)
        result = ocr.ocr(frame)
        for line in result:
            print(line)
        for bboxtextprobtuplearray in result:
            if bboxtextprobtuplearray is not None:
                bbox = bboxtextprobtuplearray[0][0]
                text = bboxtextprobtuplearray[0][1][0]
                prob = bboxtextprobtuplearray[0][1][1]
                print(f"Text: {text}")
                print(f"Prob: {prob}")
                if prob > 0.7:
                    top_left = bbox[0]
                    top_right = bbox[1]
                    bottom_right = bbox[2]
                    bottom_left = bbox[3]
                    if len(top_left) == 2 and len(top_right) == 2 and len(bottom_right) == 2 and len(bottom_left) == 2:
                        # Crop the detected text region and classify its colour
                        alphabet_image = frame[int(top_left[1]):int(bottom_left[1]), int(top_left[0]):int(top_right[0])]
                        color_name, result_mask, largest_pixel_count = color_detection.detect(alphabet_image)
                        color = (255, 255, 255)
                        if result_mask is not None:
                            if color_name == "red":
                                color = (0, 0, 255)
                            elif color_name == "yellow":
                                color = (0, 255, 255)
                            elif color_name == "blue":
                                color = (255, 0, 0)
                            elif color_name == "green":
                                color = (0, 255, 0)
                            cv2.putText(frame, f"{text},{color_name}", (int(top_left[0]), int(top_left[1] - 10)), cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2)
                        else:
                            cv2.putText(frame, f"{text},None", (int(top_left[0]), int(top_left[1] - 10)), cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2)
                        cv2.rectangle(frame, tuple(map(int, top_left)), tuple(map(int, bottom_right)), color, 2)
    else:
        # No frame available: show a small black placeholder instead of crashing
        frame = np.zeros((100, 100, 3), dtype=np.uint8)

    cv2.imshow("Cam", frame)

    if cv2.waitKey(1) == ord("q"):
        break

cv2.destroyAllWindows()
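For reference, ocr.ocr(frame) returns one list per input image, and each detected line inside it is [bbox, (text, confidence)]; that is why the script indexes result[0][0]. The values below are illustrative:

# result = [                                  # one entry per input image
#     [                                       # lines detected in that image
#         [[[x1, y1], [x2, y2], [x3, y3], [x4, y4]], ("A", 0.98)],
#     ],
# ]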

I’m able to detect the letters properly.


For Windows, I’m using my webcam.

Another thing: I also tried the EasyOCR model on my Windows PC with the NVIDIA GPU; it also runs smoothly and is able to detect the letters.

Another reason we wish to use the VIM4 is that it is more portable in classroom environments.

  1. Do you have YOLO demos for detecting a single letter?

  2. I’m using the MIPI IMX415 camera with the Khadas VIM4:
    https://www.khadas.com/product-page/imx415-camera

  3. Another thing: the dataset for the KSNN conversion is in txt format. Does that mean I need to put the testing image file paths in the txt file?
    Does the format of the text file need to be like this?

Hello @JietChoo ,

There is an exception when using MIPI in a Python virtual environment. We have made a deb package that installs the Python libs PPOCR needs. I have tested that the MIPI camera works with it.

Deb download link:
https://dl.khadas.com/.test/ksnn-vim4_1.4.1_py3-noble_arm64.deb

Install

$ sudo apt update
$ sudo apt install python3-opencv
$ sudo dpkg -i ksnn-vim4_1.4.1_py3-noble_arm64.deb
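After installing, you can check that the package is visible to the system Python (outside any virtual environment), for example:

$ python3 -c "from ksnn.api import KSNN; print('ksnn ok')"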

Modify ppocr-cap.py as follows:

- cap = cv.VideoCapture(int(cap_num))
+ pipeline = "v4l2src device=/dev/media0 io-mode=dmabuf ! queue ! video/x-raw,format=YUY2,framerate=30/1 ! queue ! videoconvert ! appsink"
+ cap = cv.VideoCapture(pipeline, cv.CAP_GSTREAMER)
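You can also test the pipeline on its own, outside the demo (a minimal sketch, assuming the same /dev/media0 node as above):

import cv2 as cv

pipeline = ("v4l2src device=/dev/media0 io-mode=dmabuf ! queue ! "
            "video/x-raw,format=YUY2,framerate=30/1 ! queue ! videoconvert ! appsink")
cap = cv.VideoCapture(pipeline, cv.CAP_GSTREAMER)
ok, frame = cap.read()
print(ok, None if frame is None else frame.shape)  # expect True and an (h, w, 3) shape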

Then run:

$ export QT_QPA_PLATFORM=xcb
$ python3 ppocr-picture.py --det_model ./models/VIM4/ppocr_det_int8.adla --det_library ./libs/libnn_ppocr_det.so --rec_model ./models/VIM4/ppocr_rec_int16.adla --rec_library ./libs/libnn_ppocr_rec.so --picture ./data/test.png

You can try it and see whether it reports an error.

About the demo detecting a single letter badly, I have asked the engineers to try to improve it.

About YOLO, we do not have a demo now. I mean that you can try a YOLO model yourself to detect single letters.

About the dataset: yes, you need to prepare 200-500 testing images, preferably from the actual usage scene. Write the paths in dataset.txt as the convert tool expects, and remember to change the iterations parameter to the number of images.

Hi Louis,

I have downloaded the deb file and done the necessary installation and code modification. I am able to run the code without a virtual environment on the VIM4.

However, I am still facing the issues below:

  1. It works properly with words (not an issue).

  2. It does not work when words are skewed. Is the package able to detect rotated or skewed words? In my case, I just need to detect skewed or rotated letters, not words. I will also need to modify the code to detect text color.

  3. It is unable to detect a single letter. We need to detect letters (straight, skewed, slightly rotated). There are also some parts of the frame that have been changed to red pixels.

Another thing: are you able to show me some sample testing images? Do I need to draw bounding boxes on the images? And could you also show me a sample of the dataset.txt file?

Hello @JietChoo ,

The demo does not include an algorithm for detecting skewed text. For detecting one letter, we are now improving the demo.

For dataset.txt, the format is like this:
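One image path per line; the paths below are just an example:

./data/quant/classroom_0001.jpg
./data/quant/classroom_0002.jpg
./data/quant/classroom_0003.jpg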


The images do not need bounding boxes drawn on them. Use pictures from your actual usage scenario.

Quantization images for the det model:

Quantization images for the rec model:

Thank you for your reply. If that’s the case, how do I improve the demo to detect skewed letters?

Questions:

  1. The filenames for the det model quantization images are not important; I can put whatever I want, right?
  2. Can the det model quantization images be any images? Do they not necessarily have to contain text?
  3. For the rec model quantization images, do I need to write the recognized text in the filename?

Our clients are waiting for our prototype. Their deadline for the prototype is end of Dec 2024 / beginning of Jan 2025. If the Khadas solution is not ready by then, we will provide them with the Windows NVIDIA prototype first. In the meantime, we will keep working on the Khadas VIM4 and hopefully get it done very soon.

They want to deploy this to the market in Q1 next year. We hope to get letter recognition (straight, rotated, skewed) with color recognition at a detection distance of up to 200 cm from the camera. In the end, if Khadas is not ready, we may release with the Windows NVIDIA version first.

Hello @JietChoo ,

The filenames of the pictures are not important; anything is OK. The quantization images are better if they come from the actual usage scenario. Not every picture needs to contain text, but it is better not to have all pictures without any text.

About rotated and skewed text, what angles do you want to detect? Could you provide some images of what you want to detect?

This week we will release a version for detecting one letter. You can check whether it is suitable for you.

And is color recognition necessary? Which color-recognition method do you use?