
Investigation of the algorithm of programs

 

The study of the algorithms of programs written in high-level languages traditionally begins with reconstructing the key structures of the source language - functions, local and global variables, branches, loops, and so on. This makes the disassembled listing much clearer and greatly simplifies its analysis.

 


 

Fifteen years ago, Chris Caspersky's epic work, The Fundamentals of Hacking, was the reference book of every aspiring computer security researcher. However, time passes, and the knowledge published by Chris loses relevance. We tried to update this voluminous work and transfer it from the days of Windows 2000 and Visual Studio 6.0 to the days of Windows 10 and Visual Studio 2017.

Modern disassemblers are quite intelligent and take on the lion's share of the work of recognizing key structures. In particular, IDA Pro successfully copes with identifying standard library functions, local variables addressed through the ESP register, case branches, and more. However, it sometimes gets things wrong and misleads the researcher; moreover, its high price is not always justified. For students learning assembly (and the best way to learn assembly is to disassemble other people's programs), it is hardly affordable.

Of course, IDA is not the only game in town - there are other disassemblers, say, the same DUMPBIN that ships with the SDK. Why not fall back on it at a pinch? If nothing better is at hand, DUMPBIN will certainly do, but in that case you will have to forget about disassembler intelligence and rely solely on your own head.

First we will get acquainted with non-optimizing compilers - analyzing their code is relatively simple and quite accessible even to beginners in programming. Then, having grown accustomed to the disassembler, we will move on to more complex things - optimizing compilers that generate very cunning, confusing, and ornate code.

Put on your favorite music, choose your favorite drink and dive into the depths of disassembly listings.

Function Identification

 

A function (also called a procedure or subroutine) is the basic structural unit of procedural and object-oriented languages, so disassembling code usually begins with identifying the functions and the arguments passed to them.

Strictly speaking, the term "function" is not present in all languages, and even where it is present, its definition varies from language to language. Without going into details, we will understand a function as a separate sequence of commands called from different parts of the program. A function may take one or more arguments or none at all; it may or may not return a result - that is not essential. The key property of a function is that it returns control to the point from which it was called, and its characteristic feature is being called from multiple places in the program (although some functions are called from only one place).

How does the function know where to return control? Obviously, the calling code must first save the return address and pass it, along with the other arguments, to the called function. There are many ways to arrange this: for example, you can place an unconditional jump to the return address at the end of the function; or, before calling the function, you can store the return address in a special variable and then perform an indirect jump through that variable once the function completes. Without dwelling on the strengths and weaknesses of each method, we note that in the vast majority of cases compilers use the special machine instructions CALL and RET, designed specifically for calling functions and returning from them.

The CALL instruction pushes the address of the instruction that follows it onto the top of the stack, and RET pops that address and transfers control to it. The address that the CALL instruction points to is the start address of the function, and the function is closed by a RET instruction (note, however: not every RET marks the end of a function!).

Thus, a function can be recognized in two ways: by the cross-references leading to a CALL machine instruction, and by its epilogue, which ends with a RET instruction. Together, the cross-references and the epilogue allow you to determine the addresses of the beginning and end of the function. Looking ahead a bit, we note that many functions begin with a characteristic sequence of commands called the prologue, which is also suitable for identifying functions. Now let's look at all of these topics in more detail.

Direct function calls

Looking through the disassembled code, we find all the CALL instructions; the contents of their operands are the desired start addresses of functions. The addresses of non-virtual functions called by name are computed at compile time, so in such cases the operand of CALL is an immediate value. Thanks to this, the start addresses of functions can be revealed by simple parsing: search the listing for all occurrences of CALL and record the immediate operands.

Consider the following example (Listing 1 in the materials for the article):

Compile in the usual way:

The compilation result in IDA Pro should look something like this:

.text:00401020 push ebp
.text:00401021 mov ebp, esp
.text:00401023 push ecx
.text:00401024 call sub_401000
.text:00401024 ; Here we caught the call instruction with an immediate operand,
.text:00401024 ; which is the address of the beginning of the function. More precisely, its displacement
.text:00401024 ; in the code segment (in this case, in the .text segment).
.text:00401024 ; Now we can go to the line .text:00401000 and, giving the function
.text:00401024 ; a proper name, replace the operand of the call instruction with
.text:00401024 ; the construct "call offset Name_of_my_function".
.text:00401024 ;
.text:00401029 mov [ebp+var_4], 666h
.text:00401029 ; And here is our familiar number 0x666 assigned to the variable
.text:00401030 call sub_401000
.text:00401030 ; And here is another function call! Referring to the line .text:401000,
.text:00401030 ; we will see that this set of instructions is already defined as a function,
.text:00401030 ; and all that needs to be done is to replace call 401000 with
.text:00401030 ; "call offset Name of my function".
.text:00401030 ;
.text:00401035 xor eax, eax
.text:00401037 mov esp, ebp
.text:00401039 pop ebp
.text:0040103A retn
.text:0040103A ; Here we met a return instruction from a function, but not a fact,
.text:0040103A ; that this is really the end of the function, because a function can have
.text:0040103A ; and multiple exit points. However, look: next to ret is located
.text:0040103A ; start of the next function. Since functions cannot overlap,
.text:0040103A ; it turns out that this ret is the end of the function!
.text:0040103A sub_401020 endp
.text:0040103B sub_40103B proc near ; DATA XREF: .rdata:0040D11C↓o
.text:0040103B push esi
.text:0040103C push 1
...

Judging by the addresses, "our function" in the listing is located above the main function:

.text:00401000 push ebp
.text:00401000 ; This string is referenced by the operands of several call instructions.
.text:00401000 ; Therefore, this is the address of the start of "our function".
.text:00401001 mov ebp, esp ; <-
.text:00401003 push ecx ; <-
.text:00401004 mov eax, [ebp+var_4] ; <-
.text:00401007 add eax, 1 ; <- body of "our function"
.text:0040100A mov [ebp+var_4], eax ; <-
.text:0040100D mov esp, ebp ; <-
.text:0040100F pop ebp ; <-
.text:00401010 retn ; <-

As you can see, it's very simple.

Pointer-based function calls

 

However, the task becomes much more complicated if the programmer (or compiler) calls functions indirectly, passing the address in a register and computing it (the address, not the register!) dynamically at run time. This is how, in particular, work with virtual functions is implemented. In any case, the compiler must somehow save the function's address in the code - which means it can be found and calculated! It is even simpler to load the application under study into a debugger, set a breakpoint on the CALL instruction in question, and, once the debugger pops up, see which address receives control. Consider the following example (Listing 2):

int func(){
return 0;
}


int main(){
int(*a)();
a = func;
a();
}

The result of its compilation should generally look like this (main function):

.text:00401010 push ebp
.text:00401011 mov ebp, esp
.text:00401013 push ecx
.text:00401014 mov [ebp+var_4], offset sub_401000
.text:0040101B call [ebp+var_4]
.text:0040101B; Here is a CALL statement making an indirect function call
.text:0040101B; at the address contained in cell [ebp+var_4].
.text:0040101B; How can we find out what is in there? Let's raise our eyes upward
.text:0040101B; and find: mov [ebp+var_4], offset sub_401000. Aha!
.text:0040101B; This means that control is transferred to offset sub_401000,
.text:0040101B; which is the start address of the function! All that remains is to
.text:0040101B; give the function a meaningful name.
.text:0040101E xor eax, eax
.text:00401020 mov esp, ebp
.text:00401022 pop ebp
.text:00401023 retn

 

Calling a Pointer Function with a Complex Target Address Calculation

 

Some programs, though rather few, also make indirect function calls with a complex calculation of the target address. Consider the following example (Listing 3):

int func_1(){
return 0;
}


int func_2(){
return 0;
}


int func_3(){
return 0;
}


int main(){
int x;
int a[3] = {(int) func_1, (int) func_2, (int) func_3};
int (*f)();


for (x=0;x < 3;x++){
f = (int (*)()) a[x];
f();
}
}

The result of disassembling this code should generally look like this:

.text:00401030 push ebp
.text:00401031 mov ebp, esp
.text:00401033 sub esp, 18h
.text:00401036 mov eax, ___security_cookie
.text:0040103B xor eax, ebp
.text:0040103D mov [ebp+var_4], eax
.text:00401040 mov [ebp+var_10], offset sub_401000
.text:00401047 mov [ebp+var_C], offset sub_401010
.text:0040104E mov [ebp+var_8], offset sub_401020
.text:00401055 mov [ebp+var_14], 0
.text:0040105C jmp short loc_401067
.text:0040105E ; --------------------------------------
.text:0040105E
.text:0040105E loc_40105E: ; CODE XREF: sub_401030+4A↓j
.text:0040105E mov eax, [ebp+var_14]
.text:00401061 add eax, 1
.text:00401064 mov [ebp+var_14], eax
.text:00401067
.text:00401067 loc_401067: ; CODE XREF: sub_401030+2C↑j
.text:00401067 cmp [ebp+var_14], 3
.text:0040106B jge short loc_40107C
.text:0040106D mov ecx, [ebp+var_14]
.text:00401070 mov edx, [ebp+ecx*4+var_10]
.text:00401074 mov [ebp+var_18], edx
.text:00401077 call [ebp+var_18]
.text:0040107A jmp short loc_40105E
.text:0040107C ; --------------------------------------
.text:0040107C
.text:0040107C loc_40107C: ; CODE XREF: sub_401030+3B↑j
.text:0040107C xor eax, eax
.text:0040107E mov ecx, [ebp+var_4]
.text:00401081 xor ecx, ebp
.text:00401083 call @__security_check_cookie@4 ; __security_check_cookie(x)
.text:00401088 mov esp, ebp
.text:0040108A pop ebp
.text:0040108B retn

In the line call [ebp+var_18] there is an indirect function call. And what do we have in [ebp+var_18]? Look one line up: the value of EDX. And what is EDX itself equal to? Scroll another line up: EDX equals the contents of the cell [ebp+ecx*4+var_10]. There's the rub! Not only do we need to find out the contents of this cell, we also have to calculate its address!

And what is in ECX? The contents of [ebp+var_14]. And what is that equal to? "We'll find out in a moment...", we mutter under our breath, scrolling the disassembler window up. Aha, found it: at line 0x401064 it is loaded with the contents of EAX! What joy! And how long are we going to wander around the code like this?

Of course, by spending an indefinite amount of time, effort, and invigorating drink, it is possible to reconstruct the entire key algorithm (especially since we have almost reached the end of the analysis), but where is the guarantee that no mistakes will be made along the way?

It is much faster and more reliable to load the test program into a debugger, set a breakpoint on the line .text:00401077 and, when the debugger window pops up, see what we have in the cell [ebp+var_18]. The debugger will pop up three times, each time showing a new address! Note that in a disassembler this fact can be established only after a complete reconstruction of the algorithm.

However, harbor no illusions about the power of the debugger. A program may call the same function a thousand times and, on the thousand-and-first call, invoke a completely different one. The debugger is powerless to detect this: the call to such a function may occur at an unpredictable moment - say, under a certain combination of timing, data processed by the program, and the current phase of the moon. Are we really going to spend ages chasing the program under the debugger?

The disassembler is another matter. A complete reconstruction of the algorithm allows you to unambiguously and reliably track down all the targets of indirect calls. That is why the disassembler and the debugger should be pulling in the same harness!

Finally, let's take another look at the beginning of the disassembled listing, using IDA's tools to see what is loaded into the memory cells [ebp+...]. These are simply the addresses of our three functions, placed by the compiler one after another.

 

"Manual" call to a function by a JMP statement

The most severe case is a "manual" call of a function with the JMP command, preceded by pushing the return address onto the stack. In general, a call via JMP looks like this: PUSH ret_addr/JMP func_addr, where ret_addr and func_addr are the (direct or indirect) return address and function start address, respectively. By the way, note that the PUSH and JMP instructions do not always follow one another and are sometimes separated by other commands.

A reasonable question arises: what is wrong with CALL, and why resort to JMP? The point is that after control returns from a function called by CALL, execution always resumes at the command following the CALL. In some cases (for example, with structured exception handling), it may be desirable to continue execution after the return not at the command following the CALL, but at an entirely different branch of the program. Then you have to push the required return address manually and call the child function via JMP.

Identifying such functions is very difficult: contextual search gives nothing, since there are very, very many JMP commands used for local jumps in the body of any program - try analyzing them all! If you don't, two functions fall out of sight at once: the called function and the function to which control is transferred after the return. Unfortunately, there is no quick fix for this problem. The only clue is that the calling JMP almost always crosses the boundary of the function in whose body it sits, and the function's boundaries can be determined by its epilogue.

Consider the following example (Listing 4):

int funct(){
return 0;
}


int main(){
__asm{
LEA ESI, return_addr
PUSH ESI
JMP funct
return_addr:
}
}

The result of its compilation in the general case should look like this:

Look: a seemingly trivial unconditional jump - what of it? Oh no! It is not a simple jump, it is a disguised function call! How do we know? Let's go to offset sub_401000 and see:

Where do you think this retn returns control? Naturally, to the address on top of the stack. And what do we have on the stack? The EBP pushed by PUSH EBP at line 401000 was popped back out by the POP instruction at line 401005. Go back to the point of the unconditional jump and begin slowly scrolling the disassembler window up, tracking every stack access. Aha, gotcha! The PUSH ESI instruction at line 40101A pushes the contents of the ESI register onto the stack, and ESI itself holds the value loc_401020 - the start address of the function called by the JMP command (strictly speaking, not an address but an offset, but that is not fundamentally important).

 

Automatic function identification with IDA Pro

 

The IDA Pro disassembler can analyze the operands of CALL instructions, which allows it to automatically split a program into functions. Moreover, IDA copes quite successfully with most indirect calls. However, even modern versions of the disassembler succeed only intermittently with complex and "manual" function calls via JMP.

 

IDA successfully recognized the "manual" call to the function

Prologue

Most non-optimizing compilers put the following code, called a prologue, at the beginning of a function:

push ebp ; save the old frame pointer
mov ebp, esp ; open a new stack frame
sub esp, xx ; reserve memory for local variables

In general terms, the purpose of the prologue is as follows: if the EBP register is used to address local variables (as is often the case), then before use it must be saved on the stack (otherwise the called function would wreak havoc on its parent); then the current value of the stack pointer (ESP) is copied into EBP - the so-called opening of the stack frame - and ESP is decreased by the size of the memory allocated for local variables.

The sequence PUSH EBP/MOV EBP,ESP/SUB ESP,xx can serve as a good signature for finding all the functions in the file under investigation, including those that are not directly referenced. This technique is used, in particular, by IDA Pro. However, optimizing compilers can address local variables through the ESP register and use EBP like any other general-purpose register. The prologue of an optimized function then consists of the single command SUB ESP, xxx - a sequence too short to serve as a function signature, alas. A more detailed account of the epilogues of functions awaits us ahead.

Epilogue

 

Before returning, the function closes the stack frame, moving the stack-top pointer "down" and restoring EBP to its previous value (unless the optimizing compiler addressed local variables through ESP, using EBP as an ordinary general-purpose register). The epilogue can look one of two ways: either ESP is increased by the required value with an ADD command, or the value of EBP, which points to the bottom of the stack frame, is copied into it.

The generalized epilogue code of the function looks like this. Epilogue 1:

pop ebp ; restore the saved EBP
add esp, xxx ; close the stack frame
retn

Epilogue 2:

mov esp, ebp ; close the stack frame
pop ebp ; restore the saved EBP
retn

Important to note: the instructions of the POP EBP/ADD ESP, xxx and MOV ESP,EBP/POP EBP pairs may be separated by other commands - they do not have to follow one another back to back. Therefore, a contextual search will not do for finding epilogues; a mask search is needed.

If a function follows the PASCAL convention, it must clear the stack of arguments on its own. In the vast majority of cases this is done with the RET n instruction, where n is the number of bytes to remove from the stack after the return. Functions following the C convention leave stack cleanup to the calling code and always end with a plain RET. The Windows API functions are a combination of the C and PASCAL conventions: arguments are pushed onto the stack from right to left, but the function itself clears the stack.

 

Thus, RET may be a sufficient sign of an epilogue, but not every epilogue is the end of a function. If a function has several return statements in its body (as often happens), the compiler generally generates a separate epilogue for each of them. Pay attention to whether a new prologue follows the epilogue or whether the code of the old function continues. Also remember that compilers usually (but not always!) do not place code that never receives control into the executable file. In other words, the function will have only one epilogue, and everything after the first return will be discarded as unreachable.

 

Meanwhile, let's not run ahead of the locomotive. Compile the following example (Listing 5) with the default parameters:

 

int func(int a){
return a++;
a=1/a;
return a;
}


int main(){
func(1);
}

The compiled result will look like this (only the code of the func function is shown):

.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 push ecx
.text:00401004 mov eax, [ebp+arg_0]
.text:00401007 mov [ebp+var_4], eax
.text:0040100A mov ecx, [ebp+arg_0]
.text:0040100D add ecx, 1 ; We perform addition
.text:00401010 mov [ebp+arg_0], ecx
.text:00401013 mov eax, [ebp+var_4]
.text:00401016 jmp short loc_401027 ; Making an unconditional jump
.text:00401016; to the function epilogue
.text:00401018 ; --------------------------------------
.text:00401018 mov eax, 1
.text:0040101D cdq
.text:0040101E idiv [ebp+arg_0] ; Unit division code per parameter
.text:00401021 mov [ebp+arg_0], eax ; remained
.text:00401024 mov eax, [ebp+arg_0] ; The compiler did not consider it necessary to remove it
.text:00401027
.text:00401027 loc_401027: ; CODE XREF: sub_401000+16↑j
.text:00401027 mov esp, ebp ; There is only one epilogue
.text:00401029 pop ebp;
.text:0040102A retn

Now let's see what code the compiler generates for an early exit from a function when some condition is triggered (Listing 6):

int func(int a){
if (a != 0)
return a++;
return 1/a;
}


int main(){
func(1);
}

Compilation result (only func):

.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 push ecx
.text:00401004 cmp [ebp+arg_0], 0 ; Compare function argument to zero
.text:00401008 jz short loc_40101E ; If they are equal, go to the label and
.text:00401008 ; execute the division command
.text:0040100A mov eax, [ebp+arg_0] ; If
.text:0040100D mov [ebp+var_4], eax ; not equal,
.text:00401010 mov ecx, [ebp+arg_0] ; then we execute
.text:00401013 add ecx, 1 ; increment
.text:00401016 mov [ebp+arg_0], ecx
.text:00401019 mov eax, [ebp+var_4]
.text:0040101C jmp short loc_401027
.text:0040101E ; --------------------------------------
.text:0040101E
.text:0040101E loc_40101E: ; CODE XREF: sub_401000+8↑j
.text:0040101E mov eax, 1
.text:00401023 cdq
.text:00401024 idiv [ebp+arg_0] ; Divide 1 by argument
.text:00401027
.text:00401027 loc_401027: ; CODE XREF: sub_401000+1C↑j
.text:00401027 mov esp, ebp ; <-- This is clearly an epilogue
.text:00401029 pop ebp ; <--
.text:0040102A retn ; <--

As in the previous case, the compiler created only one epilogue. Note: at the beginning of the function, at line 00401004, the argument is compared with zero. If the condition is met, execution jumps to the label loc_40101E, where the division is performed, followed immediately by the epilogue. If the condition at line 00401004 is not met, the increment is performed, followed by an unconditional jump to the epilogue.

Special Notice

 

Starting with the 80286 processor, the instruction set includes two instructions, ENTER and LEAVE, designed specifically for opening and closing a stack frame. However, modern compilers almost never use them. Why?

The reason is that ENTER and LEAVE are very slow - much slower than PUSH EBP/MOV EBP,ESP/SUB ESP,xxx and MOV ESP,EBP/POP EBP. On the good old Pentium, ENTER executes in ten clock cycles, while the equivalent sequence of commands executes in seven. Similarly, LEAVE requires five cycles, although the same operation can be done in two (and even faster if the MOV ESP,EBP/POP EBP pair is separated by some other instruction).

 

Therefore, a modern researcher will hardly ever encounter either ENTER or LEAVE. Still, it won't hurt to remember their purpose - what if you suddenly have to disassemble an ancient program, or one written in assembler (it is no secret that many assembler writers know the processor's subtleties rather poorly, and their "manual optimization" is noticeably inferior to a compiler's in performance)?

"Naked" functions

 

The Microsoft Visual C++ compiler supports the non-standard qualifier naked, which allows programmers to create functions without a prologue or epilogue. The compiler does not even place a RET at the end of such a function - you have to do that "by hand" with the assembly insert __asm{ret} (using return does not produce the desired result).

Actually, support for naked functions was conceived exclusively for writing drivers in pure C (with a small admixture of assembly inclusions), but it found unexpected recognition among developers of protection mechanisms. Indeed, it is nice to be able to create functions "by hand" without worrying that the compiler will mangle them in some unpredictable way.

For us code diggers, to a first approximation this means that the program may contain one or more functions with neither prologue nor epilogue. So what? Optimizing compilers also throw out the prologue and leave only a RET from the epilogue; such functions are still easily identified by the CALL instructions that invoke them.

Identification of inline functions

 

The most effective way to get rid of the overhead of calling a function is not to call it. Indeed, why not embed the function's code directly into the caller? Of course, this noticeably increases the program's size (the more noticeably, the larger the function being inlined), but it significantly speeds up execution (the more significantly, the more often the inlined function is called).

What is bad about function inlining from the researcher's point of view? First of all, it bloats the parent function and makes its code less clear: instead of CALL\TEST EAX,EAX\JZ xxx with a conspicuous conditional jump, we see a heap of unremarkable instructions whose logic still has to be figured out.

Inlined functions have neither their own prologue nor epilogue; their code and local variables (if any) are completely implanted into the calling function, so the compilation result looks exactly as if no function had been called at all. The only clue: inlining inevitably duplicates the function's code at every call site, and this, although with difficulty, can be detected. Difficult, because the inlined function, becoming part of the caller, is optimized in the caller's context, which leads to significant variations in the code.

Consider the following example to see how the compiler treats an inline function (Listing 7):

#include <stdio.h>

inline int max(int a, int b){
if(a > b)
return a;
return b;
}

int main(int argc, char **argv){
printf("%x\n",max(0x666,0x777));
printf("%x\n",max(0x666,argc));
printf("%x\n",max(0x666,argc));
return 0;
}

The result of compiling this code will be as follows (main function):

.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 push 777h; Preparing two
.text:00401008 push 666h; function arguments
.text:0040100D call sub_401070; Calling the comparison function max
.text:00401012 add esp, 8
.text:00401015 push eax
.text:00401016 push offset unk_412160 ; Add parameter %x
.text:0040101B call sub_4010D0; call printf
.text:00401020 add esp, 8
.text:00401023 mov eax, [ebp+arg_0] ; We take the argument argc - the number of arguments
.text:00401026 push eax
.text:00401027 push 666h; We push the constant
.text:0040102C call sub_401070; Calling the max function
.text:00401031 add esp, 8
.text:00401034 push eax
.text:00401035 push offset unk_412164 ; Add parameter '%x'
.text:0040103A call sub_4010D0; call printf
.text:0040103F add esp, 8
.text:00401042 mov ecx, [ebp+arg_0] ; <--
.text:00401045 push ecx ; <--
.text:00401046 push 666h; <--
.text:0040104B call sub_401070; <-- Similar
.text:00401050 add esp, 8; <-- sequence
.text:00401053 push eax ; <-- actions
.text:00401054 push offset unk_412168 ; <--
.text:00401059 call sub_4010D0; <--
.text:0040105E add esp, 8
.text:00401061 xor eax, eax
.text:00401063 pop ebp
.text:00401064 retn

"Well, well," we whisper to ourselves. What has it compiled here? It turned our inline function into an ordinary one! There's the thing: the compiler simply ignored our wish to make the function inline (even though we wrote the inline modifier). Compiler options do not fix the situation either: /Od disables optimization, /Oi enables intrinsic functions. At this rate, the compiler will soon generate code to suit its own preferences rather than those of the programmer using it! The rest can be seen from the comments in the disassembled listing.

In disassembled form, the comparison function max looks like this:

.text:00401070 push ebp
.text:00401071 mov ebp, esp
.text:00401073 mov eax, [ebp+arg_0]
.text:00401076 cmp eax, [ebp+arg_4] ; Compare and depending
.text:00401079 jle short loc_401080 ; from the result - go
.text:0040107B mov eax, [ebp+arg_0]
.text:0040107E jmp short loc_401083 ; Unconditional transition to the epilogue
.text:00401080 loc_401080: ; CODE XREF: sub_401070+9↑j
.text:00401080 mov eax, [ebp+arg_4]
.text:00401083 loc_401083: ; CODE XREF: sub_401070+E↑j
.text:00401083 pop ebp
.text:00401084 retn

Here, too, all the important fragments are commented.

Finally, I propose to compile and examine the following example (Listing 8). It is a bit more complicated than the previous one: it takes a command-line argument as one of the values for comparison, converting it from a string to a number and back again on output.

#include <iostream>
#include <string>
#include <sstream>

using namespace std;


inline string max(int a, int b){ // Inline function for finding the maximum
int val = (a > b) ? a : b;
stringstream stream;
stream << "0x" << hex << val; // Convert the value to a hex string
string res = stream.str();
return res;
}


int main(int argc, char **argv){
cout << max(0x666, 0x777) << endl;
string par = argv[1];
int val;
if (par.substr(0, 2) == "0x") // If the parameter begins with '0x',
val = stoi(argv[1], nullptr, 16); // then it is a hex number,
else
val = stoi(argv[1], nullptr, 10); // otherwise - a dec number
cout << max(0x666, val) << endl;
cout << max(0x666, val) << endl;
return 0;
}

 

VS Code

Right at the start of execution, the program calls the inline function, passing it two hexadecimal numbers. The function returns the larger of them, converted to hexadecimal format, and the main function then prints it to the console.

The next step is to take a command-line parameter. The program distinguishes numbers in two formats, decimal and hexadecimal, telling them apart by the absence or presence of the 0x prefix. The next two statements are identical: they call the max function with the same parameters both times - 0x666 and the command-line parameter converted from a string to a number. These two consecutive calls, as last time, will let us trace the function calls.

Along with the added functionality, the disassembled listing has grown accordingly. Nevertheless, the essence of what is happening has not changed. So as not to reproduce it here (it takes up a lot of space), I suggest you work through it yourself.

Conclusion

The topic of identifying key structures is very important, if only because modern programming languages contain a great many of them. And in today's article we have only begun looking at functions. Besides the kinds covered above (ordinary, naked, inline) and the ways of calling them (directly, by pointer, with complex address calculation), there are also virtual and library functions. And functions also include constructors and destructors. But let's not get ahead of ourselves.

Before moving on to object methods and static and virtual functions, you need to learn to identify start-up functions, which can occupy a significant part of the disassembled listing but (with few exceptions) do not need to be analyzed. So, dear friend, write in the comments what you think about the topic of identification and which constructs you would like to see analyzed.